Benchmarking

Defining and Setting the Bar for Human Flourishing

πŸ“ Benchmarks and Standards at Gloo AI

Gloo AI is committed to building transparent, value-aligned, and rigorous evaluation systems for large language models. Our benchmarks are not just technical: they are human, faith-aware, and centered on real-world use in communities.


✅ What We’re Building

We’re developing a shared benchmark framework that is:

  • Open: Designed to invite contribution from the wider faith-tech and AI ethics community
  • Collaborative: Built alongside values-aligned partners, not just for internal use
  • Multidimensional: Goes beyond accuracy to measure value alignment, relevance, and judgment

🧠 What We Evaluate

Each model is scored against a broad range of criteria across three primary dimensions, sketched in code after the list:

  1. Objective – factual correctness, reasoning quality
  2. Subjective – tone, empathy, value alignment
  3. Tangential – off-topic drift, misunderstanding, or indirect errors
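
To make those dimensions concrete, here is one way a per-prompt evaluation record could be structured. This is a minimal sketch: the three dimension names come from the list above, while the class names, fields, and averaging helper are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class DimensionScore:
    criterion: str   # e.g. "factual correctness" or "empathy"
    score: float     # normalized to 0.0-1.0
    rationale: str   # free-text justification from the evaluator

@dataclass
class ModelEvaluation:
    """One model's scores on one prompt, grouped by the three dimensions."""
    model_name: str
    prompt_id: str
    objective: list[DimensionScore] = field(default_factory=list)   # facts, reasoning
    subjective: list[DimensionScore] = field(default_factory=list)  # tone, empathy, values
    tangential: list[DimensionScore] = field(default_factory=list)  # drift, indirect errors

    def dimension_mean(self, dimension: str) -> float:
        """Average score for "objective", "subjective", or "tangential"."""
        scores = getattr(self, dimension)
        return sum(s.score for s in scores) / len(scores) if scores else 0.0
```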

πŸ” Our Testing Approach

  • More than 3,000 curated evaluation prompts mapped to 7 dimensions of human flourishing
  • Faith-specific QA and worldview-sensitive questions sourced from real communities
  • Evaluation by both LLMs and humans, with checks for model self-awareness and consistency
  • Support for comparing open-source and proprietary models side by side

We run evaluations across leading models, including Gemini, DeepSeek, Mistral, and Grok, to identify strengths, failure modes, and opportunities for alignment.
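
A drastically simplified version of that side-by-side loop might look like the sketch below. The model identifiers, the JSONL prompt format, and the `generate` and `judge_response` functions are illustrative assumptions; the production harness and rubric are richer than this.

```python
import json

MODELS = ["gemini", "deepseek", "mistral", "grok"]  # identifiers are illustrative

def load_prompts(path: str) -> list[dict]:
    """Load JSONL prompts, each tagged with one of the 7 flourishing dimensions."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def judge_response(prompt: dict, response: str) -> dict:
    """Scoring step (LLM judges plus human review). Expected to return, e.g.,
    {"objective": 0.9, "subjective": 0.8, "tangential": 1.0}."""
    raise NotImplementedError("wire in your evaluator here")

def run_benchmark(prompts: list[dict], generate) -> list[dict]:
    """Run every prompt against every model; `generate(model, text)` is any
    client call that returns the model's reply as a string."""
    results = []
    for prompt in prompts:
        for model in MODELS:
            reply = generate(model, prompt["text"])
            scores = judge_response(prompt, reply)
            results.append({"model": model,
                            "flourishing_dimension": prompt["dimension"],
                            **scores})
    return results
```

Keeping the flourishing dimension on each record makes it easy to slice results by both model and dimension afterward.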


📊 Results That Matter

Our benchmark results are designed to:

  • Inform partners choosing models for high-trust environments
  • Guide internal training and fine-tuning decisions
  • Track changes over time as the AI landscape evolves
  • Provide transparent model performance reports that build trust

🧰 Tools You Can Use

We’re also building tools for:

  • Automating model evaluation at scale
  • Embedding benchmarks into CI/CD pipelines for ML (see the gate sketch after this list)
  • Visualizing results across multiple dimensions and LLM types
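
To illustrate the CI/CD point above, a benchmark can run as a gate script that fails the build when a candidate model's scores regress below a floor. The `benchmark_harness` module, `generate` client, threshold values, and prompt file name are hypothetical stand-ins, not Gloo's actual tooling.

```python
import statistics
import sys

# Hypothetical module wrapping the evaluation loop sketched earlier.
from benchmark_harness import generate, load_prompts, run_benchmark

# Placeholder per-dimension score floors, not published Gloo bars.
THRESHOLDS = {"objective": 0.85, "subjective": 0.75, "tangential": 0.80}

def main() -> int:
    results = run_benchmark(load_prompts("prompts.jsonl"), generate)
    for dimension, floor in THRESHOLDS.items():
        mean = statistics.mean(r[dimension] for r in results)
        if mean < floor:
            print(f"FAIL: {dimension} mean {mean:.2f} is below the {floor} bar")
            return 1
        print(f"ok: {dimension} mean {mean:.2f}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

A CI job runs this script on each model or training change and treats a nonzero exit code as a failed build.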
📣

We're not just measuring what models can say; we're measuring what they should say, and how well they serve the communities we care about.