Benchmarking
Defining and Setting the Bar for Human Flourishing
Benchmarks and Standards at Gloo AI
Gloo AI is committed to building transparent, value-aligned, and rigorous evaluation systems for large language models. Our benchmarks are not just technical; they are human, faith-aware, and centered on real-world use in communities.
What We're Building
We're developing a shared benchmark framework that is:
- Open: Designed to invite contribution from the wider faith-tech and AI ethics community
- Collaborative: Built alongside values-aligned partners, not just for internal use
- Multidimensional: Goes beyond accuracy to measure value alignment, relevance, and judgment
What We Evaluate
Each model is scored against a broad range of criteria across three primary dimensions:
- Objective: factual correctness, reasoning quality
- Subjective: tone, empathy, value alignment
- Tangential: off-topic drift, misunderstanding, or indirect errors
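As a rough illustration of how one response might be scored along these three dimensions, here is a minimal Python sketch. The `DimensionScores` class, the 0-to-1 scale, the convention that a higher tangential score means less drift, and the equal default weights are all assumptions made for illustration; they are not Gloo AI's published schema or weighting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionScores:
    """Hypothetical per-response scores, each on an assumed 0.0-1.0 scale."""

    objective: float   # factual correctness, reasoning quality
    subjective: float  # tone, empathy, value alignment
    tangential: float  # assumed: 1.0 = stays on topic, 0.0 = heavy drift

    def __post_init__(self):
        # Reject values outside the assumed scale so bad inputs fail early.
        for name, value in vars(self).items():
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")

    def overall(self, weights=(1.0, 1.0, 1.0)):
        """Weighted mean of the three dimensions (equal weights are an assumption)."""
        values = (self.objective, self.subjective, self.tangential)
        return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

Keeping the three scores separate, and only combining them on request, makes it possible to report a model that is factually strong but weak on tone without the single number hiding the difference.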
Our Testing Approach
- 3,000+ curated evaluation prompts mapped to 7 dimensions of human flourishing
- Faith-specific QA and worldview-sensitive questions sourced from real communities
- Evaluated by LLMs and humans, with checks for model self-awareness and consistency
- Support for comparing open-source and proprietary models side-by-side
We run evaluations across top models like Gemini, DeepSeek, Mistral, Grok, and others to identify strengths, failure modes, and opportunities for alignment.
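A side-by-side comparison of this kind could be produced by averaging scores per model within each dimension. The sketch below assumes evaluation results arrive as `(model, dimension, score)` tuples; the record shape, the `compare_models` helper, and the placeholder model and dimension names are hypothetical, and the numbers are made up for the example.

```python
from collections import defaultdict
from statistics import mean


def compare_models(records):
    """Return {dimension: {model: mean score}} from (model, dimension, score) records."""
    grouped = defaultdict(lambda: defaultdict(list))
    for model, dimension, score in records:
        grouped[dimension][model].append(score)
    return {
        dimension: {model: mean(scores) for model, scores in by_model.items()}
        for dimension, by_model in grouped.items()
    }


# Made-up records for two placeholder models and two placeholder dimensions.
records = [
    ("model_a", "dimension_1", 0.80),
    ("model_a", "dimension_1", 0.60),
    ("model_b", "dimension_1", 0.90),
    ("model_a", "dimension_2", 0.50),
    ("model_b", "dimension_2", 0.70),
]
table = compare_models(records)
```

Grouping by dimension first puts each model's score for the same dimension next to the others, which is the layout a side-by-side report needs.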
Results That Matter
Our benchmark results are designed to:
- Inform partners choosing models for high-trust environments
- Guide internal training and fine-tuning decisions
- Track changes over time as the AI landscape evolves
- Provide transparent model performance reports for trust-building
Tools You Can Use
We're also building tools for:
- Automating model evaluation at scale
- Embedding benchmarks into CI/CD pipelines for ML
- Visualizing results across multiple dimensions and LLM types
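One way a benchmark could be embedded in a CI/CD pipeline is as a regression gate that compares the current run against a stored baseline. The sketch below is an assumption about how such a check might look, not a description of Gloo AI's tooling; `find_regressions`, `gate`, the dictionary shape, and the 0.02 tolerance are hypothetical.

```python
def find_regressions(current, baseline, tolerance=0.02):
    """Return dimensions whose score is missing or dropped more than `tolerance`."""
    failures = []
    for dimension, base_score in sorted(baseline.items()):
        score = current.get(dimension)
        if score is None or score < base_score - tolerance:
            failures.append(dimension)
    return failures


def gate(current, baseline, tolerance=0.02):
    """Return a process exit code: 0 if no regressions, 1 otherwise."""
    failures = find_regressions(current, baseline, tolerance)
    for dimension in failures:
        print(
            f"regression in {dimension}: "
            f"{current.get(dimension)} vs baseline {baseline[dimension]}"
        )
    return 1 if failures else 0
```

A pipeline step could load both score dictionaries from wherever the evaluation run stores them and call `sys.exit(gate(current, baseline))`, so that a drop on any dimension fails the build.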
We're not just measuring what models can say; we're measuring what they should say, and how well they serve the communities we care about.