About EvalRig
EvalRig is an independent AI benchmarking tool. It tests LLM outputs against real enterprise tasks using deterministic rules and cross-validated ensemble judges.
Why It Exists
Most AI benchmarks test general knowledge or reasoning puzzles. Enterprise teams don't need models that solve math olympiad problems. They need models that triage support tickets, draft compliant emails, and analyze churn data. EvalRig tests those actual skills.
The goal: help teams pick the right model for the right task at the right price. Not the most expensive model. Not the model with the best marketing. The one that actually performs on their specific workflow.
How It Works
Skills
Each benchmark tests 4 real department workflows, each defined in a skill.md file with structured inputs, evaluation criteria, and expected output patterns.
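A minimal sketch of what one such file might look like, using the support-ticket triage example from above. The section names and fields here are illustrative assumptions, not EvalRig's actual schema:

```markdown
# Skill: Triage Support Ticket   <!-- hypothetical skill, illustrative schema -->

## Input
Raw ticket text plus customer tier and product area.

## Evaluation Criteria
- Assigns exactly one priority: P1, P2, or P3
- Names the owning team
- Summarizes the issue in under 50 words

## Expected Output Pattern
Priority: P[1-3]
Team: <team name>
Summary: <one paragraph>
```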
Models
9 models tested per benchmark: 3 tiers (Flagship, Mid, Light) across 3 providers (Anthropic, OpenAI, Google). Same prompt, same input, temperature 0.0.
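A sketch of how that 3x3 matrix might look in a harness. The model identifiers are placeholders, not the versions EvalRig pins, and `generate` stands in for whichever provider SDKs the harness wraps:

```python
# Hypothetical model matrix: 3 tiers x 3 providers = 9 models per benchmark.
# Identifiers are placeholders, not the pinned versions EvalRig actually runs.
MODEL_MATRIX = {
    "Anthropic": {"Flagship": "claude-flagship", "Mid": "claude-mid", "Light": "claude-light"},
    "OpenAI":    {"Flagship": "gpt-flagship",    "Mid": "gpt-mid",    "Light": "gpt-light"},
    "Google":    {"Flagship": "gemini-flagship", "Mid": "gemini-mid", "Light": "gemini-light"},
}

def run_benchmark(prompt: str, generate) -> dict:
    """Run the same prompt through every model at temperature 0.0.

    `generate` is a hypothetical callable (model_id, prompt, temperature) -> str.
    Holding prompt, input, and temperature constant isolates the model as the
    only variable in the comparison.
    """
    outputs = {}
    for provider, tiers in MODEL_MATRIX.items():
        for tier, model_id in tiers.items():
            outputs[(provider, tier)] = generate(model_id, prompt, temperature=0.0)
    return outputs
```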
Rules (60%)
Deterministic pattern matching validates required output elements. Binary checks, fully reproducible, free to run. No subjective interpretation.
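A minimal sketch of such a check, assuming rules are expressed as regular expressions that each either match or don't. The rule set is invented for illustration:

```python
import re

# Hypothetical rule set for a ticket-triage skill: each required output
# element is a regex, scored pass/fail with no partial credit.
RULES = [
    r"(?m)^Priority: P[1-3]$",   # exactly one priority level
    r"(?m)^Team: \S+",           # an owning team is named
    r"(?m)^Summary: ",           # a summary section exists
]

def rules_score(output: str) -> float:
    """Fraction of deterministic rules satisfied (binary per rule).

    The same output always yields the same score: no API calls, no randomness.
    """
    passed = sum(1 for pattern in RULES if re.search(pattern, output))
    return passed / len(RULES)
```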
Ensemble Judge (40%)
Three frontier models independently score each output. Provider recusal ensures no model judges its own provider's outputs. Median score wins.
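A sketch of the recusal-and-median logic under stated assumptions: `judge_score` is a hypothetical call to a frontier judge model, the judge roster is invented, and the 60/40 weighting combines the two tracks per the section headers above. How EvalRig keeps three scores when a judge is recused is not specified here, so this sketch simply drops the recused judge:

```python
import statistics

# Hypothetical judge roster mapping judge model IDs to their providers.
JUDGES = {
    "anthropic-judge": "Anthropic",
    "openai-judge": "OpenAI",
    "google-judge": "Google",
}

def ensemble_score(output: str, output_provider: str, judge_score) -> float:
    """Median of independent judge scores, with provider recusal.

    `judge_score` is a hypothetical callable (judge_id, output) -> float in
    [0, 1]. A judge whose provider produced the output is recused; the real
    system may substitute another judge, but this sketch just omits it.
    """
    scores = [judge_score(judge_id, output)
              for judge_id, provider in JUDGES.items()
              if provider != output_provider]
    return statistics.median(scores)

def final_score(rules: float, judges: float) -> float:
    """Weighted combination: 60% deterministic rules, 40% ensemble judge."""
    return 0.6 * rules + 0.4 * judges
```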
Benchmark Series
The enterprise department series tests AI models on real workflows mapped to the Menlo VC AI market map. Each post covers one department with 4 skills, 9 models, and a recommendation matrix showing the cheapest model that matches flagship quality. See the full roadmap on the Benchmarks page.
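One way to read "cheapest model that matches flagship quality" as a computation, with invented scores and prices for illustration:

```python
# Hypothetical per-skill results: (model, score, price per 1M output tokens, USD).
# Numbers are invented for illustration, not real benchmark data.
RESULTS = [
    ("flagship-a", 0.92, 15.00),
    ("mid-b",      0.91,  3.00),
    ("light-c",    0.78,  0.60),
]

def recommend(results, tolerance: float = 0.02) -> str:
    """Cheapest model scoring within `tolerance` of the best score.

    The tolerance threshold is an assumption; EvalRig may define
    "matches flagship quality" differently.
    """
    best = max(score for _, score, _ in results)
    qualifying = [(price, model) for model, score, price in results
                  if score >= best - tolerance]
    return min(qualifying)[1]

print(recommend(RESULTS))  # -> "mid-b": near-flagship quality at 1/5 the price
```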
Benchmarks engineered by Mads Srinivasan at Turanu AI.