Independent AI Benchmarks

Models, skills, and tasks tested head-to-head. Deterministic rules. Ensemble judges. No vendor bias.

3 Models Tested
4 Skills
12 Total Runs

Latest Evals

Frontier Model Evals

Claude Opus 4.6 scores highest at 0.825

Three frontier models, four skills, ensemble judges. Claude Opus led at 0.825 but cost $1.03. GPT-5.4 scored 0.814 for $0.48. Gemini was cheapest at $0.31 but the only model to miss a pass threshold. The gap between first and third: 0.057 points.

Read the full report →

How Scoring Works

1. Generate. Each model gets the same prompt (skill instructions plus test input) and produces its complete output in one shot. Temperature 0.0.
2. Rules (60%). Deterministic pattern matching checks whether the output contains required elements. Binary, reproducible, free to run.
3. Ensemble Judge (40%). Three frontier models independently grade each output on relevance, completeness, and accuracy. Provider recusal prevents self-grading bias. Median score wins. (A sketch of how the two parts combine follows this list.)
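
To make the 60/40 weighting concrete, here is a minimal Python sketch of how a composite score could be assembled from steps 2 and 3. The function names, regex patterns, provider keys, and example grades are illustrative assumptions, not the benchmark's actual implementation.

```python
import re
import statistics

RULES_WEIGHT = 0.6   # deterministic-rules share of the final score
JUDGE_WEIGHT = 0.4   # ensemble-judge share of the final score


def rules_score(output: str, required_patterns: list[str]) -> float:
    """Fraction of required regex patterns found in the output.
    Each check is binary pass/fail, so the result is reproducible and free to run."""
    if not required_patterns:
        return 0.0
    hits = sum(1 for pattern in required_patterns if re.search(pattern, output))
    return hits / len(required_patterns)


def judge_score(judge_grades: dict[str, float], model_provider: str) -> float:
    """Median of the judges' grades, after dropping the judge from the same
    provider as the model under test (provider recusal)."""
    eligible = [grade for provider, grade in judge_grades.items()
                if provider != model_provider]
    return statistics.median(eligible)


def composite_score(output: str, required_patterns: list[str],
                    judge_grades: dict[str, float], model_provider: str) -> float:
    """Weighted blend: 60% deterministic rules, 40% ensemble judge."""
    return (RULES_WEIGHT * rules_score(output, required_patterns)
            + JUDGE_WEIGHT * judge_score(judge_grades, model_provider))


# Example run with made-up numbers: two required patterns, three judge grades,
# and the same-provider judge recused because the tested model shares its provider.
score = composite_score(
    output="## Summary\nThe refactor removes the N+1 query.",
    required_patterns=[r"## Summary", r"N\+1"],
    judge_grades={"anthropic": 0.90, "openai": 0.80, "google": 0.70},
    model_provider="anthropic",
)
print(f"composite score: {score:.3f}")   # 0.6*1.0 + 0.4*0.75 = 0.900
```

In this sketch the rules portion is a simple hit rate over required patterns, and the judge portion is the median of the non-recused grades, so a single outlier judge cannot swing the result.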

Stay ahead of the AI curve

New benchmarks, model insights, and AI engineering deep dives.