Independent AI Benchmarks

Models, skills, and tasks tested head-to-head. Deterministic rules. Ensemble judges. No vendor bias.

Model Score Cost Avg Tokens

How Scoring Works

1. Generate. Each model gets the same prompt (skill instructions plus test input) and produces its complete output in one shot, at temperature 0.0.
2. Rules (60%). Deterministic pattern matching checks whether the output contains required elements. Each check is binary, reproducible, and free to run.
3. Ensemble Judge (40%). All three frontier models independently grade each output on relevance, completeness, and accuracy, and the median of the three grades is used. No model grades its own homework unchecked.
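
The way the two components might combine can be sketched in a few lines. This is an illustration under stated assumptions, not the site's actual implementation: it assumes all scores are normalized to the range 0 to 1, that the rule score is the fraction of required patterns found, and that the final score is a plain weighted sum. The function names and the sample patterns are hypothetical.

```python
import re
from statistics import median

# Weights taken from the scoring description; everything else is assumed.
RULE_WEIGHT = 0.6
JUDGE_WEIGHT = 0.4


def rule_score(output: str, required_patterns: list[str]) -> float:
    """Fraction of required regex patterns found in the output.

    Each individual check is binary (found or not found), so the
    result is deterministic for a given output.
    """
    if not required_patterns:
        return 0.0
    hits = sum(1 for pattern in required_patterns if re.search(pattern, output))
    return hits / len(required_patterns)


def judge_score(grades: list[float]) -> float:
    """Median of the independent judge grades (assumed to be in 0..1)."""
    return median(grades)


def combined_score(output: str, required_patterns: list[str], grades: list[float]) -> float:
    """Weighted sum of the rule score and the ensemble judge score (assumed linear)."""
    return RULE_WEIGHT * rule_score(output, required_patterns) + JUDGE_WEIGHT * judge_score(grades)


if __name__ == "__main__":
    sample_output = "Summary: ok\nRisk: low"
    patterns = [r"Summary:", r"Risk:", r"Owner:"]  # hypothetical required elements
    grades = [0.9, 0.5, 0.7]  # one grade per judge model
    print(round(combined_score(sample_output, patterns, grades), 2))
```

Taking the median rather than the mean means a single outlier grade, high or low, cannot move the judge component on its own.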
