Benchmarks
AI models tested head-to-head on real workflow tasks. Deterministic rules, ensemble judges, no vendor bias.
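For illustration only, here is a minimal sketch of how a scoring pipeline of the kind described above could combine deterministic rule checks with an ensemble of judge ratings. The function name, rule checks, and 50/50 weighting are assumptions for the example, not the benchmark's actual formula.

```python
from statistics import mean

def score_run(output: str, judge_scores: list[float], rule_weight: float = 0.5) -> float:
    """Illustrative scoring: deterministic rules plus an ensemble of judges.

    `judge_scores` are assumed to be 0-1 ratings from independent judge models;
    the rule checks and weighting below are placeholders, not the real harness.
    """
    # Deterministic rule checks (hypothetical examples): each is pass/fail.
    rules = [
        output.strip() != "",    # produced any output at all
        len(output) <= 4000,     # stayed within a length budget
        "TODO" not in output,    # no unfinished placeholders
    ]
    rule_score = sum(rules) / len(rules)

    # Ensemble judging: average the independent judge ratings.
    judge_score = mean(judge_scores) if judge_scores else 0.0

    return rule_weight * rule_score + (1 - rule_weight) * judge_score

# Example: one run rated by three judges.
print(round(score_run("Final answer ...", [0.9, 0.95, 1.0]), 3))
```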
Frontier Model Evals
Three Frontier Models, Five Agentic Coding Skills
Claude Opus 4.7, GPT-5.4, and Gemini 3.1 Pro benchmarked on real agentic coding tasks with iterative test execution (sketched below). All three pass every skill at 100%. The real story is cost and tool efficiency.
3 Models
5 Skills
45 Runs
Best Value: Gemini 3.1 Pro, 0.952 avg, $2.98
Best Overall: Claude Opus 4.7, 0.979 avg
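As a rough illustration of the "iterative test execution" mentioned above, the sketch below shows the shape of an agent loop that drafts a patch, applies it, runs the test suite, and feeds failures back until the tests pass or an attempt budget runs out. The `call_model`, `apply_patch`, and `run_tests` helpers and the attempt budget are hypothetical placeholders, not the benchmark harness itself.

```python
import subprocess

MAX_ATTEMPTS = 5  # attempt budget per task (placeholder value)

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a model API call that returns a unified diff."""
    raise NotImplementedError("wire this to your model provider of choice")

def apply_patch(patch: str) -> None:
    """Apply a unified diff to the working tree via `git apply`."""
    subprocess.run(["git", "apply", "-"], input=patch, text=True, check=True)

def run_tests() -> tuple[bool, str]:
    """Run the project's test suite and return (passed, combined output)."""
    proc = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def solve_task(task_description: str) -> bool:
    feedback = ""
    for attempt in range(MAX_ATTEMPTS):
        # Feed the previous test output back to the model on each retry.
        patch = call_model(f"{task_description}\n\nPrevious test output:\n{feedback}")
        apply_patch(patch)
        passed, feedback = run_tests()
        if passed:
            return True   # all tests green: the skill counts as passed
    return False          # budget exhausted: the run fails
```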
Customer Success
Customer Success Doesn't Need Opus
Nine models across three tiers benchmarked on ticket triage, complaint response, knowledge base generation, and churn analysis. The most expensive model scored the lowest.
9 Models
4 Skills
108 Runs
Best Value: GPT-4.1 Mini, 0.763 avg, $0.45/qp
Best Overall: Gemini 3.1 Pro, 0.852 avg
Upcoming
Sales: Lead scoring, outreach drafting, pipeline analysis, competitor intel
Marketing: Campaign copy, audience segmentation, content calendar, performance analysis
HR: Job descriptions, resume screening, onboarding plans, policy Q&A
IT / DevOps: Incident triage, runbook generation, capacity planning, config review
Legal: Contract review, compliance checks, risk assessment, policy drafting
Finance: Variance analysis, forecast modeling, expense audit, report generation
Get notified when new benchmarks drop.
Subscribe — it's free