The last benchmark for agents? Agents' Last Exam (ALE) evaluates agents on 1,000 real-world professional tasks across 55 industries, all sourced from actual expert work. Not synthetic. Not multiple choice. Real deliverables, graded deterministically.
Key findings:
- Best agents score <50% on the easiest tier, <10% on the hardest
- The same setup that scores 82% on Terminal-Bench drops to 23% on the ALE-CLI eval
- Hardest tier: most frontier agents hit a 0% pass rate
- Spending more tokens doesn't improve results
- Each run tracks harness, model, pass rate, token usage, and cost
Harness vs. model:
- Best harness scores 24.0%, worst scores 19.1% (same model). That's a 4.9pp gap.
- Model choice drives more performance variation than the harness.
- Most efficient setup used 160M tokens for 39.6%. Least efficient burned 1,373M tokens for 40.5%.
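A quick sanity check on the numbers above, as a minimal Python sketch. The `Run` record and its field names are made up for illustration and are not ALE's actual schema; only the figures come from the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Hypothetical record for one benchmark run (not ALE's real schema)."""
    label: str
    pass_rate_pct: float    # pass rate in percent
    tokens_millions: float  # total tokens used, in millions


# Figures quoted in the post.
efficient = Run("most efficient", 39.6, 160)
costly = Run("least efficient", 40.5, 1373)

# Harness gap on the same model, in percentage points.
harness_gap = round(24.0 - 19.1, 1)

# How many times more tokens the least efficient setup used.
token_ratio = costly.tokens_millions / efficient.tokens_millions

# What those extra tokens bought, in percentage points.
pass_rate_gain = round(costly.pass_rate_pct - efficient.pass_rate_pct, 1)

print(f"harness gap: {harness_gap}pp")
print(f"token ratio: {token_ratio:.1f}x for +{pass_rate_gain}pp")
```

Roughly 8.6x the tokens buys less than one percentage point, which is the "more tokens doesn't help" finding in concrete terms.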
Where agents break (they often say "Done. All checks pass." while the output is wrong):
- 47% of failures: wrong strategy or gave up early
- 31%: missing domain knowledge
- 22%: execution bugs and format errors
- 34% of tasks need GUI software; agents avoid it and hack together CLI workarounds
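One thing worth noticing in the breakdown: the first three shares are fractions of failures and sum to 100%, while the GUI figure is a fraction of all tasks. A small Python sketch makes the two denominators explicit. The dictionary keys are my own shorthand, and the task count assumes the 1,000 tasks stated at the top of the post.

```python
# Shares of FAILURES, as quoted in the post (keys are shorthand labels).
failure_shares_pct = {
    "wrong strategy or gave up early": 47,
    "missing domain knowledge": 31,
    "execution bugs and format errors": 22,
}

# The three categories partition the failures.
total_failure_share = sum(failure_shares_pct.values())

# The GUI figure is a share of ALL tasks, not of failures.
total_tasks = 1000
gui_share_pct = 34
gui_tasks = total_tasks * gui_share_pct // 100

print(f"failure categories cover {total_failure_share}% of failures")
print(f"about {gui_tasks} of {total_tasks} tasks need GUI software")
```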
Very excited to see a benchmark like this. Big kudos to everyone who contributed.