Ilya said the quiet part out loud on Dwarkesh's pod, but most people still aren't processing what it means.
Here's what's actually happening inside AI labs.
Research teams have entire divisions that do nothing but create new RL training environments specifically designed to boost benchmark scores. They treat AIME, SWE-bench, and MMLU like standardized tests. The model practices 10,000 hours on competitive programming problems until every proof technique is at its fingertips.
Then it fails to fix a simple bug in production without introducing two new ones.
Sutskever used the perfect analogy. Student A grinds 10,000 hours of competitive programming. Memorizes every algorithm, every edge case, every proof technique. Becomes the #1 ranked competitive coder in the world. Student B practices 100 hours but has "it." Intuition. Taste. The ability to learn new things quickly.
Who has the better career? Student B. Current AI models are all Student A.
The benchmark gaming runs deeper than most realize. Studies have shown data contamination inflates model scores by 20-80% on popular benchmarks. The training-test boundary is porous. Models memorize answers rather than learn concepts. And when you control for contamination, much of what looks like intelligence is pattern-matching on seen data.
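For anyone wondering what "controlling for contamination" looks like in practice, here's a rough sketch of the simplest version: flag any benchmark item that shares a long word n-gram with the training corpus. The function names, the 8-word window, and the whitespace tokenization are all my own assumptions for illustration, not any lab's actual pipeline. Real audits tend to be more careful about normalization and paraphrase.

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=8):
    """Fraction of benchmark items sharing at least one n-gram with any training doc.

    Hypothetical helper: exact-match overlap only, so paraphrased leaks are missed.
    """
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    # An item is flagged if any of its n-grams also appears in the training data.
    flagged = [item for item in benchmark_items if ngrams(item, n) & train_grams]
    return len(flagged) / len(benchmark_items) if benchmark_items else 0.0
```

Score the model separately on flagged and unflagged items. If the gap is large, the headline number is partly memorization.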
This explains the economic puzzle Ilya pointed to. Models score 100% on AIME 2025. They hit 70% on GDPval, beating human professionals. Yet businesses still struggle to extract value. The benchmark performance says genius. The P&L says otherwise.
The sample efficiency gap tells you everything. A human teenager learns to drive any car after 10 hours. An AI model might need millions of examples and still fail on slight variations. A human learns a concept once and applies it everywhere. Models need to see the exact pattern thousands of times and still choke when the formatting changes slightly.
Sutskever's diagnosis: we're moving from the "age of scaling" (2020-2025) back to the "age of research." The belief that 100x more compute would transform everything is dying. His $3B company SSI is betting that the next breakthrough comes from solving generalization, not stacking more GPUs.
The labs know this.
That's why the benchmark arms race is accelerating.
It's easier to show impressive numbers than admit the fundamental approach might be plateauing.
Ilya is 100% correct. It's a pattern that keeps repeating.
It's very clear with GPT-5.2.
Overfit the model to produce impressive-looking benchmark numbers, have it excel in a few domains, and watch it fall flat in many others.
There's not enough generalization, and even where it exists, the model has been so heavily reinforced that it gets buried.