Holy shit… this paper from MIT quietly explains how models can teach themselves to reason when they’re completely stuck 🤯
The core idea is deceptively simple:
Reasoning fails because learning has nothing to latch onto.
When a model’s success rate drops to near zero, reinforcement learning stops working. No reward signal. No gradient. No improvement. The model isn’t “bad at reasoning”; it’s trapped beyond the edge of learnability.
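One way to see the "no gradient" part: many policy-gradient methods weight each sampled attempt by its reward minus a baseline, such as the mean reward of the group. The sketch below assumes binary rewards and a mean baseline; the function name is illustrative and not taken from the paper.

```python
# Toy illustration (an assumption, not the paper's training code):
# advantages computed as reward minus the group-mean baseline.
def group_advantages(rewards):
    """Return each reward minus the mean reward of the group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Every sampled attempt failed: all advantages are zero, so an update
# weighted by these advantages does not move the model at all.
all_failed = group_advantages([0, 0, 0, 0, 0, 0, 0, 0])

# One attempt in eight succeeded: the advantages are nonzero and carry signal.
one_success = group_advantages([0, 0, 0, 1, 0, 0, 0, 0])

print(all_failed)
print(one_success)
```

With zero successes there is nothing to separate a good attempt from a bad one, which is the sense in which the model is stuck rather than simply weak.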
This paper reframes the problem.
Instead of asking “How do we make the model solve harder problems?”
They ask: “How does a model create problems it can learn from?”
That’s where SOAR comes in.
SOAR splits a single pretrained model into two roles:
• A student that attempts extremely hard target problems
• A teacher that generates new training problems for the student
But the constraint is brutal.
The teacher is never rewarded for clever questions, diversity, or realism.
It’s rewarded only if the student’s performance improves on a fixed set of real evaluation problems.
No improvement? No reward.
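The incentive described above can be sketched as a toy loop. Everything here is a hypothetical stand-in: `ToyStudent`, the numeric "difficulty" scale, and the rule that only problems slightly above current skill teach anything are assumptions made for illustration, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass
class ToyStudent:
    """Hypothetical student whose ability is a single number."""
    skill: float = 0.0

    def solves(self, difficulty: float) -> bool:
        return self.skill >= difficulty

    def train_on(self, difficulty: float) -> None:
        # Toy assumption: only problems slightly above current skill teach anything.
        if self.skill < difficulty <= self.skill + 1.0:
            self.skill = difficulty


def eval_accuracy(student, eval_set):
    """Fraction of the fixed evaluation problems the student solves."""
    return sum(student.solves(d) for d in eval_set) / len(eval_set)


def teacher_reward(student, proposed, eval_set):
    """Reward the teacher only by the student's gain on the fixed eval set."""
    before = eval_accuracy(student, eval_set)
    for difficulty in proposed:
        student.train_on(difficulty)
    after = eval_accuracy(student, eval_set)
    return after - before


EVAL_SET = [2.0, 3.0, 4.0, 5.0]  # fixed target problems, all unsolved at the start

# Proposing only target-level problems: the student learns nothing.
direct = teacher_reward(ToyStudent(), [5.0, 5.0, 5.0], EVAL_SET)

# Proposing a staircase of intermediate problems: the student improves.
staircase = teacher_reward(ToyStudent(), [1.0, 2.0, 3.0], EVAL_SET)

print(direct, staircase)
```

In this toy setup the teacher gets nothing for proposing hard problems and is paid only when its proposals move the student on the fixed evaluation set.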
This changes the dynamics completely.
The teacher isn’t optimizing for aesthetics or novelty.
It’s optimizing for learning progress.
Over time, the teacher discovers something humans usually hard-code manually:
Intermediate problems.
Not solved versions of the target task.
Not watered-down copies.
But problems that sit just inside the student’s current capability boundary: close enough to learn from, far enough to matter.
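A rough way to picture that boundary, assuming a binary pass/fail reward: if the student solves a problem with probability p, the reward has variance p(1 - p). Using that variance as a proxy for how much there is to learn is an assumption made here for illustration, not a measure taken from the paper.

```python
def reward_variance(p):
    """Variance of a pass/fail reward when the success probability is p."""
    return p * (1.0 - p)


# Impossible (p = 0) and trivial (p = 1) problems give no spread in reward;
# problems the student solves some of the time give the most.
for p in (0.0, 0.01, 0.5, 0.99, 1.0):
    print(f"p={p:.2f}  variance={reward_variance(p):.4f}")
```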
Here’s the surprising part.
Those generated problems do not need correct answers.
They don’t even need to be solvable by the teacher.
What matters is structure.
If the question forces the student to reason in the right direction, gradient signal emerges even without perfect supervision. Learning happens through struggle, not imitation.
That’s why SOAR works where direct RL fails.
Instead of slamming into a reward cliff, the student climbs a staircase it helped build.
The experiments make this painfully clear.
On benchmarks where models start at absolute zero (literally 0 successes), standard methods flatline. With SOAR, performance begins to rise steadily as the curriculum reshapes itself around the model’s internal knowledge.
This is a quiet but radical shift.
We usually think reasoning is limited by model size, data scale, or training compute.
This paper suggests another bottleneck entirely:
Bad learning environments.
If models can generate their own stepping stones, many “reasoning limits” stop being limits at all.
No new architecture.
No extra human labels.
No bigger models.
Just better incentives for how learning unfolds.
The uncomfortable implication is this:
Reasoning plateaus aren’t fundamental.
They’re self-inflicted.
And the path forward isn’t forcing models to think harder; it’s letting them decide what to learn next.