Whoa… what if we used this approach on humans? I.e., train AI teachers to help humans, and make the humans' improvement their reward function? Going to try this with openclaw and see what happens.
MIT just published a paper that quietly explains why LLM reasoning hits a wall and how to push past it.
The usual story is that models fail on hard problems because they lack scale, data, or intelligence.
This paper argues something much more structural: models stop improving because the learning signal disappears. Once a task becomes too difficult, success rates collapse toward zero, reinforcement learning has nothing to optimize, and reasoning stagnates. The failure isn’t cognitive; it’s pedagogical.
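A minimal sketch of why the signal vanishes, assuming a group-normalized advantage of the kind used in some policy-gradient setups. The post does not say which RL algorithm the paper uses, so `group_advantages` and the reward lists below are hypothetical names for illustration only. When every rollout fails, every advantage is zero, so the update has nothing to push on.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of rollouts.

    Assumption: this mirrors a common baseline trick, not necessarily
    the exact objective used in the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Task too hard: all 8 attempts fail, so every reward is 0.
all_fail = [0.0] * 8
# Task near the capability boundary: a couple of attempts succeed.
mixed = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]

adv_fail = group_advantages(all_fail)
adv_mixed = group_advantages(mixed)

# No attempt is better than any other, so there is no direction to improve in.
print("all-fail advantages are all zero:", all(a == 0.0 for a in adv_fail))
# Successes get positive advantage, failures negative: a usable signal.
print("mixed advantages carry signal:", any(a != 0.0 for a in adv_mixed))
```

The second case is the regime the teacher is trying to steer the student into: problems hard enough to matter, but easy enough that some attempts succeed.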
The authors propose a simple but radical reframing. Instead of asking how to make models solve harder problems, they ask how models can generate problems that teach them.
Their system, SOAR, splits a single pretrained model into two roles: a student that attempts extremely hard target tasks, and a teacher that generates new training problems. The catch is that the teacher is not rewarded for producing clever or realistic questions. It is rewarded only if the student’s performance improves on a fixed set of real evaluation problems. No improvement means zero reward.
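A rough sketch of that incentive, assuming the reward is the accuracy gain on the fixed evaluation set, clipped at zero. The post only says that no improvement means zero reward, so the exact formula, the function names, and the toy students here are illustrative assumptions, not the paper's implementation.

```python
def accuracy(student, eval_set):
    """Fraction of fixed evaluation problems the student answers correctly."""
    correct = sum(1 for question, answer in eval_set if student(question) == answer)
    return correct / len(eval_set)

def teacher_reward(student_before, student_after, eval_set):
    """Reward the teacher only for measured student improvement.

    Assumption: the reward is the accuracy gain clipped at zero. The
    generated problems themselves are never scored for quality.
    """
    gain = accuracy(student_after, eval_set) - accuracy(student_before, eval_set)
    return max(0.0, gain)

# Toy stand-ins (hypothetical): the eval task is squaring a number.
eval_set = [(2, 4), (3, 9), (4, 16)]
untrained = lambda x: 0        # student before training on teacher problems
trained = lambda x: x * x      # student after training on teacher problems

print(teacher_reward(untrained, trained, eval_set))    # student improved
print(teacher_reward(untrained, untrained, eval_set))  # no improvement
```

In a real loop, `student_after` would be the student updated on the teacher's generated problems; here it is hard-coded to keep the sketch self-contained.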
That incentive reshapes everything.
The teacher learns to generate intermediate, stepping-stone problems that sit just inside the student’s current capability boundary. These problems are not simplified versions of the target task, and strikingly, they do not even require correct solutions.
What matters is that their structure forces the student to practice the right kind of reasoning, allowing a gradient signal to emerge even when direct training on the target tasks yields none.
The experimental results make the point painfully clear. On benchmarks where models start with zero success and standard reinforcement learning completely flatlines, SOAR breaks the deadlock and steadily improves performance.
The model escapes the edge of learnability not by thinking harder, but by constructing a better learning environment for itself.
The deeper implication is uncomfortable. Many supposed “reasoning limits” may not be limits of intelligence at all. They are artifacts of training setups that assume the world provides learnable problems for free.
This paper suggests that if models can shape their own curriculum, reasoning plateaus become engineering problems, not fundamental barriers.
No new architectures, no extra human data, no larger models. Just a shift in what we reward: learning progress instead of answers.