LLMs show strong potential in complex math reasoning, but their progress is bottlenecked by heavy reliance on human-curated data. What happens when we run out of high-quality human annotations? Can models teach themselves from scratch? 🤔
Today, we present CPMöbius, new research from @TsinghuaNLP (OpenBMB member) and collaborators: a novel collaborative "Coach-Player" paradigm that enables LLMs to self-evolve their reasoning capabilities in a completely data-free environment.
🤗 Paper: huggingface.co/papers/2602.0…
📄 arXiv: arxiv.org/abs/2602.02979
Why it matters:
1️⃣ Zero External Data Needed: No more expensive human engineering! CPMöbius breaks the data bottleneck through an autonomous dual-agent loop. A "Coach" LLM generates mathematical tasks, and a "Player" LLM learns to solve them using majority voting and self-training. 🔓
2️⃣ Collaborative, Not Adversarial: Unlike standard adversarial self-play, our Coach and Player are cooperative. The Coach applies dynamic difficulty filtering, keeping only questions that the Player solves 20%–80% of the time, so tasks stay challenging yet learnable and track the Player's current skill level. 🤝
3️⃣ Progress-Driven Rewards: How do we stop the Coach from generating useless or overly complex questions? The Coach is rewarded purely by the Player's actual performance improvement (Δ) on a validation set. The Coach uses the REINFORCE algorithm to update itself, meaning it only "wins" when the Player genuinely gets smarter! 📈
4️⃣ SOTA Unsupervised Reasoning: CPMöbius outperforms existing unsupervised methods (like RENT and R-Zero). On Qwen2.5-Math-7B-Instruct, it improves overall accuracy by 4.9 points and out-of-distribution (OOD) accuracy by 5.4 points, with consistent gains across multiple baseline models. 🚀
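For the curious, here is a minimal Python sketch of the three mechanics above: majority-vote pseudo-labels, the 20%–80% difficulty filter, and the progress-driven Coach reward. It is an illustration, not the paper's implementation. All function names are hypothetical, and it assumes the success rate is estimated as the fraction of sampled answers that agree with the majority answer, which the post does not specify.

```python
from collections import Counter


def majority_vote(answers):
    """Return (pseudo_label, agreement) for a list of sampled final answers.

    Assumption: agreement with the majority answer stands in for the
    Player's success rate, since no ground-truth label is available.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    label, count = Counter(answers).most_common(1)[0]
    return label, count / len(answers)


def keep_task(answers, low=0.2, high=0.8):
    """Difficulty filter: keep a task only if the estimated success
    rate falls inside [low, high] (20%-80% in the post)."""
    _, agreement = majority_vote(answers)
    return low <= agreement <= high


def coach_reward(acc_before, acc_after):
    """Progress-driven reward: change (delta) in the Player's
    validation accuracy after training on the Coach's tasks."""
    return acc_after - acc_before


def reinforce_loss(task_log_prob, reward):
    """Textbook REINFORCE surrogate loss for one generated task:
    -reward * log pi(task). Minimizing it raises the probability of
    tasks that led to positive Player progress."""
    return -reward * task_log_prob
```

For example, `majority_vote(["4", "4", "5", "4"])` returns `("4", 0.75)`, so that task passes the filter, while a task on which all sampled answers agree is dropped as too easy.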
CPMöbius shows that LLMs can collaboratively bootstrap their own intelligence, charting a scalable path toward true data-free self-evolution.
#AI #THUNLP #OpenBMB #LLM #ReinforcementLearning #MathReasoning #SelfEvolution