To solve hard open math problems, we need AI models to train and self-improve indefinitely without more external data.
Humans can self-improve, so AI should as well if it imitates humans.
So we let the AI conjecture and prove as well, guided by its own sense of taste.
Self-play led to superhuman Go performance, so why hasn’t it done the same for LLMs?
In practice, long-run self-play plateaus, much like RL. We study why this happens and build a self-play algorithm that scales better. With a 7B model, it solves as many problems as a model 100x larger does at pass@4.
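
For readers unfamiliar with the metric: pass@4 counts a problem as solved if any of 4 sampled attempts is correct. The text above does not say how it was computed here, so the following is only a minimal sketch, assuming the commonly used unbiased estimator over n samples with c correct. The function name `pass_at_k` and the example numbers are illustrative, not taken from the source.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct.

    Assumes n attempts were sampled for a problem and c of them were correct
    (hypothetical inputs for illustration).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets that contain no correct attempt;
    # math.comb returns 0 when fewer than k incorrect attempts exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative only: 8 attempts on one problem, 2 of them correct.
estimate = pass_at_k(n=8, c=2, k=4)
```

Averaging this value over all problems in a benchmark gives the benchmark-level pass@4.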