Open source can plausibly beat the labs to AGI/ASI (for some definition of those terms), in the open, over the Internet. There are still details to work out, but:
* The flywheel is simply RL on many hard problems with verifiable outcomes.
* Writing verifiers, searching for excellent rollouts, and refining them are all embarrassingly parallel.
* There is more compute and software engineering talent outside of the labs than within them.
* @PrimeIntellect has already built and validated the data generation half of the flywheel. Their community of compute volunteers has produced and gifted 2M verified reasoning traces (2.5x what DeepSeek used for R1-Distilled).
This unlocks bushels of low-hanging fruit. Nearly anyone can accelerate the open source flywheel by working out any of the following:
1. Reasoning data selection
To date: LIMO and simplescaling 1.1 both find that ~1k carefully selected reasoning traces from R1 are sufficient to distill reasoning capability into Qwen2.5-Instruct models.
Careful selection to date has meant filtering for hard problems (dropping problems the instruct model can already solve) and sampling for diversity. But there is more to do:
A. Ensure the reasoning traces are in-distribution wrt the student by filtering samples for low loss. Just about every post-training paper of note published in the last year finds that in-distribution data is a big deal, including arxiv.org/pdf/2502.04194 and arxiv.org/abs/2502.02797, so this is almost certainly going to work. But it needs to be validated, thresholds need to be determined, and if there is a curriculum learning play (like easier samples first), that would be good to know.
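A minimal sketch of the sample-level filter, assuming you have already scored each trace with per-token losses under the student model. The function names, the `keep_fraction` knob, and the easiest-first ordering are illustrative assumptions, not anything taken from the cited papers.

```python
from statistics import mean

def mean_token_loss(token_losses):
    """Average per-token loss (negative log-likelihood) for one trace."""
    return mean(token_losses)

def select_low_loss(traces, keep_fraction=0.5, easiest_first=True):
    """Keep the lowest-loss fraction of traces.

    traces: list of (trace_id, [per-token losses under the student]).
    keep_fraction is a hypothetical knob; the right value is exactly
    the threshold that still needs to be determined empirically.
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not traces:
        return []
    scored = sorted((mean_token_loss(losses), tid) for tid, losses in traces)
    n_keep = max(1, int(len(scored) * keep_fraction))
    kept = scored[:n_keep]
    if not easiest_first:
        kept = kept[::-1]  # hardest of the kept samples first
    return [tid for _, tid in kept]

# Toy usage with made-up losses.
toy = [("a", [0.2, 0.4]), ("b", [2.0, 3.0]), ("c", [0.1, 0.1]), ("d", [1.0, 1.0])]
print(select_low_loss(toy, keep_fraction=0.5))
```

Returning the kept ids in loss order makes it cheap to try the easier-samples-first curriculum and its reverse on the same selection.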
B. Filtering may be done at token scale: with human data, if one masks high-loss tokens to spare the model OOD tokens (cf. arxiv.org/abs/2501.14315), ~3/4 of the performance hit from human data goes away. This simple hack might well work for synthetic data, too. But the threshold will be important to find.
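A rough sketch of token-scale masking, assuming per-token losses from the student are available. The fixed `threshold` and both helper names are assumptions for illustration; the cited paper may pick its cutoff differently.

```python
def high_loss_mask(token_losses, threshold):
    """Return 1 for tokens to train on, 0 for tokens masked as too surprising."""
    return [1 if loss <= threshold else 0 for loss in token_losses]

def masked_mean_loss(token_losses, mask):
    """Mean loss over the unmasked tokens only (0.0 if everything is masked)."""
    kept = [loss for loss, keep in zip(token_losses, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0

# Toy usage: two of four tokens exceed the (hypothetical) threshold of 2.0.
losses = [0.5, 4.0, 1.0, 6.0]
mask = high_loss_mask(losses, threshold=2.0)
print(mask, masked_mean_loss(losses, mask))
```

In a real training loop the mask would multiply the per-token loss before reduction, so the masked tokens contribute no gradient.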
C. Prompts may be filtered for quality by looking at the variance in output reward (cf. arxiv.org/abs/2501.18578). It's unclear if this applies to traces with verified outcomes, but it probably at least applies to the rest of the post-training mix.
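A sketch of the variance statistic only, assuming several sampled responses per prompt have already been scored by a reward model. This is not the cited paper's procedure: the `max_variance` cap, and the choice to keep low-variance prompts rather than high-variance ones, are assumptions to be tested.

```python
from statistics import pvariance

def reward_variance(rewards):
    """Population variance of the rewards sampled for a single prompt."""
    return pvariance(rewards)

def filter_prompts(rewards_by_prompt, max_variance):
    """Keep prompts whose sampled rewards vary by at most max_variance.

    rewards_by_prompt: dict of prompt -> list of reward scores.
    Prompts with fewer than two samples are skipped (no variance estimate).
    """
    kept = []
    for prompt, rewards in rewards_by_prompt.items():
        if len(rewards) < 2:
            continue
        if reward_variance(rewards) <= max_variance:
            kept.append(prompt)
    return kept

# Toy usage with made-up reward scores.
toy = {"p1": [1.0, 1.0, 1.0], "p2": [0.0, 1.0, 0.0, 1.0], "p3": [0.5]}
print(filter_prompts(toy, max_variance=0.1))
```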
2. Distillation data refinement
A. Pruning
Reasoning traces are often long and sloppy, which is slow, expensive, and gratingly neurotic for the user at inference time. At least one paper (arxiv.org/pdf/2501.12570) shows efficiency and accuracy benefits to pruning reasoning traces.
The targets here ought to be length (shorter is faster/cheaper), loss on the student model being distilled into (lower-loss samples limit harm to the base model's distribution), and accuracy (the final answer must remain correct).
Pruning limits (pivot token data mixing laws) also need to be established.
One risk of pruning reasoning traces is the loss of pivot tokens. Student models presumably need to see some of these!
So the optimal mix of pruning vs. retained pivot tokens needs to be determined. Determining where it is most beneficial to leave them in (perhaps on more difficult prompts) will also be high value.
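The pruning targets above (shorter, no higher loss on the student, still correct, enough pivot tokens left) can be sketched as a simple selection rule. Everything here is an assumption for illustration: the `Trace` fields, the `min_pivots` floor, and the idea that candidate prunings come with precomputed loss and a verified answer.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    n_tokens: int         # length of the reasoning trace
    student_loss: float   # mean loss under the student model
    answer_correct: bool  # did the final answer pass the verifier?
    n_pivots: int         # pivot tokens (e.g. "wait") still present

def pick_pruned(original, candidates, min_pivots=0):
    """Return the shortest viable pruning, or the original if none qualifies."""
    viable = [
        c for c in candidates
        if c.answer_correct
        and c.n_tokens < original.n_tokens
        and c.student_loss <= original.student_loss
        and c.n_pivots >= min_pivots
    ]
    if not viable:
        return original
    # Prefer shorter traces; break ties on lower student loss.
    return min(viable, key=lambda c: (c.n_tokens, c.student_loss))

# Toy usage with made-up numbers.
orig = Trace(1000, 1.2, True, 6)
cands = [Trace(400, 1.0, True, 0), Trace(500, 1.1, True, 2)]
print(pick_pruned(orig, cands, min_pivots=1))
```

Sweeping `min_pivots` per difficulty bucket is one cheap way to start probing the pruning vs. pivot token trade-off.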
B. Re-writing reasoning traces for low loss
Distillation reasoning traces are by definition off-policy for the student. As such, they are liable to be out of distribution, which is Bad.
We can fix this by re-writing high-loss portions of reasoning traces with a fill-in-the-middle (FIM) model, then verifying semantic equivalence and lower loss on the target model.
It may even be possible to do this very cheaply with @jeremyphoward's ModernBERT: use masked language modeling (like this: x.com/jeremyphoward/status/1…) to swap out tough tokens, and use the distance between the embeddings of the original and modified sentences to estimate semantic equivalence.
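A minimal sketch of the accept/reject check for a re-written span, assuming sentence embeddings and student losses have been computed elsewhere. Cosine similarity and the `min_similarity` threshold of 0.9 are assumptions standing in for whatever distance and cutoff actually work.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def accept_rewrite(orig_emb, new_emb, orig_loss, new_loss, min_similarity=0.9):
    """Accept a rewrite only if it lowers student loss and stays close in meaning."""
    return new_loss < orig_loss and cosine_similarity(orig_emb, new_emb) >= min_similarity

# Toy usage with made-up 2-d embeddings and losses.
print(accept_rewrite([1.0, 0.0], [1.0, 0.1], orig_loss=2.0, new_loss=1.5))
```

Embedding distance is only a proxy for semantic equivalence, so for verified traces it makes sense to re-run the verifier on the final answer as well.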
C. Reasoning condensation
<thinking> tokens are neat, but sometimes they're overkill, and many (e.g., Dario) do not find them aesthetically pleasing.
When, if ever, is it constructive to integrate them into the answer as a more normal CoT (like Claude)?
This seems particularly promising in cases where (say) an instruct version of a student model can answer the question correctly (so you would filter the question out if tuning from that instruct model), but you are starting fresh with the base model and want more verifiably correct instruct data in your SFT or midtraining mix.
D. Persona reasoning
Right now, R1 and its distillates think in slop. It will be much nicer for many purposes if reasoning models <think> in character. E.g., if my System Prompt says the model is Captain Ahab, "OK, the User wants me to pretend to be Captain Ahab, I should care about this whale," is lame. "I'll follow him around the Horn, and around the Norway maelstrom, and around perdition's flames before I give him up" is much more interesting, and likely to yield better results any time vibe is important (like creative writing) or the persona's particular reasoning patterns are at issue.
3. Determine prerequisites for reasoning distillation:
Qwen2.5-Instruct models, which were mid-trained on lots of synthetic data, require fewer than 1,000 examples to learn long chain of thought (cf. LIMO & simplescaling).
True base models require more than 1,000 examples, but no more than 800k examples (used in R1-Distilled). Dialing in filtered SFT and midtraining datasets is going to be a big deal.
Now that the gains from post-training small to medium sized models are huge again, it's an ideal time to dial this in. Tulu 3 is the data mix to beat, using clearer prompts (arxiv.org/abs/2501.18578), unencumbered data, and in-distribution rollouts of exemplary quality (cf. arxiv.org/abs/2502.01697, arxiv.org/abs/2412.04305, and arxiv.org/pdf/2411.08733 for ideas).
4. Demystify the unreasonable effectiveness of DoRA.
@winglian found that DoRA speeds up finetuning for reasoning by *a lot*. LoRA too, but not as much. Nobody knows why, but there's a quick and easy hypothesis to test: maybe full FT is slower than LoRA, which is slower than DoRA, because fewer weights are modified.
If so, we would expect @DiLuo28's QuanTA (a high-rank, extremely sparse PEFT method) to be even better. This is extremely quick to test.
5. Build simple test-time compute levers.
The simplescaling folks (@Muennighoff et al) validated and wrote up something @_xjdr, @voooooogel, and others tweeted:
One can simply insert pivot tokens and forbid end-of-thinking tokens to force the model to reason longer.
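A toy sketch of that lever, assuming a hypothetical `next_token(tokens)` callable that stands in for one decoding step of a real model. The `"</think>"` and `"Wait"` strings and the min/max budgets are illustrative choices, not a specific library's API.

```python
def force_longer_thinking(next_token, min_think_tokens, max_think_tokens,
                          end_token="</think>", pivot_token="Wait"):
    """Decode a thinking segment, swapping early end-of-thinking for a pivot token.

    next_token: hypothetical callable mapping the tokens so far to the next token.
    """
    tokens = []
    while len(tokens) < max_think_tokens:
        tok = next_token(tokens)
        if tok == end_token:
            if len(tokens) >= min_think_tokens:
                break  # budget met: let the model stop thinking
            tok = pivot_token  # too early: forbid the stop and nudge it onward
        tokens.append(tok)
    tokens.append(end_token)  # always close the thinking segment
    return tokens

# Stub "model" that tries to stop thinking after every two steps.
def stub(tokens):
    return "</think>" if len(tokens) % 3 == 2 else "step"

print(force_longer_thinking(stub, min_think_tokens=5, max_think_tokens=20))
```

The `max_think_tokens` cap matters as much as the floor: without it, a model that never emits the end token would think forever.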
The obvious next step for scaling TTC economically is branching on pivot tokens, i.e., if the model wants to 'wait', spawn a branch where that is replaced by an end-of-thinking token (or vice versa).
The other obvious thing is running multiple streams in parallel (which is reportedly what O1-Pro is doing).
From there, you need some means of selecting or aggregating one's responses. Majority voting, fuzzy majority voting via clustering embeddings, shortest-of-n, and selection via reward model have all been shown to work in some settings. They ought to be refined and compared.
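Two of the simplest selectors, sketched under the assumption that final answers have already been extracted as strings. The first-seen tie-break and measuring length in characters rather than tokens are arbitrary choices made for the example.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer; ties go to whichever appeared first."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer

def shortest_of_n(responses):
    """Return the shortest response (by character count) among n samples."""
    if not responses:
        raise ValueError("need at least one response")
    return min(responses, key=len)

# Toy usage with made-up samples.
print(majority_vote(["42", "41", "42", "7"]))
print(shortest_of_n(["a long rambling answer", "short", "medium one"]))
```

Fuzzy majority voting would replace exact string matching with clustering over answer embeddings, and reward-model selection would replace `key=len` with a learned score.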