New FinePhrase result: the best synthetic-to-real ratio for pretraining isn't 50/50.
Quick context: FinePhrase is our open 486B-token synthetic pretraining dataset. We take FineWeb-Edu web text, rephrase it with a small 1.7B model (SmolLM2) into four structured formats (FAQs, math problems, tables, tutorials), and then train on a mix of original and synthetic data. The whole recipe came out of 90 controlled pretraining experiments.
The new question we tackled: how much of that mix should actually be synthetic? We swept the synthetic fraction from 10% to 90% for each format. Every format's optimum sits higher than the uniform 50/50, and it's format-dependent: tables peak at 70% synthetic, math at 80%, FAQ and tutorials at 60%. The curves climb to their peak and then plateau rather than collapsing, so there's a wide safe band and no sign of the "too much synthetic = model collapse" failure mode.
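For anyone who wants to try a sweep like this on their own data, here is a minimal sketch of the mixing step. It is not our training code. The helper names (`token_budget`, `mix_stream`) are hypothetical, and it assumes the ratio is applied by sampling each document from one of two streams, which approximates the target token fraction only when documents in both streams are of similar length.

```python
import random

# Per-format optima reported above (fraction of the mix that is synthetic).
BEST_SYNTHETIC_FRACTION = {"tables": 0.7, "math": 0.8, "faq": 0.6, "tutorials": 0.6}

# Sweep grid from 10% to 90% synthetic. The 10-point step size is an assumption.
SWEEP = [i / 10 for i in range(1, 10)]


def token_budget(total_tokens, synthetic_fraction):
    """Split a total token budget into synthetic and original parts."""
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")
    synthetic = round(total_tokens * synthetic_fraction)
    return {"synthetic": synthetic, "original": total_tokens - synthetic}


def mix_stream(original, synthetic, synthetic_fraction, seed=0):
    """Yield documents, picking the synthetic stream with the given probability.

    Stops as soon as the chosen stream runs out, so the mix never drifts
    away from the requested ratio by draining only one source.
    """
    rng = random.Random(seed)
    orig_it, syn_it = iter(original), iter(synthetic)
    while True:
        source = syn_it if rng.random() < synthetic_fraction else orig_it
        try:
            yield next(source)
        except StopIteration:
            return


if __name__ == "__main__":
    # Example: the tables optimum applied to a 21B-token run.
    print(token_budget(21_000_000_000, BEST_SYNTHETIC_FRACTION["tables"]))
```

Seeding the sampler keeps the mix reproducible across runs, which matters when comparing neighbouring points on the sweep.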
This also sets a new state of the art among synthetic pretraining datasets. Our best config (tables at 70% synthetic) is 31% better than REWIRE, the strongest rephrasing baseline, and reaches the same quality 3.2x faster. REWIRE used a 70B-parameter model. We get there with a 1.7B rephraser whose tokens are also roughly 30x cheaper to generate.
A caveat: these results are at small scale (1.7B parameters, 21B training tokens), so they might not transfer to larger training runs.
Read the updated playbook:
huggingface.co/spaces/Huggin…