Our new paper "Deriving neural scaling laws from the statistics of natural language" (arxiv.org/abs/2602.07488), led by @Fraccagnetta & @AllanRaventos w/ Matthieu Wyart, makes a breakthrough! For the very first time, we can predict data-limited neural scaling law exponents from first principles, using the structure of natural language itself!
If you give us two properties of your natural language dataset:
1) How the conditional entropy of the next token decays with conditioning length.
2) How pairwise token correlations decay with time separation.
Then we can give you the exponent of the neural scaling law (loss versus data amount) through a simple formula!
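For anyone who wants to try measuring these two properties on their own corpus, here is a rough sketch of how one might do it. It is not the paper's code: the function names `conditional_entropy` and `token_correlation` are made up for illustration, the entropy is a naive n-gram plug-in estimate, and using the Frobenius norm of the token-token covariance as the "pairwise correlation" is an assumption on our part. The exponent formula itself is in the paper and is not reproduced here.

```python
import math
from collections import Counter


def conditional_entropy(tokens, n):
    """Plug-in estimate (in nats) of H(next token | previous n tokens).

    Naive n-gram counting; an illustrative assumption, not the paper's estimator.
    """
    ctx_counts = Counter()
    joint_counts = Counter()
    for i in range(n, len(tokens)):
        ctx = tuple(tokens[i - n:i])
        ctx_counts[ctx] += 1
        joint_counts[(ctx, tokens[i])] += 1
    total = len(tokens) - n
    h = 0.0
    for (ctx, _tok), c in joint_counts.items():
        # p(ctx, tok) * log p(tok | ctx)
        h -= (c / total) * math.log(c / ctx_counts[ctx])
    return h


def token_correlation(tokens, t):
    """Frobenius norm of C_ab(t) = P(x_i=a, x_{i+t}=b) - P(a) P(b).

    Measuring correlation strength by this norm is an assumption made here.
    """
    n_pairs = len(tokens) - t
    left, right = tokens[:n_pairs], tokens[t:]
    joint = Counter(zip(left, right))
    p_left = {a: c / n_pairs for a, c in Counter(left).items()}
    p_right = {b: c / n_pairs for b, c in Counter(right).items()}
    # Expand sum_ab (P_ab - P_a P_b)^2 so only observed pairs are visited.
    sum_joint_sq = sum((c / n_pairs) ** 2 for c in joint.values())
    cross = sum((c / n_pairs) * p_left[a] * p_right[b]
                for (a, b), c in joint.items())
    sum_prod_sq = (sum(p * p for p in p_left.values())
                   * sum(p * p for p in p_right.values()))
    return math.sqrt(max(sum_joint_sq - 2.0 * cross + sum_prod_sq, 0.0))


if __name__ == "__main__":
    toy = [0, 1] * 50  # perfectly periodic toy sequence, not real language
    for n in range(3):
        print("context length", n, "entropy", conditional_entropy(toy, n))
    for t in range(1, 4):
        print("separation", t, "correlation", token_correlation(toy, t))
```

Keep in mind that the plug-in entropy estimate is biased downward once the context length gets large relative to the corpus size, since most long contexts are then seen only once, so the decay curve is only trustworthy at short conditioning lengths unless you have a lot of data.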
The key idea: as you increase the amount of training data, models can look further back into the past to predict the next token. As long as they do this well, the loss is completely governed by the conditional entropy of the next token, conditioned on all tokens up to this data-dependent prediction time horizon. This gets us our simple formula for the neural scaling law!
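In symbols, the key idea can be written schematically as below. The notation is ours for this post and may not match the paper: $D$ is the amount of training data, $n^*(D)$ is the data-dependent prediction time horizon, and $\mathcal{L}(D)$ is the loss. How $n^*$ grows with $D$, and the resulting exponent, are derived in the paper and not reproduced here.

```latex
% Schematic restatement only; symbols D, n^*(D), and \mathcal{L} are assumed notation.
\mathcal{L}(D) \;\approx\; H\!\left(x_t \,\middle|\, x_{t-n^*(D)}, \dots, x_{t-1}\right),
\qquad n^*(D) \text{ increasing in } D .
```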