🚨New work: Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking (arxiv.org/abs/2509.21519)
In this work we propose a mathematical framework, named Li2, that explains the dynamics of grokking (i.e., delayed generalization) in 2-layer nonlinear networks. Specifically, it
1️⃣ Tells exactly what features will emerge during training.
2️⃣ Gives provable scaling laws of generalization/memorization, i.e., O(M log M) data samples suffice for generalization on a group arithmetic task over a group of order M.
3️⃣ Provides a more fundamental explanation for the popular empirical hypothesis that "generalization circuits learn slower but are more efficient than memorization circuits".
So how?