In SuperBPE we found: as tokenizer compression increases, the compute-optimal ratio of train tokens to model params decreases — and remarkably, corresponds to the same underlying ratio of train *bytes* / param! Our new work makes it official: scaling laws depend on compression.
We present Compute Optimal Tokenization! 🔡
Most LLM scaling work sticks to one tokenizer, sweeping only data and model size.
But what happens when we control the tokenizer’s compression rate (bytes/token)?
Here we sweep tokenizers, params, and data across compute budgets: [1/N]
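A quick sketch of the arithmetic behind the SuperBPE observation: if the compute-optimal train bytes per param stays roughly fixed, then optimal tokens per param is just that value divided by the tokenizer's bytes/token. The function name and all numbers below are made up for illustration and are not results from either work.

```python
def tokens_per_param(bytes_per_param: float, bytes_per_token: float) -> float:
    """Convert a bytes/param ratio into tokens/param for a given compression rate."""
    return bytes_per_param / bytes_per_token


# Hypothetical fixed optimum in bytes per parameter (illustrative only).
BYTES_PER_PARAM = 90.0

# Hypothetical tokenizers with increasing compression (bytes per token).
for bytes_per_token in (3.0, 4.5, 6.0):
    ratio = tokens_per_param(BYTES_PER_PARAM, bytes_per_token)
    print(f"{bytes_per_token:.1f} bytes/token -> {ratio:.1f} tokens/param")
```

Under that assumption, higher compression means fewer optimal tokens per param while bytes per param does not move.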