Continuing Tutorial II for Physics of Language Models.
We often trust large-scale results simply because they are large. But once the noise is removed, the synthetic pretraining playground starts to push back, and hard!
The second video (Part 4.1b, 90 minutes) makes this pushback concrete.
From it, I derive 20 architectural principles, organized into 12 result blocks.
Two highlights that consistently surprise even experienced readers:
Result 2.1 (new):
"Why Canon layers actually work."
Not because of multi-token attention — that explanation only applies to the first layer.
The real mechanism is how Canon reshapes hierarchical learning across depth.
Result 11:
"Why linear models reason 4× shallower than Transformers."
This has nothing to do with memory size —
it is a structural failure shared by nearly all linear architectures.
In Result 12, I show which of these principles already emerge at academic-scale pretraining (1.3B parameters / 100B tokens),
at orders-of-magnitude lower cost and with far cleaner signals than many real-life large-scale runs.
The remaining principles do not disappear; they only emerge when scaling to 8B parameters / 1T tokens, which I will show in the third video (Part 4.2).
⏮️ Previous: Part 4.1a — methodology & playground design
▶️ This: Part 4.1b — architectural principles from the playground
🔜 Next: Part 4.2 — when the playground reshapes real-life pretraining
Image: "What Emerges from the Playground: Canon Layers & Architectural Principles"