Is QK Norm fighting the Muon Optimizer in LLM training? 📉
I’ve been researching the relationship between Query-Key Normalization (QK Norm) and the Muon optimizer, and the preliminary results regarding "Dimension Collapse" are fascinating.
Here is what the data is showing so far:
1. The Phenomenon of Dimension Collapse
In an attention head with 128 dimensions, the model often uses only about 51–65 of them. The rest tend to collapse to zero or near-zero during training.
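One way to count collapsed dimensions, as a rough sketch: treat a dimension as "active" if its RMS over tokens is above a small fraction of the largest per-dimension RMS. The function name count_active_dims, the 1% threshold, and the toy data are all illustrative assumptions, not the exact measurement behind the numbers above.

```python
import math

def count_active_dims(activations, rel_threshold=0.01):
    """Count dimensions of one attention head that carry non-negligible signal.

    activations: list of token vectors (each a list of floats) for one head.
    A dimension counts as active when its RMS over tokens exceeds
    rel_threshold times the largest per-dimension RMS (assumed criterion).
    """
    num_tokens = len(activations)
    head_dim = len(activations[0])
    rms = [
        math.sqrt(sum(tok[d] ** 2 for tok in activations) / num_tokens)
        for d in range(head_dim)
    ]
    peak = max(rms)
    if peak == 0.0:
        return 0  # every dimension is exactly zero
    return sum(1 for r in rms if r > rel_threshold * peak)

# Toy 4-dimensional head: the last two dimensions are zero or near-zero.
toy = [
    [1.0, -2.0, 0.0, 1e-6],
    [0.5, 1.5, 0.0, -1e-6],
    [-1.0, 0.5, 0.0, 2e-6],
]
active = count_active_dims(toy)
print(active, "of", len(toy[0]), "dimensions active")
```

The count depends heavily on the threshold, so it is worth reporting the threshold alongside any "N of 128" figure.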
2. The Impact of Gamma (Scaling)
I compared three training runs to test this:
No QK Norm: Moderate dimension collapse.
QK Norm (Fixed Gamma): Worst collapse. It seems Muon and the norm might be "fighting" each other here.
QK Norm (Learnable Gamma): Best results with the fewest collapsed dimensions. The learned Gamma likely stretches those small dimensions so they remain relevant.
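To make the fixed vs. learnable Gamma difference concrete, here is a minimal sketch. It assumes an RMSNorm-style QK Norm with a per-dimension Gamma; the post does not specify the exact norm, and the Gamma values below are hand-picked for illustration, not learned.

```python
import math

def qk_norm(vec, gamma, eps=1e-6):
    """RMS-normalize a query or key vector, then scale each dimension by gamma.

    Assumed form: gamma[d] * vec[d] / sqrt(mean(vec**2) + eps).
    """
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [g * x / rms for g, x in zip(gamma, vec)]

# A query whose third dimension is close to collapsing.
q = [3.0, 4.0, 0.01, 0.0]

# Fixed Gamma: every dimension is scaled by 1, so the small one stays small.
fixed = qk_norm(q, [1.0, 1.0, 1.0, 1.0])

# Learnable Gamma (hypothetical values): a large entry stretches the small dimension.
learned = qk_norm(q, [1.0, 1.0, 50.0, 1.0])

print("fixed  :", [round(x, 4) for x in fixed])
print("learned:", [round(x, 4) for x in learned])
```

With a fixed Gamma the normalization alone cannot change the relative size of dimensions, which is consistent with the idea that a learned Gamma is what keeps small dimensions relevant.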
3. The Loss Paradox
Here is the interesting part: despite the dimension collapse issues, both QK Norm runs reached a better loss than the run without it.
What’s Next?
Based on feedback from a PhD student at Peking University, my next experiment involves forcing attention heads to be orthogonal to each other to potentially prevent this collapse and reduce wasted compute.
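One possible way to encourage orthogonal heads is a penalty on the overlap between their projection matrices. This is only a sketch of the idea under my own assumptions: the function cross_head_penalty and the choice of the squared Frobenius norm of W_i^T W_j are hypothetical, and the actual experiment may use a different formulation.

```python
def cross_head_penalty(heads):
    """Sum over head pairs i < j of the squared Frobenius norm of W_i^T W_j.

    heads: list of matrices, each given as d_model rows of head_dim floats.
    The penalty is zero when every column of one head is orthogonal to
    every column of every other head.
    """
    total = 0.0
    for i in range(len(heads)):
        for j in range(i + 1, len(heads)):
            wi, wj = heads[i], heads[j]
            d_model = len(wi)
            for a in range(len(wi[0])):
                for b in range(len(wj[0])):
                    dot = sum(wi[r][a] * wj[r][b] for r in range(d_model))
                    total += dot * dot
    return total

# Two toy heads in a 4-dimensional model, each with 2 columns.
head_a = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
head_b = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

orthogonal = cross_head_penalty([head_a, head_b])   # disjoint subspaces
overlapping = cross_head_penalty([head_a, head_a])  # identical heads
print(orthogonal, overlapping)
```

In training, a term like this would be added to the loss with a small weight, which is another hyperparameter to tune.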
Takeaway:
Don’t hoard your ideas. By sharing these early, raw findings, I’m getting rapid feedback that accelerates the research.
Has anyone else experienced this friction between Muon and Normalization layers?
Thank you @novita_labs for providing compute for this research.
#AIResearch #LLM #MachineLearning #DeepLearning #MuonOptimizer #PublicLearning