Great post. I think there is much more to unpack behind the spectrum-preserving update rule in Pion (spherelab.ai/pion). Jianlin derives a different spectrum-preserving update rule from a steepest-descent perspective, leading to an alternative orthogonalization (msign).
Steepest Descent on Manifolds: 6. Muon Double Rotation
kexue.fm/archives/11777
Introduces MuonR — a Muon variant that constrains updates to left & right rotation matrices. This preserves the singular value distribution of weights, providing a clean, elegant way to maintain training stability.
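To see why left and right rotations leave the singular values untouched, here is a minimal NumPy sketch. It is not MuonR's actual update rule (the linked post derives that); it only checks the invariance the post relies on. The helper `random_rotation` and the matrix sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(n, rng):
    """Hypothetical helper: sample an n x n rotation (orthogonal, det = +1)."""
    # QR of a Gaussian matrix yields an orthogonal factor q.
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Remove the sign ambiguity of QR by scaling each column of q.
    q = q * np.sign(np.diag(r))
    # Flip one column if needed so the determinant is +1.
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

# A stand-in weight matrix and a pair of left/right rotations.
W = rng.standard_normal((6, 4))
U = random_rotation(6, rng)
V = random_rotation(4, rng)

# Rotating on both sides changes the singular vectors, not the singular values.
W_new = U @ W @ V

sv_before = np.linalg.svd(W, compute_uv=False)
sv_after = np.linalg.svd(W_new, compute_uv=False)
print(np.max(np.abs(sv_before - sv_after)))
```

If W = A S Bᵀ is an SVD, then U W V = (U A) S (Vᵀ B)ᵀ is again an SVD with the same S, which is what the printed difference checks numerically.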
In PEFT-Arena (spherelab.ai/PEFT-Arena), we found a “free lunch” that improves both adaptation performance and preservation of general capabilities across PEFT methods, including full-parameter finetuning. The trick is simple: interpolate the weights and choose a midpoint between the fine-tuned model and the pretrained model. This can be useful in practice.
For Orthogonal Finetuning (OFT), the best interpolation is not linear. It should respect the orthogonal geometry, so the interpolation is performed within the orthogonal group (the computation is still very simple).
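A rough sketch of the two kinds of interpolation, in NumPy. The linear case is the usual (1 - t) * W_pre + t * W_ft. For the orthogonal case I assume the fine-tuned rotation R is interpolated by scaling its skew-symmetric Cayley generator; this is one simple way to stay inside the orthogonal group and may differ from what PEFT-Arena actually does. All function names here are hypothetical.

```python
import numpy as np

def cayley(Q):
    """Map a skew-symmetric Q to the orthogonal matrix (I + Q)(I - Q)^{-1}."""
    I = np.eye(Q.shape[0])
    return (I + Q) @ np.linalg.inv(I - Q)

def inverse_cayley(R):
    """Recover Q from R; assumes R + I is invertible (no eigenvalue at -1)."""
    I = np.eye(R.shape[0])
    return (R - I) @ np.linalg.inv(R + I)

def interp_linear(W0, W1, t):
    """Plain weight interpolation between two weight matrices."""
    return (1.0 - t) * W0 + t * W1

def interp_orthogonal(R, t):
    """Assumed scheme: scale the Cayley generator, so the result stays orthogonal."""
    return cayley(t * inverse_cayley(R))

def ortho_error(M):
    """Largest entry of |M^T M - I|; zero for an exactly orthogonal M."""
    return float(np.max(np.abs(M.T @ M - np.eye(M.shape[0]))))

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = 0.5 * (B - B.T)          # skew-symmetric generator
R = cayley(A)                # stand-in for a learned OFT rotation
I = np.eye(4)                # the pretrained model corresponds to R = I

R_lin = interp_linear(I, R, 0.5)    # leaves the orthogonal group
R_orth = interp_orthogonal(R, 0.5)  # stays in the orthogonal group

print(ortho_error(R_lin), ortho_error(R_orth))
```

With an OFT-style multiplicative update, the interpolated weight would then be `R_orth @ W_pre` rather than a blend of the two weight matrices.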
I find this project particularly elegant because it addresses a simple yet practically important question: should momentum be applied before or after the orthogonalization step? We study this question through the lens of spectral filtering and show that applying momentum before orthogonalization acts as a denoiser and can be provably better than applying momentum afterward.
🧠Why does Muon do momentum before orthogonalization?
✨Our key insight: momentum acts as a spectral filter for the matrix-valued gradient, yielding a more reliable update for the orthogonalization step.
📝Paper: arxiv.org/abs/2606.03899
🌐Project: yinleung.com/denoise-ortho
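A toy sketch of the two orderings, in NumPy. It is not the paper's experiment: the fixed "clean" gradient, the Gaussian noise model, the hyperparameters, and the names `msign` and `alignment` are all assumptions for illustration. Orthogonalization is done with an exact SVD rather than the Newton-Schulz iteration Muon uses in practice.

```python
import numpy as np

def msign(M):
    """Orthogonalize M: replace its singular values with ones (U @ V^T)."""
    u, _, vt = np.linalg.svd(M, full_matrices=False)
    return u @ vt

def alignment(A, B):
    """Cosine similarity between two matrices under the Frobenius inner product."""
    return float(np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B)))

rng = np.random.default_rng(0)
n, steps, beta, noise_std = 8, 50, 0.9, 3.0   # assumed toy settings

G_clean = rng.standard_normal((n, n))  # stand-in for the true gradient
target = msign(G_clean)                # update we would take without noise

m_before = np.zeros((n, n))  # momentum buffer over raw gradients
m_after = np.zeros((n, n))   # momentum buffer over orthogonalized gradients

for _ in range(steps):
    g = G_clean + noise_std * rng.standard_normal((n, n))  # noisy gradient
    # Momentum first: average the raw gradients, orthogonalize once at the end.
    m_before = beta * m_before + (1 - beta) * g
    # Orthogonalize first: each noisy gradient is orthogonalized, then averaged.
    m_after = beta * m_after + (1 - beta) * msign(g)

update_before = msign(m_before)
update_after = m_after

align_before = alignment(update_before, target)
align_after = alignment(update_after, target)
print(align_before, align_after)
```

In the first ordering the averaging happens before the nonlinear msign step, so the noise is smoothed before the singular vectors are extracted. In the second, msign is applied to each noisy gradient on its own. Comparing the two printed alignments for different `noise_std` values gives a feel for the effect the paper analyzes.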