Joined July 2011
Pinned Tweet
1/ Let me chip in on the recent “which optimizer rules them all” discussion with a somewhat more moderate take, asking: What Schatten-p norm to use? Turns out the answer is regime dependent! Specifically, even when smooth in Schatten-∞, Muon is not necessarily the best choice.
I might also add that it should be possible to draw similar conclusions for Shampoo and PSGD parameterizations for certain ranges by using @varunneal's observation: x.com/varunneal/status/20665…

Replying to @evaninwords
in fact, whitening mode psgd w criteria c(P; g, v) = P[g,g] + P^{-1}[v,v] can be replaced w "generalized" form c_k(P; g, v) = P[g,g] + P^{-k}[v,v] which has fixed pt corresponding to spectral power (k-2)/(k+2)
SoftMuon p = shampoo (p-1)/4 = Whitening PSGD 2(1+p)/(1-p)
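The correspondence in this reply can be checked with a few lines of arithmetic. The sketch below takes the tweet's formulas at face value: a SoftMuon spectral power p maps to a Shampoo exponent (p-1)/4 and to a generalized whitening-PSGD parameter k = 2(1+p)/(1-p), with (k-2)/(k+2) as the inverse map. The function names are hypothetical and not taken from any of the optimizers' codebases.

```python
from fractions import Fraction


def shampoo_exponent(p):
    """Shampoo exponent matching spectral power p, per the tweet: (p - 1) / 4."""
    return (Fraction(p) - 1) / 4


def psgd_k(p):
    """Generalized whitening-PSGD parameter k for spectral power p: 2(1 + p) / (1 - p)."""
    p = Fraction(p)
    if p == 1:
        raise ValueError("k is undefined at p = 1 (division by zero)")
    return 2 * (1 + p) / (1 - p)


def spectral_power_from_k(k):
    """Spectral power at the fixed point of c_k, per the tweet: (k - 2) / (k + 2)."""
    k = Fraction(k)
    if k == -2:
        raise ValueError("spectral power is undefined at k = -2")
    return (k - 2) / (k + 2)


if __name__ == "__main__":
    for p in (Fraction(0), Fraction(1, 2), Fraction(-1, 3)):
        k = psgd_k(p)
        # The two PSGD formulas should be inverses of each other.
        assert spectral_power_from_k(k) == p
        print(f"p={p}  shampoo exponent={shampoo_exponent(p)}  psgd k={k}")
```

Under these formulas, p = 0 gives a Shampoo exponent of -1/4 (the usual Shampoo choice) and k = 2, and the two PSGD expressions invert each other exactly.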
Thomas Pethick retweeted
Result #32: @mihai673 has achieved a 30-step improvement over the old 2026/05/09 record by adding a SODA (Pethick et al. 2026)-style anchor towards init. It is unknown whether this technique can also improve the current record. 2/5
Thomas Pethick retweeted
May 27
SODA-AMUSE Gram PMuon is the recipe that wins consistently on 4/4 codex GPU instances.
PMuon is a new idea: github.com/zzp1012/modded-na…
AMUSE is Muon ScheduleFree: arxiv.org/abs/2605.22432v1
SODA is a wrapper that's more popular: arxiv.org/abs/2605.11172
Gram is a lesser known Gram-Newton-Schulz optimization: github.com/Dao-AILab/gram-ne…

1/ We introduce SODA: a simple optimizer wrapper that improves a base optimizer, adds no hyperparameters, and removes the need to tune weight decay. The wrapper provides consistent improvement. Most notably, SODA(Muon) beats Muon even when Muon gets a tuned weight decay sweep.
6/ There are a lot of interesting questions one can ask from this perspective, so please check out the paper!
Paper: arxiv.org/pdf/2605.11172
Code: github.com/tmpethick/soda_co…

7/ Thanks to @CevherLIONS for supporting it and getting Roman Macháček on board, and @WanyunXie for scaling up the experiments and debugging runs together – it’s always a joy