Ajitesh Shukla retweeted
which is an easy way to use psgd internals to get softmuon -1/3, 0, 1/5, 1/3, ..., (k-2)/(k+2) <=> shampoo -1/3, -1/4, -1/5, -1/6, ..., -1/(k+2)

im p sure u can use the internals of psgd to get these as well
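A quick way to check the two sequences above against each other, as a minimal sketch. The function names are made up for illustration; only the formulas (k-2)/(k+2), -1/(k+2), and the conversion (p-1)/4 come from the posts.

```python
from fractions import Fraction

def softmuon_power(k: int) -> Fraction:
    """Spectral power (k-2)/(k+2) for generalized order k (hypothetical helper name)."""
    return Fraction(k - 2, k + 2)

def shampoo_exponent(k: int) -> Fraction:
    """Matching Shampoo exponent -1/(k+2) (hypothetical helper name)."""
    return Fraction(-1, k + 2)

for k in range(1, 5):
    p = softmuon_power(k)
    # The two parameterizations should agree through shampoo = (p - 1) / 4.
    assert shampoo_exponent(k) == (p - 1) / 4
    print(k, p, shampoo_exponent(k))
```

Exact fractions avoid any floating-point doubt about whether 2/6 and 1/3 are the same entry in the list.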
Ajitesh Shukla retweeted
in fact, whitening mode psgd w criteria c(P; g, v) = P[g,g] + P^{-1}[v,v] can be replaced w "generalized" form c_k(P; g, v) = P[g,g] + P^{-k}[v,v] which has fixed pt corresponding to spectral power (k-2)/(k+2). SoftMuon p = shampoo (p-1)/4 = Whitening PSGD 2(1+p)/(1-p)
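The three parameterizations in that last line are related by plain algebra. A short worked version follows, where p is the SoftMuon spectral power and k the order of the generalized criterion; the symbol e for the Shampoo exponent is a label introduced here, not from the post.

```latex
\begin{align*}
p &= \frac{k-2}{k+2}, \\
e &= \frac{p-1}{4} = \frac{1}{4}\cdot\frac{(k-2)-(k+2)}{k+2} = -\frac{1}{k+2}, \\
k &= \frac{2(1+p)}{1-p} \quad \text{(from solving } p\,(k+2) = k-2 \text{ for } k\text{)}.
\end{align*}
```

For k = 1 this gives p = -1/3 and e = -1/3. It only checks that the conversions are mutually consistent; it does not derive the fixed point of c_k itself.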
Ajitesh Shukla retweeted
but I'll go further and claim that whitening mode psgd corresponds to shampoo -1/3 (not -1/4)! This is a surprising result that I'm not sure has been shown elsewhere
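For context on what the exponents mean, here is a small numerical sketch. It assumes the idealized single-gradient case, where the Shampoo statistics are just L = G Gᵀ and R = Gᵀ G with no accumulation. In that case applying L^e G R^e turns each singular value s of G into s^(1+4e), so -1/4 gives power 0 (orthogonalization) and -1/3 gives power -1/3. It does not verify the PSGD claim itself; the helper names are hypothetical.

```python
import numpy as np

def sym_power(M, e):
    """Real power of a symmetric positive-definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * w**e) @ V.T  # V diag(w^e) V^T

def shampoo_like(G, e):
    """L^e G R^e with L = G G^T, R = G^T G (single-gradient assumption)."""
    return sym_power(G @ G.T, e) @ G @ sym_power(G.T @ G, e)

# Build G with known singular values so the check is well conditioned.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
s = np.array([3.0, 2.0, 1.0, 0.5])
G = U @ np.diag(s) @ V.T

for e in (-1 / 4, -1 / 3):
    out = np.linalg.svd(shampoo_like(G, e), compute_uv=False)
    expected = np.sort(s ** (1 + 4 * e))[::-1]  # svd returns descending order
    assert np.allclose(out, expected)
```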
I might also add that it should be possible to draw similar conclusions for Shampoo and PSGD parameterizations for certain ranges by using @varunneal's observation:

in fact, whitening mode psgd w criteria c(P; g, v) = P[g,g] + P^{-1}[v,v] can be replaced w "generalized" form c_k(P; g, v) = P[g,g] + P^{-k}[v,v] which has fixed pt corresponding to spectral power (k-2)/(k+2). SoftMuon p = shampoo (p-1)/4 = Whitening PSGD 2(1+p)/(1-p)
Ajitesh Shukla retweeted
PSGD in Megaprop happens by specifying the right-preconditioner and left-preconditioner: an optimizer "recipe". For example, right-preconditioning with the feature gram retrieves FOOF. Composing with Muon retrieves Newton-Muon. For Newton-Muon, in Megaprop, we pass:
--matrix-optimizer muon --matrix-input-preconditioner feature_gram
If you want a diagonal approximation of the feature_gram, we also pass:
--matrix-input-preconditioner-approximation diag
And more! (3/n)
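To make "right-preconditioning with the feature gram" concrete, here is a rough NumPy sketch. It is not Megaprop's API: the function name, the damping term, and the normalization by batch size are all assumptions made for illustration.

```python
import numpy as np

def right_precondition(grad_W, X, damping=1e-3, approximation="full"):
    """Right-multiply grad_W (d_out x d_in) by a damped inverse feature gram.

    Hypothetical helper. X is the (n x d_in) batch of layer inputs; the
    damping and the 1/n scaling are assumptions, not taken from the posts.
    """
    n, d_in = X.shape
    if approximation == "diag":
        # Keep only the diagonal of X.T @ X / n, so the inverse is elementwise.
        gram_diag = (X * X).sum(axis=0) / n
        return grad_W / (gram_diag + damping)
    gram = X.T @ X / n + damping * np.eye(d_in)
    # grad_W @ inv(gram), computed with a solve; gram is symmetric.
    return np.linalg.solve(gram, grad_W.T).T
```

In the recipe described above, the result would then be handed to the matrix optimizer (Muon, in the Newton-Muon case); that step is left out here.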
Ajitesh Shukla retweeted
Megaprop's PSGD implementation calculates preconditioning matrices along with the gradient, collecting and communicating X.T @ X and dY.T @ dY at the same time we do the gradient on the weights: dY.T @ X, and has first-class support for diagonal/block-diag approximations. (2/n)
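A single-device sketch of the three products named above, with the diagonal and block-diagonal variants. The function names and the `block_size` parameter are hypothetical, and the cross-GPU communication step is omitted entirely.

```python
import numpy as np

def gram(A, approximation="full", block_size=2):
    """Gram statistic A.T @ A, or a diagonal / block-diagonal approximation of it."""
    if approximation == "diag":
        return (A * A).sum(axis=0)  # diagonal of A.T @ A
    if approximation == "block_diag":
        # List of diagonal blocks of A.T @ A, one per contiguous column group.
        return [A[:, i:i + block_size].T @ A[:, i:i + block_size]
                for i in range(0, A.shape[1], block_size)]
    return A.T @ A

def backward_with_stats(X, dY, approximation="full"):
    """Weight gradient dY.T @ X plus input and output gram statistics.

    X: (n x d_in) layer inputs, dY: (n x d_out) output gradients.
    """
    grad_W = dY.T @ X
    return grad_W, gram(X, approximation), gram(dY, approximation)
```

All three products read the same X and dY, which is presumably why it is convenient to compute them together in the backward pass.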
PSGD ftw!
Introducing: Megaprop: a library for efficient preconditioned optimization across GPUs! Megaprop is a fork of Megatron and TransformerEngine, with FSDP support for Muon, FOOF, KFAC, and Newton-Muon, with MuP support, for both width and depth... (1/n)
Replying to @anujsapte
The fixation on memory efficiency is pure propaganda. For frontier runs, first-order optimizer wall-clock and memory overheads were never the core concern. At sufficient scale, with proper optimization, both are manageable. The issue is that SOAP, Shampoo, and even PSGD never received the spotlight they deserved, and efforts to optimize their performance were far too limited. This is the consequence of a single hype-maxxed narrative attracting most of the community's attention, at the expense of broader advancement.
Yes. PSGD is not a single algorithm, so confirmation is sort of not useful, so wanted to make the point.
psgd is the ebm equivalent of optimization
People think I am a shampoo fan from my enthusiasm, but I have converted to the PSGD religion, where you learn the preconditioner.