Ajitesh Shukla retweeted

varun

@varunneal

21h

which is an easy way to use psgd internals to get softmuon -1/3, 0, 1/5, 1/3, ... (k-2)/(k+2) <=> shampoo -1/3, -1/4, -1/5, -1/6, ... -1/(k+2)

Omead Pooladzandi

@HessianFree

May 26

Replying to @varunneal @nilinabra

im p sure u can use the internals of psgd to get these as well

Ajitesh Shukla retweeted

varun

@varunneal

21h

Replying to @varunneal @evaninwords

in fact, whitening mode psgd w criteria c(P; g, v) = P[g,g] + P^{-1}[v,v] can be replaced w "generalized" form c_k(P; g, v) = P[g,g] + P^{-k}[v,v] which has fixed pt corresponding to spectral power (k-2)/(k+2).
SoftMuon p = shampoo (p-1)/4 = Whitening PSGD 2(1+p)/(1-p)
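A quick sanity check of the exponent bookkeeping in this thread, as a minimal sketch. It only verifies the arithmetic of the stated correspondence (SoftMuon power p = (k-2)/(k+2), Shampoo exponent (p-1)/4 = -1/(k+2), and k = 2(1+p)/(1-p)); it does not derive the fixed point of the criterion. The function names are hypothetical, chosen for illustration.

```python
from fractions import Fraction


def softmuon_power(k):
    """Spectral power (k-2)/(k+2) quoted for the generalized criterion c_k."""
    return Fraction(k - 2, k + 2)


def shampoo_exponent(p):
    """Shampoo exponent (p-1)/4 matched to SoftMuon power p."""
    return (p - 1) / 4


def psgd_k(p):
    """Generalized whitening-PSGD index k = 2(1+p)/(1-p); requires p != 1."""
    return 2 * (1 + p) / (1 - p)


for k in range(1, 7):
    p = softmuon_power(k)
    s = shampoo_exponent(p)
    # The two closed forms for the Shampoo exponent agree exactly.
    assert s == Fraction(-1, k + 2)
    # Mapping p back through 2(1+p)/(1-p) recovers k.
    assert psgd_k(p) == k
    print(f"k={k}: SoftMuon p={p}, Shampoo exponent={s}")
```

Exact rationals are used so the identities are checked without floating-point tolerance. The k=1 row is the ordinary whitening criterion, which is where the Shampoo -1/3 claim comes from.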

Ajitesh Shukla retweeted

varun

@varunneal

21h

Replying to @varunneal @evaninwords

but I'll go further and claim that whitening mode psgd corresponds to shampoo -1/3 (not -1/4)! This is a surprising result that I'm not sure has been shown elsewhere

Thomas Pethick @tmpethick

Replying to @tmpethick @konstmish @kellerjordan0 @jxbz @leloykun @_arohan_ @gowerrobert @vyasnikhil96 @SebastienBubeck @JohnCLangford @ed_gorbunov @tonysilveti @DimitrisPapail @damekdavis @bremen79

I might also add that it should be possible to draw similar conclusions for Shampoo and PSGD parameterizations for certain ranges by using @varunneal's observation:

varun

@varunneal

21h

Replying to @varunneal @evaninwords

Ajitesh Shukla retweeted

Plugyawn

@plugyawn

23h

PSGD in Megaprop happens by specifying the right-preconditioner and left-preconditioner: an optimizer "recipe". E.g., right-preconditioning with the feature gram retrieves FOOF. Composing with Muon retrieves Newton-Muon.
For Newton-Muon, in Megaprop, we pass:
--matrix-optimizer muon --matrix-input-preconditioner feature_gram
If you want a diagonal approximation of the feature_gram, we also pass:
--matrix-input-preconditioner-approximation diag
And more! (3/n)

Ajitesh Shukla retweeted

Plugyawn

@plugyawn

23h

Megaprop's PSGD implementation calculates preconditioning matrices along with the gradient, collecting and communicating X.T @ X and dY.T @ dY at the same time as we compute the gradient on the weights, dY.T @ X, and has first-class support for diagonal/block-diagonal approximations. (2/n)
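To make the three products concrete, here is a minimal single-device NumPy sketch. It is not Megaprop's code and involves no communication. It assumes a linear layer Y = X @ W.T, so the weight gradient is dY.T @ X; the shapes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 8, 4, 3  # illustrative sizes, not from the source

X = rng.standard_normal((batch, d_in))    # layer inputs (features)
dY = rng.standard_normal((batch, d_out))  # stand-in for the upstream gradient dL/dY

# Weight gradient for Y = X @ W.T, shape (d_out, d_in).
grad_W = dY.T @ X

# Gram statistics gathered from the same X and dY as the gradient.
input_gram = X.T @ X     # feature Gram, shape (d_in, d_in)
output_gram = dY.T @ dY  # output-gradient Gram, shape (d_out, d_out)

# Diagonal approximation: per-feature sums of squares, computed
# without materializing the full (d_in, d_in) matrix.
input_gram_diag = np.einsum("bi,bi->i", X, X)

print(grad_W.shape, input_gram.shape, output_gram.shape, input_gram_diag.shape)
```

Both Gram matrices reuse tensors that the backward pass already holds, which is why they can be gathered alongside the weight gradient.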

Plugyawn

@plugyawn

23h

PSGD ftw!

Plugyawn

@plugyawn

23h

Introducing: Megaprop: a library for efficient preconditioned optimization across GPUs! Megaprop is a fork of Megatron and TransformerEngine, with FSDP support for Muon, FOOF, KFAC, and Newton-Muon, with MuP support, for both width and depth... (1/n)

James MMatrix

@JamesWhate89993

Jun 13

Replying to @anujsapte

The fixation on memory efficiency is pure propaganda. For frontier runs, first-order optimizer wall-clock and memory overheads were never the core concern. At sufficient scale, with proper optimization, both are manageable. The issue is that SOAP, Shampoo, and even PSGD never received the spotlight they deserved, and efforts to optimize their performance were far too limited. This is the consequence of a single hype-maxxed narrative attracting most of the community's attention, at the expense of broader advancement.

rohan anil

@_arohan_

Jun 13

Replying to @giffmana @MillionInt

Yes. PSGD is not a single algorithm, so confirmation is sort of not useful, so wanted to make the point.

Alexi Gladstone

@AlexiGlad

Jun 13

psgd is the ebm equivalent of optimization

rohan anil

@_arohan_

Jun 12

People think I am a shampoo fan from my enthusiasm, but I have converted to the PSGD religion, where you learn the preconditioner.
