Wu Lin

Tweets

Pinned Tweet

Wu Lin @LinYorker

16 Jul 2024

#ICML2024 Can We Remove the Square-Root in Adaptive Methods? arxiv.org/abs/2402.03496 Root-free (RF) methods are better on CNNs and competitive on Transformers compared to root-based methods (AdamW). Removing the root makes matrix methods faster: root-free Shampoo in BFloat16. /1

12,798
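To make "removing the root" concrete, here is a minimal sketch of a single diagonal update with and without the square root. It assumes a plain RMSProp-style second-moment estimate; the function names, hyperparameter defaults, and the omission of momentum, weight decay, and any rescaling of the squared gradient are simplifications for illustration, not the algorithm from the linked paper.

```python
import numpy as np

def second_moment(s, grad, beta):
    """Exponential moving average of the squared gradient (shared by both variants)."""
    return beta * s + (1.0 - beta) * grad**2

def root_based_step(theta, grad, s, lr=1e-3, beta=0.99, eps=1e-8):
    """Hypothetical RMSProp/Adam-style step: divide by the square root of s."""
    s = second_moment(s, grad, beta)
    return theta - lr * grad / (np.sqrt(s) + eps), s

def root_free_step(theta, grad, s, lr=1e-3, beta=0.99, eps=1e-8):
    """Hypothetical root-free step: divide by s itself, with no square root."""
    s = second_moment(s, grad, beta)
    return theta - lr * grad / (s + eps), s

theta = np.array([1.0])
grad = np.array([2.0])
s = np.array([4.0])  # already equal to grad**2, so the average stays at 4

theta_root, _ = root_based_step(theta, grad, s)
theta_rf, _ = root_free_step(theta, grad, s)
```

With the same learning rate the two variants take steps of different size (here the root-free step is half as large), so in this sketch the two are not interchangeable without retuning.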

Wu Lin @LinYorker

Jun 12

On one hand, it is essential to tune baseline methods well on a model. On the other hand, it may be better to avoid using a model/architecture that has been modified and optimized for a single method for 1.5 years.

Konstantin Mishchenko

@konstmish

Jun 11

I just submitted a PR to modded-nanogpt with better hyperparams. With them, Muon can reach the target loss after 3250 steps instead of 3325. Always tune your baseline well when doing research. Weak baselines can make any idea look promising

1,169

Wu Lin @LinYorker

May 3

We will make Shampoo/SOAP, including KL-Shampoo/KL-SOAP, faster. Our goal is to match Muon's runtime while maintaining Shampoo/SOAP's strong per-step performance. Stay tuned for new updates.

Lucas Nestler

@Clashluke

May 2

KL Shampoo and KL SOAP outperform their non-KL counterparts by learning the preconditioners compositionally, so that each stage corrects what remains after the last. Available in HeavyBall 3.1.1, with major PSGD stability backports.

3,062

Wu Lin @LinYorker

May 28

Some initial steps to make Shampoo and SOAP faster: arxiv.org/abs/2605.26327. We are working on further improvements.

Reparametrizing Shampoo and SOAP for Subspace Basis Updates and...

Shampoo-based methods, such as KL-Shampoo and SOAP, have demonstrated strong performance in training neural networks and rely on QR decomposition. Because existing QR implementations require...

arxiv.org

Wu Lin @LinYorker

Apr 30

Replying to @_arohan_

@_arohan_ Muon

237

Wu Lin @LinYorker

Apr 30

Also, short-sided KL-Shampoo = short-sided Shampoo^2 = Muon

112
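The second equality above can be checked numerically under some assumptions. The sketch below reads "Shampoo^2" as a Shampoo-style preconditioner with inverse power 1/2 instead of 1/4, reads "short-sided" as preconditioning only the smaller dimension of the gradient, and uses a single full-rank gradient with no accumulation or momentum. Under those assumptions the update equals the orthogonal polar factor U V^T, which is the quantity Muon approximates. The function names are hypothetical, and the KL-Shampoo side of the identity is not covered here.

```python
import numpy as np

def inverse_sqrt_psd(A):
    """A^{-1/2} for a symmetric positive-definite matrix, via eigendecomposition."""
    w, Q = np.linalg.eigh(A)
    return (Q / np.sqrt(w)) @ Q.T  # Q diag(w^{-1/2}) Q^T

def short_sided_shampoo_squared(G):
    """Precondition only the shorter side of G with inverse power 1/2 (assumed reading)."""
    if G.shape[0] > G.shape[1]:
        # Tall matrix: work on the transpose so the Gram matrix stays small.
        return short_sided_shampoo_squared(G.T).T
    return inverse_sqrt_psd(G @ G.T) @ G

def orthogonalize(G):
    """Exact orthogonalization U V^T from the thin SVD G = U S V^T."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 7))  # a random gradient; full rank with probability one

same_wide = np.allclose(short_sided_shampoo_squared(G), orthogonalize(G))
same_tall = np.allclose(short_sided_shampoo_squared(G.T), orthogonalize(G.T))
```

The identity follows from G = U S V^T: (G G^T)^{-1/2} G = U S^{-1} U^T U S V^T = U V^T.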

Wu Lin @LinYorker

Apr 30

Wu Lin retweeted

Wu Lin @LinYorker

Apr 19

Replying to @weijie444

@weijie444 Looks like a KFAC-based method with modern clipping? G(ZZ^T)^{-1} is known as the FOOF update (arxiv.org/abs/2201.12250), while msgn() can be interpreted as "generalized (preconditioned) gradient norm clipping" (arxiv.org/abs/2506.01913).

Gradient Descent on Neurons and its Link to Approximate...

Second-order optimizers are thought to hold the potential to speed up neural network training, but due to the enormous size of the curvature matrix, they typically require approximations to be...

arxiv.org

Weijie Su

@weijie444

Apr 16

We released "The Newton-Muon Optimizer". We show that Muon is secretly an implicit Newton method, and use this insight to build a better one. 1/n Paper: arxiv.org/abs/2604.01472

5,644
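For readers unfamiliar with the G(ZZ^T)^{-1} form mentioned in the reply above, here is a minimal sketch for one linear layer. It assumes the layer computes W Z with weight W of shape (out, in), inputs Z of shape (in, batch), and gradient G with the same shape as W. The damping term, the absence of batch-size normalization, and the function name are assumptions added for illustration; see the linked FOOF paper for the actual method.

```python
import numpy as np

def foof_direction(G, Z, damping=1e-3):
    """Return G (Z Z^T + damping * I)^{-1} without forming the inverse explicitly.

    G: (out, in) gradient with respect to the weight matrix.
    Z: (in, batch) inputs to the layer.
    damping: hypothetical ridge term that keeps the matrix invertible.
    """
    A = Z @ Z.T + damping * np.eye(Z.shape[0])
    # A is symmetric, so X = G A^{-1} is the transpose of the solution of A X^T = G^T.
    return np.linalg.solve(A, G.T).T

rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 32))   # 5 input features, batch of 32
G = rng.standard_normal((3, 5))    # gradient for a 3 x 5 weight matrix
D = foof_direction(G, Z)
```

Using `np.linalg.solve` rather than `np.linalg.inv` is a common numerical choice here; the result is the same direction up to floating-point error.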

Wu Lin @LinYorker

3 Dec 2025

Within an information-geometric framework, we reconnect Shampoo/SOAP with both classical quasi-Newton ideas and Gaussian whitening, and develop practical methods that naturally handle tensor-valued weights in language model pre-training. arxiv.org/abs/2509.03378 opt-ml workshop

1,079

Wu Lin @LinYorker

3 Dec 2025

This work builds on my ICML 2019 paper (with @MarkSchmidtUBC and @EmtiyazKhan), extending a variational Bayes-based geometric framework to modern NN optimization. It can be used to design methods for Bayesian inference, numerical optimization, and gradient-free optimization.

125

Wu Lin @LinYorker

3 Dec 2025

This work is a joint effort with Scott C. Lowe, @f_dangel, @runame_, Zikun Xu, and @RogerGrosse. Stay tuned for more updates coming soon.

112

Wu Lin retweeted

Runa Eschenhagen @runame_

20 Nov 2025

1/9 In practice, the Shampoo optimizer crucially relies on several heuristics. In our NeurIPS 2025 spotlight paper, we investigate the role of learning rate grafting and infrequent preconditioner updates in Shampoo by decomposing its preconditioner. arxiv.org/abs/2506.03595

13,251

Wu Lin retweeted

Jonathan Lorraine @jonLorraine9

27 Nov 2024

Huge thanks to my amazing collaborators. This project was led by @juhanbae along with @LinYorker and @RogerGrosse. Supported (indirectly) by @Anthropic, @NVIDIA, @VectorInst, @UofTCompSci/@UofTArtSci/@UofT

532

Wu Lin retweeted

Thomas Möllenhoff @tmoellenhoff

10 Nov 2024

Are you LoRA fine-tuning LLMs and looking for easy ways to get improvements in accuracy? And also Bayesian uncertainty on top for free? Then check our recent work, accepted @neurips24fitml workshop! arxiv.org/abs/2411.04421

Variational Low-Rank Adaptation Using IVON

We show that variational learning can significantly improve the accuracy and calibration of Low-Rank Adaptation (LoRA) without a substantial increase in the cost. We replace AdamW by the Improved...

arxiv.org

7,435

Wu Lin @LinYorker

5 Oct 2024

Natural gradient descent: (steepest) gradient descent under a norm induced by the Fisher matrix (yorkerlin.github.io/posts/20…). Riemannian gradient descent (with geodesic retraction): gradient descent in Riemannian normal coordinates.

Frank Nielsen @FrnkNlsn

4 Oct 2024

At Maximum Likelihood Estimator:
Key property: observed Fisher information = Fisher information
2nd order Taylor expansion of likelihood:
- likelihood curvature = Fisher information
- radius of osculating circle = Variance of MLE for large sample size

4,124
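The "steepest descent under a Fisher-induced norm" reading of natural gradient descent can be written out in one line. The notation is an assumption added here, not taken from the tweets: loss \ell, step size \alpha, and a positive-definite Fisher matrix F(\theta_t).

```latex
\begin{align}
\theta_{t+1}
  &= \arg\min_{\theta}\;
     \nabla \ell(\theta_t)^\top (\theta - \theta_t)
     + \frac{1}{2\alpha}\,
       (\theta - \theta_t)^\top F(\theta_t)\, (\theta - \theta_t) \\
  &= \theta_t - \alpha\, F(\theta_t)^{-1} \nabla \ell(\theta_t)
\end{align}
```

Setting the gradient of the objective in the first line to zero gives \nabla \ell(\theta_t) + \frac{1}{\alpha} F(\theta_t)(\theta - \theta_t) = 0, which rearranges to the second line. Replacing F(\theta_t) with the identity recovers ordinary gradient descent.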

Wu Lin @LinYorker

6 Sep 2024

Some hardcore theory people complain that "second-order" methods in DL do not have a superlinear convergence rate. At the same time, they are happy to consider SGD a first-order method with only a sublinear rate.

127