Kwangjun Ahn retweeted

Seunghyun Seo @SeunghyunSEO7

13 Nov 2025

scaled up to 12.7B dense, 5.5T tokens.
- polynorm (optimized kernel)
- grouped diff attn (their work)
- parallel muonclip (adopt alltoall like mainhorse, essential, dion)
- 80M batch
it's still non-reasoning, also not moe though... keep pushing guys! arxiv.org/abs/2511.07464

Seunghyun Seo @SeunghyunSEO7

24 Aug 2025

they also dropped fsdp2 optimized muon. though they don't use muon for 2.6b dense model, i think it's just beginning and they are preparing larger one. they pipeline muon's comm-comp with calc flops and the code is neat. not sure if it's existing method. huggingface.co/Motif-Technol…

Kwangjun Ahn @KwangjunA

29 Sep 2025

New improvement in Dion leads to a speedup that makes orthonormal updates (e.g., Muon) more scalable for larger matrices. The trick: carefully using Newton-Schulz (on smaller matrices) as Dion's backend. Updates to our microsoft/dion codebase are coming soon---stay tuned!

Kwangjun Ahn retweeted

Microsoft Research @MSFTResearch

11 Sep 2025

Join us on Sept 24 at 8 AM PT for Microsoft Research Forum Season 2 – a virtual series highlighting purposeful research and its real-world impact, from fundamental exploration to advancing AI responsibly, scaling innovation through products and open source, and driving positive change for society. Register now: msft.it/6011scy27

Kwangjun Ahn retweeted

Andrej Karpathy @karpathy

3 Aug 2025

Replying to @jxbz

love the repo! clean code, good practices but still not overly over-engineered, triton kernels, well documented, simple reference implementations alongside optimized code. nice

Kwangjun Ahn retweeted

elie @eliebakouch

3 Aug 2025

Lot, lot of alpha here

Jeremy Bernstein @jxbz

3 Aug 2025

I had wondered why there was no official Dion implementation by the authors... I guess now we know. This repository looks dope: FSDP Muon and Dion implementations, triton kernels for Newton-Schulz, and lots of practical advice (1/2)

Kwangjun Ahn retweeted

Jeremy Bernstein @jxbz

3 Aug 2025

Looks like extremely exciting and useful work by @KwangjunA, Byron Xu, Natalie Abreu, @JohnCLangford and @GagMagakyan github.com/microsoft/dion/ (2/2)

Kwangjun Ahn retweeted

Jeremy Bernstein @jxbz

3 Aug 2025

Kwangjun Ahn retweeted

Laker Newhouse @LakerNewhouse

21 Jul 2025

[1/6] Curious about Muon, but not sure where to start? I wrote a 3-part blog series called “Understanding Muon” designed to get you up to speed—with The Matrix references, annotated source code, and thoughts on where Muon might be going.

Kwangjun Ahn retweeted

John Langford @JohnCLangford

20 Jul 2025

Apparently Dion is now being worked on for Torch Titan: github.com/pytorch/torchtita… :-)

[WIP][Optimizers] Unofficial implementation of DION optimizer - DIstributed OrthoNormal updates by...

This PR: 1 - Implements the new DION optimizer based on the paper: "Dion: Distributed Orthonormalized Updates" by Ahn et al. https://arxiv.org/abs/2504.05295 DION follows Muon reg...

github.com

Mikhail Parakhin @MParakhin

18 Jul 2025

Since nobody asked :-), here is my list of papers not to be missed from ICML: 1) Dion: distributed orthonormalized updates (well, technically not at ICML, but everyone's talking about it). 2) MARS: Unleashing the Power of Variance Reduction for Training Large Models 3) ...

Kwangjun Ahn retweeted

Mikhail Parakhin @MParakhin

18 Jul 2025

Kwangjun Ahn retweeted

Seungwook Han @seungwookh

16 Jul 2025

But actually this is the og way of doing it and you should stop by E-2103 to see @jxbz and Laker Newhouse whiteboard the whole paper.

Jeremy Bernstein @jxbz

16 Jul 2025

Laker and I are presenting this work in an hour at ICML poster E-2103. It’s on a theoretical framework and language (modula) for optimizers that are fast (like Shampoo) and scalable (like muP). You can think of modula as Muon extended to general layer types and network topologies

Kwangjun Ahn retweeted

Konstantin Mishchenko @konstmish

15 Jul 2025

Schedule-Free methods, which forgo cosine/linear schedulers by averaging iterates and computing gradients at interpolated points, yield smoother training curves. It's still unclear why they work well, and this paper explains the phenomenon through the river-valley loss landscape.

Kwangjun Ahn retweeted

Gagik Magakyan @GagMagakyan

15 Jul 2025

If you are at ICML 2025, come check out our oral presentation about the non-convex theory of Schedule Free SGD in the Optimization session tomorrow! This work was done with amazing collaborators @KwangjunA and @AshokCutkosky.

Kwangjun Ahn @KwangjunA

15 Jul 2025

ICML: come check out our Oral Presentation on Schedule-free training theory based on an elegant online learning!

Kwangjun Ahn retweeted

Jeremy Bernstein @jxbz

21 Jun 2025

Replying to @eliebakouch @noahamsel @gowerrobert

and also Dion by @KwangjunA, @JohnCLangford et al arxiv.org/abs/2504.05295

Dion: Distributed Orthonormalized Updates

Orthonormalized updates accelerate training, improve stability, and enable robust hyperparameter transfer, but existing methods like Muon rely on dense matrix operations that clash with sharded...

arxiv.org

Kwangjun Ahn retweeted

You Jiacheng @YouJiacheng

12 Jun 2025

Replying to @kvfrans @depen_morwani @KwangjunA @vyasnikhil96

Oh I found them: linear warmup and then constant

Kwangjun Ahn @KwangjunA

23 Apr 2025

ICLR: @edward_s_hu and I will be presenting our work "The Belief State Transformer" at the 1st poster session. (#269) Please come check it out! (github: github.com/microsoft/BST)

Kwangjun Ahn retweeted

John Langford @JohnCLangford

21 Apr 2025

The Belief State Transformer edwardshu.com/bst-website/ is at ICLR this week. The BST objective efficiently creates compact belief states: summaries of the past sufficient for all future predictions. See the short talk: microsoft.com/en-us/research… and @mgostIH for further discussion.

Kwangjun Ahn retweeted

John Langford @JohnCLangford

23 Sep 2024

New reqs for low to high level researcher positions: jobs.careers.microsoft.com/g… , jobs.careers.microsoft.com/g…, jobs.careers.microsoft.com/g…, jobs.careers.microsoft.com/g…, with postdocs from Akshay and @MiroDudik x.com/MiroDudik/status/18367… . Please apply or pass to those who may :-)

Kwangjun Ahn retweeted

John Langford @JohnCLangford

23 Sep 2024

Last year, we had offers accepted from @KwangjunA, @riashatislam, @Tea_Pearce, @pratyusha_PS while Akshay and @MiroDudik hired 7(!) postdocs.

John Langford @JohnCLangford

23 Sep 2024
