Sam Acquaviva


Tweets

Sam Acquaviva

@Sam_Acqua

Jun 4

This is cool but disingenuous framing imo. All of the records were marginal, building on other records, so “out-outperformed all 1,016 other researchers” is a stretch lol. I’m curious what this system’s performance would be if it couldn’t build on human records.

Zhengyao Jiang

@zhengyaojiang

Jun 3

OpenAI ran a hiring challenge, but the top candidate was one they couldn’t hire: our autonomous research agent, Aiden. In Parameter Golf, Aiden ran for 22 days, and out-outperformed all 1,016 other researchers: 🧵 (1/8)

Sam Acquaviva

@Sam_Acqua

May 19

Great work from Oscar on scaling up flow models!

Oscar Davis

@osclsd

May 19

Replying to @osclsd

🚨 Before concluding: As noted by @Sam_Acqua and many others, we all ought to be very skeptical of Gen PPL as a metric, especially in isolation. ❌ It is actually a bit crazy that we have been using it for so long. Hence, the additional metrics, and the presence of several qualitative samples in the appendix. Please have a look yourselves to get a better understanding of the sample quality! 🔍 There are comparisons across SD/non-SD, number of NFEs, and others.

Sam Acquaviva

@Sam_Acqua

May 13

Flow models are a promising alternative to autoregression. But the current standard of evaluating flow models is broken. The reported 3x improvement in 1024-step PPL since 2023 is closer to 1.1x if you control for sample entropy. (1/12)
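
To make the point about controlling for entropy concrete, here is a minimal sketch of the two quantities involved. It assumes generative perplexity is the exponential of the mean per-token negative log-likelihood under some reference model, and it uses per-sample unigram entropy as the entropy measure. The thread does not say which entropy estimator was used, so that choice, the function names, and the toy numbers are all illustrative assumptions.

```python
import math
from collections import Counter


def generative_perplexity(token_logprobs):
    """exp(mean negative log-likelihood per token), natural log.

    token_logprobs: one list per sample, holding the log-probability a
    reference model assigned to each generated token (assumed given).
    """
    n_tokens = sum(len(seq) for seq in token_logprobs)
    total = sum(lp for seq in token_logprobs for lp in seq)
    return math.exp(-total / n_tokens)


def mean_sample_entropy(samples):
    """Mean unigram entropy (nats) of token frequencies within each sample.

    This estimator is an assumption made for illustration only.
    """
    entropies = []
    for tokens in samples:
        counts = Counter(tokens)
        n = len(tokens)
        entropies.append(-sum(c / n * math.log(c / n) for c in counts.values()))
    return sum(entropies) / len(entropies)


# Toy data (made up): a repetitive sampler and a more diverse one.
repetitive = [[7, 7, 7, 7], [3, 3, 3, 3]]
diverse = [[1, 2, 3, 4], [5, 6, 7, 8]]

# Hypothetical reference-model log-probs: repetition is often scored as likely.
repetitive_lp = [[math.log(0.9)] * 4, [math.log(0.9)] * 4]
diverse_lp = [[math.log(0.5)] * 4, [math.log(0.5)] * 4]

for name, toks, lps in [("repetitive", repetitive, repetitive_lp),
                        ("diverse", diverse, diverse_lp)]:
    print(name, generative_perplexity(lps), mean_sample_entropy(toks))
```

In this toy setup the repetitive sampler gets the lower perplexity and also the lower entropy, which is why a perplexity number reported without the entropy of the samples can overstate progress.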

Sam Acquaviva

@Sam_Acqua

May 13

One note: the main results here use 1024 samples while the main flow model results are << 1024 samples. I chose this to make comparison with diffusions easier and to make the point about the framework, not a given paper. I love flows.

Sam Acquaviva

@Sam_Acqua

May 13

Thanks to valuable discussions from @Chramblin, @nmboffi, @ReeceShuttle, @akshayvegesna, Samir, and other friends :)

Sam Acquaviva retweeted

Keller Jordan

@kellerjordan0

May 12

Modded-NanoGPT optimization result #14 (2026/05/04): @Sam_Acqua has achieved a new record of 3150 steps (-60), by adding SOAP preconditioning before Muon orthogonalization for the MLP weights (SOAP-Muon).

Sam Acquaviva

@Sam_Acqua

May 8

👀

Tilde

@tilderesearch

May 7

👀 Aurora dropping tomorrow. 3175 steps → beating NanoGPT Track 3 SOTA by 50 steps. And it scales 🚀

Sam Acquaviva

@Sam_Acqua

May 8

my pr: github.com/KellerJordan/modd…
aurora pr: github.com/KellerJordan/modd…
In all fairness, although my submission is 25-50 steps more efficient, Aurora is more efficient in wall clock time.

New record: Track_3_optimization: Add SOAP preconditioning to MLPs (3150, -75 steps) by samacqua ·...

Contra Muon SOAP MLP pre-conditioning Before this change, all hidden 2D matrices (attention Q/K/V/O and MLP fc/proj) followed the same path: Nesterov momentum → Newton-Schulz orthogonalization →...

github.com
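
For readers unfamiliar with the baseline path the PR description mentions, the Newton-Schulz orthogonalization step can be sketched as below. This is a pure-Python illustration, not the PR's code: it covers only the orthogonalization of a momentum matrix and does not attempt the SOAP preconditioning, whose details are not given in the excerpt above. The quintic coefficients are the ones used in the public Muon reference implementation as best I recall, so treat them, the function names, and the toy input as assumptions.

```python
def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize g with a quintic Newton-Schulz iteration.

    Each step maps every singular value s to a*s + b*s**3 + c*s**5, pushing
    singular values toward (not exactly onto) 1.
    """
    a, b, c = 3.4445, -4.7750, 2.0315  # assumed Muon coefficients
    tall = len(g) > len(g[0])
    x = transpose(g) if tall else [row[:] for row in g]  # keep X X^T small
    norm = sum(v * v for row in x for v in row) ** 0.5 + eps
    x = [[v / norm for v in row] for row in x]  # singular values now <= 1
    for _ in range(steps):
        gram = matmul(x, transpose(x))   # A = X X^T
        gram2 = matmul(gram, gram)       # A^2
        poly = [[b * p + c * q for p, q in zip(r1, r2)]
                for r1, r2 in zip(gram, gram2)]
        px = matmul(poly, x)             # (b*A + c*A^2) X
        x = [[a * v + w for v, w in zip(r1, r2)] for r1, r2 in zip(x, px)]
    return transpose(x) if tall else x


# Toy momentum matrix (made up): singular values 3 and 4.
update = newton_schulz([[3.0, 0.0], [0.0, 4.0]])
print(update)
```

After the iteration both diagonal entries land near 1 even though the inputs differed, which is the sense in which the update is "orthogonalized": the direction of the matrix is kept while its singular values are roughly equalized.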

Sam Acquaviva

@Sam_Acqua

Apr 20

yes 100% I thought it could also be related to paired-head attention, but I ran using random pairings and also splitting the 6 heads into 2 groups of 3, and perf was ~equal. So yes, it is just a better balance btwn square-ish per-head norm

You Jiacheng

@YouJiacheng

Apr 20

split → better high aspect ratio → worse somewhat balance?
