Joined July 2015
60 Photos and videos
Pinned Tweet
As of last week, I am no longer at NVIDIA 🧵 Leaving the CUTLASS team was extremely hard. I will dearly miss my incredible colleagues and the extremely compelling mission statement of creating the world's best accelerator programming model w/ hardware software codesign 💚
16
18
369
27,212
Vijay retweeted
May 22
After some mathematical rewriting, it turns out the whole transformer is a series of GEMM epilogues. Given a few optimized primitives, LLMs (and novice humans) can write speed-of-light kernels for all transformer ops!
LLM training is built on fast MatMuls. But many surrounding ops still run as memory-bound kernels. CODA reparameterizes them to hide in the matmul’s shadow, fused into its epilogue before results leave the chip. Bonus: LLMs can write fast CODA kernels too (approaching SoLs).
17
128
1,208
132,117
Vijay retweeted
LLM training is built on fast MatMuls. But many surrounding ops still run as memory-bound kernels. CODA reparameterizes them to hide in the matmul’s shadow, fused into its epilogue before results leave the chip. Bonus: LLMs can write fast CODA kernels too (approaching SoLs).
15
103
685
197,768
Vijay retweeted
We built a kernel abstraction to rewrite the entire transformer stack as GEMM Epilogue kernels! Neural net architectures such as transformers consist entirely of matrix multiplications and elementwise nonlinearities such as RMSNorm, log sum exp, and gated activations. Fusing these elementwise nonlinearities into GEMMs in both the forward and backward passes allows us to make training and prefill as compute-bound as possible! Our kernel abstraction CODA is implemented in CuTeDSL, and by abstracting away the fixed prologue and main loop of the GEMM kernel, we expose an epilogue function where LLMs like Claude can easily implement elementwise nonlinearities in fusions approaching speed-of-light!
LLM training is built on fast MatMuls. But many surrounding ops still run as memory-bound kernels. CODA reparameterizes them to hide in the matmul’s shadow, fused into its epilogue before results leave the chip. Bonus: LLMs can write fast CODA kernels too (approaching SoLs).
1
24
180
19,467
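The epilogue idea in the post above can be illustrated with a minimal NumPy sketch. This is not CODA or CuTeDSL code: `gemm_with_epilogue`, `silu`, and the tile size are hypothetical names chosen for illustration, and SiLU stands in for whichever elementwise nonlinearity is being fused. The tiled loop plays the role of the fixed main loop, and the `epilogue` callback is applied to each accumulator tile before it is written out, so no separate pass over the full output is needed.

```python
import numpy as np

def gemm_with_epilogue(a, b, epilogue, tile=32):
    """Tiled matmul that applies `epilogue` to each output tile before storing it."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.empty((m, n), dtype=np.float64)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            # Fixed "main loop": accumulate one output tile over the K dimension.
            acc = np.zeros((min(tile, m - i), min(tile, n - j)), dtype=np.float64)
            for p in range(0, k, tile):
                acc += a[i:i + tile, p:p + tile] @ b[p:p + tile, j:j + tile]
            # User-supplied epilogue runs while the tile is still "on chip".
            out[i:i + tile, j:j + tile] = epilogue(acc)
    return out

def silu(x):
    """Elementwise SiLU, used here only as an example epilogue."""
    return x / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
a = rng.standard_normal((70, 48))
b = rng.standard_normal((48, 50))

fused = gemm_with_epilogue(a, b, silu)   # nonlinearity applied per tile
unfused = silu(a @ b)                    # separate elementwise pass
```

On a GPU the benefit would come from skipping the extra round trip through memory for the elementwise op; in NumPy the sketch only shows that the fused and unfused results agree.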
Vijay retweeted
We’ve developed our own inference engine Runtime-Optimized Serving Engine (ROSE) to serve models ranging from embeddings to trillion-parameter LLMs. With CuTeDSL integrated into our inference engine, Perplexity can build the specialized GPU kernels faster to bring models up to peak performance on NVIDIA Hopper and Blackwell GPUs.
74
119
1,054
160,984
Vijay retweeted
new walk of shame: agent still working, but the cafe closed
263
188
5,459
603,490
Vijay retweeted
7 Nov 2025
[ENG SUB] how it feels to use eager pytorch in 2025
28
60
474
88,463
Vijay retweeted
Tomorrow: Blackwell Programming lecture by yours truly at Stanford CME213, Gates B3, 1:30–2:50 PM. Bring sharp questions.
6
9
136
7,514
Vijay retweeted
We're open-sourcing FlashKDA, our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. Achieves 1.72×–2.22× prefill speedup over the flash-linear-attention baseline on H20, and works as a drop-in backend for flash-linear-attention. Explore on github: github.com/MoonshotAI/FlashK…
45
183
1,811
213,823
Vijay retweeted
Meta released Avocado, they call it Muse Spark. It's not open source (a bit sad). Meta TBD lab rebuilt the entire pretraining stack in 9 months and reached similar capability with >10x less compute than Llama 4 Maverick. I still think infra is the real moat in AI labs. You can train models much faster with a good infra, and it allows researchers to experiment with many more ideas much more quickly.
41
30
647
54,031
Vijay retweeted
Excited to share what we’ve been building at Meta Superintelligence Labs! We just released Muse Spark, our first AI model. It's a natively multimodal reasoning model and the first step on our path to personal superintelligence. We've overhauled our entire stack to support scaling, and this is just the beginning. ai.meta.com/blog/introducing…
74
172
1,668
236,823
Vijay retweeted
1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵
744
1,192
10,371
4,551,614
Vijay retweeted
Very cool! I worked on this recently, and I actually used an identical approach early on. But I believe there is a significantly better approach - a **single** minimax rational iteration can beat 5 polynomial steps!
We made Muon run up to 2x faster for free! Introducing Gram Newton-Schulz: a mathematically equivalent but computationally faster Newton-Schulz algorithm for polar decomposition. Gram Newton-Schulz rewrites Newton-Schulz such that instead of iterating on the expensive rectangular X matrix, we iterate on the small, square, symmetric XX^T Gram matrix to reduce FLOPs. This allows us to make more use of fast symmetric GEMM kernels on Hopper and Blackwell, halving the FLOPs of each of those GEMMs. Gram Newton-Schulz is a drop-in replacement of Newton-Schulz for your Muon use case: we see validation perplexity preserved within 0.01, and share our (long!) journey stabilizing this algorithm and ensuring that training quality is preserved above all else. This was a super fun project with @noahamsel, @berlinchen, and @tri_dao that spanned theory, numerical analysis, and ML systems! Blog and codebase linked below 🧵
3
10
139
15,771
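The Gram rewrite described in the quoted post can be sketched in NumPy under some loudly stated assumptions: the coefficients below are the classic cubic Newton-Schulz ones (1.5, -0.5), not the tuned polynomial used with Muon, and the function names are hypothetical. The sketch relies on the identity that if X_{k+1} = p(G_k) X_k with G_k = X_k X_k^T, then G_{k+1} = p(G_k) G_k p(G_k), so the loop can run entirely on small square matrices and touch the rectangular X only at the start and end.

```python
import numpy as np

# Assumption: classic cubic Newton-Schulz coefficients, for illustration only.
A_COEF, B_COEF = 1.5, -0.5

def newton_schulz(x, steps):
    """Standard iteration: every step multiplies the rectangular matrix."""
    x = x / np.linalg.norm(x)  # Frobenius norm keeps singular values <= 1
    for _ in range(steps):
        g = x @ x.T
        x = A_COEF * x + B_COEF * (g @ x)
    return x

def gram_newton_schulz(x, steps):
    """Same iterates, but the loop only touches n x n matrices."""
    x = x / np.linalg.norm(x)
    n = x.shape[0]
    g = x @ x.T        # Gram matrix of the current iterate
    q = np.eye(n)      # accumulated transform so that X_k = q @ X_0
    for _ in range(steps):
        p = A_COEF * np.eye(n) + B_COEF * g
        q = p @ q
        g = p @ g @ p  # Gram matrix of the next iterate (p is symmetric)
    return q @ x       # single rectangular multiply at the end

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 64))  # few rows, many columns

ref = newton_schulz(x0, steps=5)
gram = gram_newton_schulz(x0, steps=5)
```

This sketch uses plain matmuls in float64; the symmetric GEMM kernels and the low-precision stabilization work mentioned in the post are not modeled here.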
Vijay retweeted
Trtllmgen kernels are now open. Fastest prefill and decode kernels for our target workloads. We wrote these to win InferenceX, MLPerf, other benchmarks. Powering some of today’s top served models. Dive in, learn, use them, or level up your own. Enjoy. github.com/flashinfer-ai/fla…
13
51
334
148,657
Vijay retweeted
In my opinion, here are the most important ideas of CuTe Layouts (arxiv.org/pdf/2603.02298) 🧵

3
26
250
15,645
Vijay retweeted
Mar 17
The frontier has increasingly shifted to hybrid models - from Qwen to Kimi-Linear and now with NVIDIA's Nemotron-3 Super - that rely on a strong linear sequence model. Today we release Mamba-3, the most powerful linear model to date. x.com/_albertgu/status/20339…

The newest model in the Mamba series is finally here 🐍 Hybrid models have become increasingly popular, raising the importance of designing the next generation of linear models. We've introduced several SSM-centric ideas to significantly increase Mamba-2's modeling capabilities without compromising on speed. The resulting Mamba-3 model has noticeable performance gains over the most popular previous linear models (such as Mamba-2 and Gated DeltaNet) at all sizes. This is the first Mamba that was student led: all credit to @aakash_lahoti @kevinyli_ @_berlinchen @caitWW9, and of course @tri_dao!
11
113
842
78,302
Vijay retweeted
Excited to share @Standard_Kernel's seed round and some reflections on what we’ve learned about kernel generation and what we believe is next. Grateful to our amazing team, supporters, and the broader community pushing this space forward.
48
45
519
134,928
Vijay retweeted
The release of the FA4 paper is a good opportunity to highlight our paper (link below) on automatically finding optimal pipelines and warp specialization (WS) for these kernels. Twill uses SMT solvers to derive the FA3 and 4 fwd pass pipelining and WS strategies mechanically 1/n
1
12
95
4,633
Vijay retweeted
FlexAttention now has a FlashAttention-4 backend. FlexAttention has enabled researchers to rapidly prototype custom attention variants, with 1000 repos adopting it and dozens of papers citing it. But users consistently hit a performance ceiling. Until now. We've added a FlashAttention-4 backend to FlexAttention on Hopper and Blackwell GPUs. PyTorch now auto-generates CuTeDSL score/mask modifications and JIT-instantiates FlashAttention-4 for your custom attention variant. The result: 1.2× to 3.2× speedups over Triton on compute-bound workloads. 🖇️ Read our latest blog here: hubs.la/Q045FHPh0 No more choosing between flexibility and performance. #PyTorch #FlexAttention #FlashAttention #OpenSourceAI
12
98
733
101,019
Vijay retweeted
SemiAnalysis x Fluidstack are kicking off GTC with a Full-Stack AI Infra GPU Hackathon: Power to Prefill, Dirt to Decode. Build with the best, win prizes, and hear @marksaroufim (GPU MODE), @cHHillee (Thinking Machines), Thomas Raoux (OpenAI), @garywu. Apply below: luma.com/SAxFSHack
7
8
89
20,265