Vijay

Tweets

Pinned Tweet

Vijay @__tensorcore__

Jan 23

As of last week, I am no longer at NVIDIA 🧵 Leaving the CUTLASS team was extremely hard. I will dearly miss my incredible colleagues and the extremely compelling mission statement of creating the world's best accelerator programming model w/ hardware software codesign 💚

Vijay retweeted

Tri Dao

@tri_dao

May 22

After some mathematical rewriting, it turns out the whole transformer is a series of GEMM epilogues. Given a few optimized primitives, LLMs (and novice humans) can write speed-of-light kernels for all transformer ops!

Han Guo

@HanGuo97

May 21

LLM training is built on fast MatMuls. But many surrounding ops still run as memory-bound kernels. CODA reparameterizes them to hide in the matmul’s shadow, fused into its epilogue before results leave the chip. Bonus: LLMs can write fast CODA kernels too (approaching SoLs).

Vijay retweeted

Han Guo

@HanGuo97

May 21

Vijay retweeted

Jack Zhang

@jcz42

May 21

We built a kernel abstraction to rewrite the entire transformer stack as GEMM Epilogue kernels! Neural net architectures such as transformers consist entirely of matrix multiplications and elementwise nonlinearities such as RMSNorm, log sum exp, and gated activations. Fusing these elementwise nonlinearities into GEMMs in both the forward and backward passes allows us to make training and prefill as compute-bound as possible! Our kernel abstraction CODA is implemented in CuTeDSL, and by abstracting away the fixed prologue and main loop of the GEMM kernel, we expose an epilogue function where LLMs like Claude can easily implement elementwise nonlinearities in fusions approaching speed-of-light!

Han Guo

@HanGuo97

May 21
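
The epilogue-fusion idea in the tweet above can be illustrated with a small NumPy sketch. This is not CODA or CuTeDSL code: `gemm_with_epilogue`, `silu`, and the tile size are hypothetical names and values chosen for illustration, and SiLU stands in for the gated activations the tweet mentions. The tiled loop plays the role of the fixed main loop, and the caller supplies only the epilogue function that runs on each output tile before it is stored.

```python
import numpy as np

def gemm_with_epilogue(a, b, epilogue, tile=2):
    """Tiled matmul whose output tiles pass through `epilogue` before being stored."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.empty((m, n), dtype=np.float64)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            # Fixed main loop: accumulate one output tile over the K dimension.
            acc = np.zeros((min(tile, m - i), min(tile, n - j)))
            for p in range(0, k, tile):
                acc += a[i:i + tile, p:p + tile] @ b[p:p + tile, j:j + tile]
            # Epilogue: transform the accumulator, then write the tile once.
            out[i:i + tile, j:j + tile] = epilogue(acc)
    return out

def silu(x):
    """Elementwise SiLU, used here as an example nonlinearity."""
    return x / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 6))
b = rng.standard_normal((6, 5))

fused = gemm_with_epilogue(a, b, silu)   # nonlinearity applied per tile
unfused = silu(a @ b)                    # separate pass over the full output
```

Both paths compute the same values; the difference on a GPU would be that the fused path avoids a second trip through memory for the elementwise step.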

Vijay retweeted

Perplexity

@perplexity_ai

May 6

We’ve developed our own inference engine Runtime-Optimized Serving Engine (ROSE) to serve models ranging from embeddings to trillion-parameter LLMs. With CuTeDSL integrated into our inference engine, Perplexity can build the specialized GPU kernels faster to bring models up to peak performance on NVIDIA Hopper and Blackwell GPUs.

Vijay retweeted

resham ☻ @Reshusaur

Apr 28

new walk of shame: agent still working, but the cafe closed

Vijay retweeted

tender

@tenderizzation

7 Nov 2025

[ENG SUB] how it feels to use eager pytorch in 2025

Vijay retweeted

Alex Zhurkevich @cudagdb

Apr 23

Tomorrow: Blackwell Programming lecture by yours truly at Stanford CME213, Gates B3, 1:30–2:50 PM. Bring sharp questions.

Vijay retweeted

Kimi.ai

@Kimi_Moonshot

Apr 21

We're open-sourcing FlashKDA — our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. Achieves 1.72×–2.22× prefill speedup over the flash-linear-attention baseline on H20, and works as a drop-in backend for flash-linear-attention. Explore on github: github.com/MoonshotAI/FlashK…

GitHub - MoonshotAI/FlashKDA: FlashKDA: high-performance Kimi Delta Attention kernels

Vijay retweeted

Yuchen Jin

@Yuchenj_UW

Apr 8

Meta released Avocado; they call it Muse Spark. It's not open source (a bit sad). Meta TBD lab rebuilt the entire pretraining stack in 9 months and reached similar capability with >10x less compute than Llama 4 Maverick. I still think infra is the real moat in AI labs. You can train models much faster with good infra, and it allows researchers to experiment with many more ideas much more quickly.

Vijay retweeted

Shengjia Zhao

@shengjia_zhao

Apr 8

Excited to share what we’ve been building at Meta Superintelligence Labs! We just released Muse Spark, our first AI model. It's a natively multimodal reasoning model and the first step on our path to personal superintelligence. We've overhauled our entire stack to support scaling, and this is just the beginning. ai.meta.com/blog/introducing…

Vijay retweeted

Alexandr Wang

@alexandr_wang

Apr 8

1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵

Vijay retweeted

Ji-Ha @Ji_Ha_Kim

Mar 31

Very cool! I worked on this recently, and I actually used an identical approach early on. But I believe there is a significantly better approach - a **single** minimax rational iteration can beat 5 polynomial steps!

Jack Zhang

@jcz42

Mar 30

We made Muon run up to 2x faster for free! Introducing Gram Newton-Schulz: a mathematically equivalent but computationally faster Newton-Schulz algorithm for polar decomposition. Gram Newton-Schulz rewrites Newton-Schulz such that instead of iterating on the expensive rectangular X matrix, we iterate on the small, square, symmetric XX^T Gram matrix to reduce FLOPs. This allows us to make more use of fast symmetric GEMM kernels on Hopper and Blackwell, halving the FLOPs of each of those GEMMs. Gram Newton-Schulz is a drop-in replacement of Newton-Schulz for your Muon use case: we see validation perplexity preserved within 0.01, and share our (long!) journey stabilizing this algorithm and ensuring that training quality is preserved above all else. This was a super fun project with @noahamsel, @berlinchen, and @tri_dao that spanned theory, numerical analysis, and ML systems! Blog and codebase linked below 🧵
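
The rewrite described in the tweet above can be sketched in NumPy under some loud assumptions: this uses the classic cubic Newton-Schulz update (coefficients 1.5 and -0.5) rather than whatever polynomial the authors or Muon actually use, it runs in float64, and it omits the stabilization work the tweet refers to. The function names are hypothetical. The point is only that the iterate can be written as X_k = Q_k X_0, so the loop needs nothing but small square products on the Gram matrix.

```python
import numpy as np

def newton_schulz(x, steps):
    """Cubic Newton-Schulz applied directly to the rectangular matrix."""
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x

def gram_newton_schulz(x, steps):
    """Same iteration, tracked through the small Gram matrix G = X X^T."""
    m = x.shape[0]
    g = x @ x.T            # m x m, symmetric
    q = np.eye(m)          # accumulated left factor, so X_k = Q_k X_0
    for _ in range(steps):
        a = 1.5 * np.eye(m) - 0.5 * g
        q = a @ q          # X_{k+1} = A_k X_k, folded into Q
        g = a @ g @ a      # Gram matrix of the updated iterate
    return q @ x           # one rectangular product at the end

# Illustrative 4 x 16 input with known singular values in (0, 1].
rng = np.random.default_rng(0)
u, _ = np.linalg.qr(rng.standard_normal((4, 4)))
v, _ = np.linalg.qr(rng.standard_normal((16, 4)))   # orthonormal columns
x0 = u @ np.diag([1.0, 0.8, 0.5, 0.3]) @ v.T

direct = newton_schulz(x0, steps=15)
via_gram = gram_newton_schulz(x0, steps=15)
```

Inside the loop, every product in the Gram version is 4 x 4, while the direct version multiplies against the 4 x 16 matrix on each step; both converge to the same orthogonal polar factor `u @ v.T` for this input.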

Vijay retweeted

Alex Zhurkevich @cudagdb

Apr 3

Trtllmgen kernels are now open. Fastest prefill and decode kernels for our target workloads. We wrote these to win InferenceX, MLPerf, other benchmarks. Powering some of today’s top served models. Dive in, learn, use them, or level up your own. Enjoy. github.com/flashinfer-ai/fla…

trtllmgen moe oss kernels by aleozlx · Pull Request #2917 · flashinfer-ai/flashinfer

📌 Description Thanks to Nikita Korobov (@nekorobov), Julien Demouth, Louis Sugy, Jiqun Tu, Alexander Zhurkevich (@azhurkevich), David Clark, @PerkzZheng, Maxim Milakov, Tian Zheng (@Tom-Zheng),...

Vijay retweeted

Edward Z. Yang @ezyang

Mar 27

In my opinion, here are the most important ideas of CuTe Layouts (arxiv.org/pdf/2603.02298) 🧵

Vijay retweeted

Tri Dao

@tri_dao

Mar 17

The frontier has increasingly shifted to hybrid models - from Qwen to Kimi-Linear and now with NVIDIA's Nemotron-3 Super - that rely on a strong linear sequence model. Today we release Mamba-3, the most powerful linear model to date. x.com/_albertgu/status/20339…

Albert Gu

@_albertgu

Mar 17

The newest model in the Mamba series is finally here 🐍 Hybrid models have become increasingly popular, raising the importance of designing the next generation of linear models. We've introduced several SSM-centric ideas to significantly increase Mamba-2's modeling capabilities without compromising on speed. The resulting Mamba-3 model has noticeable performance gains over the most popular previous linear models (such as Mamba-2 and Gated DeltaNet) at all sizes. This is the first Mamba that was student led: all credit to @aakash_lahoti @kevinyli_ @_berlinchen @caitWW9, and of course @tri_dao!

Vijay @__tensorcore__

Mar 11

ai.meta.com/blog/meta-mtia-s…

Four MTIA Chips in Two Years: Scaling AI Experiences for Billions

Serving a wide range of AI models on a global scale, while maintaining the lowest possible costs, is one of the most demanding infrastructure challenges in the industry.

Vijay retweeted

Anne Ouyang

@anneouyang

Mar 11

Excited to share @Standard_Kernel's seed round and some reflections on what we’ve learned about kernel generation and what we believe is next. Grateful to our amazing team, supporters, and the broader community pushing this space forward.

Vijay retweeted

Rupanshu Soi @rupanshusoi

Mar 6

The release of the FA4 paper is a good opportunity to highlight our paper (link below) on automatically finding optimal pipelines and warp specialization (WS) for these kernels. Twill uses SMT solvers to mechanically derive the FA3 and FA4 forward-pass pipelining and WS strategies. 1/n

Vijay retweeted

PyTorch

@PyTorch

Mar 5

FlexAttention now has a FlashAttention-4 backend. FlexAttention has enabled researchers to rapidly prototype custom attention variants, with 1000 repos adopting it and dozens of papers citing it. But users consistently hit a performance ceiling. Until now. We've added a FlashAttention-4 backend to FlexAttention on Hopper and Blackwell GPUs. PyTorch now auto-generates CuTeDSL score/mask modifications and JIT-instantiates FlashAttention-4 for your custom attention variant. The result: 1.2× to 3.2× speedups over Triton on compute-bound workloads. 🖇️ Read our latest blog here: hubs.la/Q045FHPh0 No more choosing between flexibility and performance. #PyTorch #FlexAttention #FlashAttention #OpenSourceAI

Vijay retweeted

Dylan Patel

@dylan522p

Feb 27

SemiAnalysis x Fluidstack are kicking off GTC with a Full-Stack AI Infra GPU Hackathon: Power to Prefill, Dirt to Decode. Build with the best, win prizes, and hear from @marksaroufim (GPU MODE), @cHHillee (Thinking Machines), Thomas Raoux (OpenAI), and @garywu. Apply below: luma.com/SAxFSHack

SemiAnalysis x Fluidstack Hackathon · Luma

SemiAnalysis x Fluidstack is kicking off GTC with Power to Prefill, Dirt to Decode, Transformers to Transformers: A Full-Stack AI Infrastructure…
