levi

levi

200 Photos and videos

Tweets

Pinned Tweet

levi

@levidiamode

Mar 27

Day 83/365 of GPU Programming Looking at DeepSeek's Multi-Head Latent Attention today. The last part of the AMD challenge series is to optimize an MLA decode kernel for MI355X where the absorbed Q and compressed KV cache are given and your task is to do the attention computation. A resource that really helped internalize what MLA does was @rasbt's incredible visual guide to attention variants in LLMs (luckily he posted that last week!), which covers everything from MHA to GQA to MLA to SWA, et cetera. If there's one place to get a visual intuition for recent attention mechanisms, it's this blog post. @jbhuang0604's video on MQA, GQA,MLA and DSA was the best conceptual intro I found on the topic and progressively builds up the ideas from first principles. The Welch Labs analysis of MLA is a great watch as well. Beautiful visualization of the changes DeepSeek made for MLA. Tried out a few kernels once I had a basic understanding of MLA and I think I'm slowly getting more comfortable with at least analyzing kernels.

levi

@levidiamode

Mar 26

Day 82/365 of GPU Programming Taking a closer look at Mixture of Experts today, so I can write better MoE kernels. Specifically, to optimize an MXFP4 MoE fused kernel for the GPU Mode challenge. I haven't had much prior exposure to MoEs, so lots of new concepts I learned today. Luckily I found the best intro to MoEs thanks to @MaartenGr visual overview of the topic. I then watched @tatsu_hashimoto's amazing Stanford CS336 lecture on MoEs, which added deeper context around why MoEs are gaining popularity, FLOPs, OLMoE, infra complexity, routing functions (mindblown this works so well...), expert sizes, training objectives, top k routing and DeepSeek variations. Once I had a basic understanding I started playing around with the some AITER kernels but progress there is tbd. Also had a nice chat with @juscallmevyom (who was kind enough to reach out!) about the AMD kernels and the challenge of materialization overhead.

146

1,432

117,416

levi

levi

@levidiamode

162/365 of GPU Programming Spending most of today reviewing the FP4 RaZeR and SMC-SD repos from yesterday as well as finalizing/submitting the camera ready version of my workshop paper. Maybe obvious to other but as a research novice, I've mainly been trying to understand how to best incorporate feedback from reviewers into the revision and how to best utilize the additional page you get post-review.

levi

@levidiamode

22h

161/365 of GPU Programming Was listening to an interview with @mohsaied (one of the co-founders of Makora) where he was walking through some of the things they're working on in the automated kernel generation space when he started explaining some current work on making NVFP4 more efficient by remapping redundant zeros and improving speculative decoding with sequential Monte Carlo. Since I'm quite interested in FP4 at the moment and am increasingly looking at inference engineering, I'm gonna spend some time today reading through those research projects and their corresponding opensource code.

572

levi

levi

@levidiamode

- Link to RaZeR repo: github.com/abdelfattah-lab/N… - Link to SMSCD repo: github.com/abdelfattah-lab/s…

levi

levi

@levidiamode

22h

levi

@levidiamode

Jun 12

160/365 of GPU Programming Taking a little break from studying FlashAttention to continue with CS336 today. Lecture 10 is on inference and starts with a nice encapsulation of why inference is so important, continues on with what metrics matter in inference (TTFT, latency, throughput), then goes through how you can calculate arithmetic intensities for various operations before breaking down compute vs memory bound, KV cache, prefill vs decode, GQA, MLA, CSA, DSA, HCA, linear attention, QAT, PTQ, speculative decoding, continuous batching and paged attention. Even if you're familiar with most of these concepts, it's a nice review of the inference stack and touches on some of the tradeoffs inherent in inference optimizations.

109

4,273

levi

levi

@levidiamode

22h

- Link to interview: youtube.com/watch?v=ukzACWrk… - Link to NVFP4 paper: arxiv.org/pdf/2501.04052 - Link to Monte Carlo Spec Decoding paper: arxiv.org/pdf/2604.15672

How Makora Generates CUDA Kernels That Beat Hand-Tuned Code |...

Kimbo Chen sits down with Mohamed Abdelfattah, Co-Founder & CSO at ...

youtube.com

421

levi

levi

@levidiamode

Jun 12

levi

@levidiamode

Jun 11

159/365 of GPU Programming Doing a regular review today just looking at what I've been studying the past few weeks and looking at questions/confusions I wrote down along the way. Will focus on parallelism/sharding strategies, FlashAttention v1-v4 and take another look at scaling laws tomorrow.

6,959

levi

levi

@levidiamode

Jun 12

Link to lecture 10: youtube.com/watch?v=EfM546A7…

231

Charles 🎉 Frye

levi retweeted

Charles 🎉 Frye

@charles_irl

Jun 11

Last fall, we shared our deep dive on FA4 internals. But we didn't stop at grokking the kernel. Since then, we've been developing improvements for inference performance and upstreaming them. This blog post explains those contributions. modal.com/blog/flash-attenti…

194

17,020

levi

levi

@levidiamode

Jun 11

levi

@levidiamode

Jun 10

158/365 of GPU Programming I think I understand the high level differences between the FlashAttention 2, 3 and 4 forward passes now but have yet to grasp the lower level details of each algorithm and the backward passes. Need to spend some time learning more about cooperative thread arrays, DSMEM, emulation of the exp function, rowmax/rowsum, warp partitioning and specializatio, WGMMA, asynchrony, producer/consumer pipelines, etc

4,700

levi

levi

@levidiamode

Jun 10

levi

@levidiamode

Jun 9

157/365 of GPU Programming Another FlashAttention4 resource that's been really helpful for me is the talk @charles_irl gave last year on GPU Mode (basically the lecture version of We reverse-engineered Flash Attention 4 blog post which is awesome as well) about FA4's code and the evolution to FA4. Really cool how the Modal team broke down the code before the paper release and made educated inferences about the forward pass. Wish more people did deeper code dissections like this!

108

4,993

levi

levi

@levidiamode

Jun 9

levi

@levidiamode

Jun 8

156/365 of GPU Programming Giving FlashAttention 4 a read today and trying to get a sense of the evolution of FlashAttention in its forward and backward passes over the four generations. I've seen @tedzadouri's GPU Mode talk mentioned quite a few times recently and have to echo that it's such a good perspective into what the thought process was behind FA4 and the steps to get there. @marksaroufim also does a great job interleaving the talk with pointed questions that help uninitiated learners like me get a better grasp of the concepts. Also want to highlight @drisspg's talk on FlexAttention which had the animation/visualization of the softmax/MMA pipelining in FA4.

0:26

7,315

levi

levi

@levidiamode

Jun 9

- Link to talk: youtube.com/watch?v=ZIEq-WTq… - Link to blog post: modal.com/blog/reverse-engin…

How FlashAttention 4 Works

Speaker: Charles FryeFrom the Modal team: https://modal.com/blog/...

youtube.com

538

levi

levi

@levidiamode

Jun 8

0:26

levi

@levidiamode

Jun 7

155/365 of GPU Programming Looking at FlashAttention 3 today and learning more about its hopper specific use of WGMMA (Warpgroup Matrix Multiply-Accumulate), different types of inter-warpgroup overlapping and the unique challenges of FP8 vs FP16 w.r.t datatype conversion and k-major constraints on FP8 WGMMAs. Mainly trying to process the paper and reviewer comments today before going through the code.

5,948

levi

levi

@levidiamode

Jun 8

- Link to paper: arxiv.org/pdf/2603.05451 - Link to 1st author talk: youtube.com/watch?v=kPKKvBqQ… - Link to FlexAttention talk: youtube.com/watch?v=JiCJJDQN…

323

levi

levi

@levidiamode

Jun 7

levi

@levidiamode

Jun 6

154/365 of GPU Programming There's this really nice talk by Tri Dao that walks through some of the behind-the-scenes of FlashAttention 1 and FlashAttention 2. Learned about the lore of Young-Jun Ko and how FA2 was originally born out of a refactor to use CUTLASS 3. Not the most technical talk but cool to peek behind the curtain of one of the most important ML systems contributions of our time.

4,635

levi

levi

@levidiamode

Jun 7

Link to FA3's NeurIPS openreview: openreview.net/forum?id=tVCo…

FlashAttention-3: Fast and Accurate Attention with Asynchrony and...

Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. elaborated an approach to speed up attention on...

openreview.net

353

levi

levi

@levidiamode

Jun 6

levi

@levidiamode

Jun 5

153/365 of GPU Programming I've only ever looked at the original Flash Attention paper, so I'm spending this weekend to take a closer look at FA v2 and v3 this weekend. One thing I love about OpenReview is how you can look back at the back and forth authors had with reviewers (in this case, Tri Dao for his FA2 ICLR submission) answering questions and providing rebuttals. You can also tell which reviewers are more familiar with the specific topic over others, which makes the questions asked during the review phase also quite broad in nature.

5,742

levi

levi

@levidiamode

Jun 6

Link to talk (from @gordic_aleksa's YT channel): youtube.com/watch?v=IoMSGuiw…

Flash Attention 2.0 with Tri Dao (author)! | Discord server talks

❤️ Become The AI Epiphany Patreon ❤️https://www.patreon.com/theaie...

youtube.com

616

Thien Tran

levi retweeted

Thien Tran @gaunernst

Jun 6

Finally got some time to port my tcgen05 kernels to CuteDSL. For PTX enjoyers, this should feel natural (except TMA 🤣). A BF16 MMA mainloop is shown below. I also worked up an example for MXFP8 and NVFP4.

135

10,980

levi

levi

@levidiamode

Jun 5

levi

@levidiamode

Jun 4

152/365 of GPU Programming @mlech26l recently gave a really nice talk at MIT on Massively Parallel Training based on his experience as CTO of Liquid AI. The lecture provides a really practical overview of the implications of scaling laws, compute vs memory tradeoffs, activation checkpointing, offloading to CPU memory, RDMA, sharding tradeoffs (e.g. optimizer states, pipeline, column vs row parallelism, etc), load balancing challenges, DeekSeek ZeRO, FSDP strategies and stacking different kinds of parallelisms

5,460

levi

levi

@levidiamode

Jun 5

Link to paper and openreview: openreview.net/forum?id=mZn2…

FlashAttention-2: Faster Attention with Better Parallelism and Work...

Scaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as...

openreview.net

321