kyle yu

Tweets

Pinned Tweet

kyle yu @brrrkyle

Apr 16

this is how i wish i learned GPU fundamentals. not a lengthy textbook. not a static image. every concept is an interactive visualization, covering the SM architecture, memory coalescing, synchronization, and more. what concepts do you want to see next? brrrviz.com

kyle yu @brrrkyle

Jun 9

wafer.ai stays cooking

steve

@gpusteve

Jun 8

brrrviz.com is quite nice

kyle yu @brrrkyle

Jun 2

I'm on vacation in Hong Kong and just shipped BrrrViz Chapter 11: Tiling from my hotel room. It's 6 interactive visuals that will help you grasp the concept. It's 1:43am. I'm tired. Hope it helps and go check it out :) brrrviz.com

kyle yu @brrrkyle

May 30

You launch a million threads and have them queue up to write to one address, one at a time. That's atomicAdd. It's correct. It's also a for-loop on a parallel computer. Switch to a reduction tree and you fix the bottleneck, but introduce a new one: by the final step, all but one thread is idle.
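
A rough way to see the trade-off in the tweet above is a pure-Python model. This is not GPU code, and the function names `serialized_sum` and `tree_sum` are made up for illustration. It counts how many steps a one-at-a-time queue takes, and how many "threads" do useful work at each level of a reduction tree.

```python
def serialized_sum(values):
    """Model of threads queuing on one address: one update per step."""
    total = 0
    steps = 0
    for v in values:          # each "thread" waits its turn
        total += v
        steps += 1
    return total, steps


def tree_sum(values):
    """Model of a reduction tree: pairs are added together at each level."""
    data = list(values)
    if not data:
        return 0, []
    active_per_step = []
    while len(data) > 1:
        if len(data) % 2:     # pad odd lengths with the additive identity
            data.append(0)
        half = len(data) // 2
        active_per_step.append(half)   # one "thread" per pair does work
        data = [data[i] + data[i + half] for i in range(half)]
    return data[0], active_per_step


values = list(range(1, 9))    # 8 hypothetical inputs
print(serialized_sum(values))  # (36, 8)
print(tree_sum(values))        # (36, [4, 2, 1])
```

With 8 inputs the queue needs 8 steps while the tree needs 3, but the number of busy workers halves at every level until only one is left.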

kyle yu @brrrkyle

May 30

Four interactive slides walk through the optimizations:
1. shared memory
2. warp packing
3. minimize bank conflicts
4. thread coarsening
Free at brrrviz.com ⚡

kyle yu retweeted

Elliot Arledge

@elliotarledge

May 13

great, intuitive resource. worth a few mins playing with as a refresher even if you've been through the fundamentals

kyle yu @brrrkyle

Apr 16

kyle yu retweeted

Pramod Goyal

@goyal__pramod

May 12

HOLY JESUS THIS IS AMAZING

kyle yu @brrrkyle

May 11

Replying to @goyal__pramod

check out brrrviz.com for more gpu visuals 🤙

kyle yu @brrrkyle

May 7

Most GPU bugs don't crash your program. They just give you the wrong answer. Silently. When thousands of threads try to update the same memory address simultaneously, each one does three things:
📖 read the current value
⚡ execute its computation
✍ write back the result
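
The three steps above can be modeled in plain Python to show how an update gets lost. This is a deterministic sketch of one worst-case interleaving, not real GPU behavior, and both function names are hypothetical.

```python
def racy_increment(num_threads):
    """Worst case: every thread reads before any thread writes."""
    counter = 0
    snapshots = [counter for _ in range(num_threads)]  # 1. all read the same value
    results = [s + 1 for s in snapshots]               # 2. each computes on a stale copy
    for r in results:                                  # 3. each write overwrites the last
        counter = r
    return counter


def serialized_increment(num_threads):
    """Read, compute, and write happen as one indivisible step per thread."""
    counter = 0
    for _ in range(num_threads):
        counter = counter + 1
    return counter


print(racy_increment(1000))        # 1
print(serialized_increment(1000))  # 1000
```

Nothing crashes in the racy version. It simply returns 1 instead of 1000, which is the silent wrong answer the tweet describes.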

kyle yu @brrrkyle

May 7

The cost: serialization. Threads queue at the address one at a time. The more threads contend for the same location, the more your parallelism collapses into a bottleneck. This is why real GPU kernels accumulate locally in registers first, then do a single atomicAdd at the end.
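
A small Python model of the pattern described above, assuming a hypothetical split of the input across `num_threads` workers. It only counts how many updates hit the shared total. It is not CUDA, and `local` merely stands in for a per-thread register.

```python
def naive_atomic_ops(values):
    """Every element triggers its own update on the shared total."""
    total, atomic_ops = 0, 0
    for v in values:
        total += v
        atomic_ops += 1
    return total, atomic_ops


def local_then_atomic(values, num_threads):
    """Each thread sums its own slice privately, then updates the total once."""
    total, atomic_ops = 0, 0
    for tid in range(num_threads):
        local = 0                            # stands in for a register
        for v in values[tid::num_threads]:   # strided slice owned by this thread
            local += v
        total += local                       # the single shared update per thread
        atomic_ops += 1
    return total, atomic_ops


values = list(range(1024))
print(naive_atomic_ops(values))       # (523776, 1024)
print(local_then_atomic(values, 32))  # (523776, 32)
```

Same answer, but the contended address is touched 32 times instead of 1024.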

kyle yu @brrrkyle

May 7

Chapter 9 of BrrrViz walks you through both scenarios. brrrviz.com

kyle yu retweeted

Zak 🦈 (e/acc)

@ZakShark

May 4

Train yourself in inference/kernel engineering. Knowing how to optimize GPU kernels well for inference workloads is worth its weight in gold. Mastering CUDA or Triton, vLLM, SGLang, and TensorRT-LLM is a real plus if you want to stand out in 2026-2027 as an AI/ML Engineer.

kyle yu @brrrkyle

Apr 29

Stop tuning the wrong bottleneck. GPU optimization isn’t one ceiling, it’s memory bandwidth vs peak compute. The roofline plots both, so you see which one limits your kernel.

kyle yu @brrrkyle

Apr 29

Memory-bound means your hardware is waiting on data. Fix data movement, locality, and reuse. Compute-bound means the data is there, but the math is slow on the hardware. Lower the precision, use tensor cores, or change the instruction path.
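
The memory-bound vs compute-bound split can be sketched with the standard roofline formula: attainable throughput is the smaller of peak compute and bandwidth times arithmetic intensity. The device numbers below are hypothetical round figures, not the specs of any real GPU, and the function names are illustrative.

```python
def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_flops, peak_bandwidth * intensity)


def classify(intensity, peak_flops, peak_bandwidth):
    """Compare a kernel's FLOP/byte against the ridge point of the roofline."""
    ridge = peak_flops / peak_bandwidth   # FLOP/byte where the two roofs meet
    return "memory-bound" if intensity < ridge else "compute-bound"


# Hypothetical device: 10 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK_FLOPS = 10e12
PEAK_BW = 1e12

# float32 vector add: 1 FLOP per 12 bytes moved (two loads, one store).
vec_add_intensity = 1 / 12

print(classify(vec_add_intensity, PEAK_FLOPS, PEAK_BW))  # memory-bound
print(classify(50.0, PEAK_FLOPS, PEAK_BW))               # compute-bound
```

On this made-up device the ridge sits at 10 FLOP/byte, so a vector add sits far to the left of it and no amount of faster math would help.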

kyle yu @brrrkyle

Apr 29

Chasing utilization without this perspective often means optimizing the wrong thing. Understanding where your kernel sits on this diagram helps you pick better optimizations. Find it in Chapter 3 of BrrrViz 👉 brrrviz.com/

kyle yu retweeted

Jino Rohit

@jino_rohit

Apr 15

i struggled a lot with visualizing GPU concepts. brrrviz seems like an incredible place to start learning them and understanding them visually.

kyle yu retweeted

datavorous

@datavorous_

Apr 25

life updates:
- panicking as a stupid nervous intern handling aws ec2 instances
- reading modal docs and brrrviz
- studying for end semester exams
- contemplating life choices; should i start over as a physics major?

kyle yu @brrrkyle

Apr 27

Dropped a new landing page and announced Act 02: ML Systems. I plan on covering transformer architecture, flash attention, KV cache, speculative decoding, and more. If you've ever wanted to actually understand how to run models fast on the hardware, this is for you.

kyle yu @brrrkyle

Apr 27

@ brrrviz.com

kyle yu retweeted

Bryan Johnson

@bryan_johnson

Apr 27

go to bed right now
i know the build is almost finished
the eval can wait til morning
the agent will still be failing tomorrow
you won't figure out why it's hallucinating
yes your coworker ships on 4 hrs of sleep
they also hallucinate a lot
off you go
