160/365 of GPU Programming
Taking a little break from studying FlashAttention to continue with CS336 today.
Lecture 10 is on inference. It starts with a nice summary of why inference is so important, moves on to the metrics that matter (TTFT, latency, throughput), then shows how to calculate arithmetic intensity for various operations. From there it breaks down compute-bound vs memory-bound, KV cache, prefill vs decode, GQA, MLA, CSA, DSA, HCA, linear attention, QAT, PTQ, speculative decoding, continuous batching and paged attention.
Even if you're familiar with most of these concepts, it's a nice review of the inference stack and touches on some of the tradeoffs inherent in inference optimizations.
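A quick sketch of the arithmetic intensity calculation, to make the compute-bound vs memory-bound split concrete. It assumes a single matmul of a (B x D) activation by a (D x F) weight in bf16, counts 2*B*D*F FLOPs, and counts one read of each input plus one write of the output. The function names and the hardware numbers (roughly H100-class peak FLOP/s and memory bandwidth) are my own illustrative assumptions, not something taken from the lecture.

```python
def matmul_arithmetic_intensity(b, d, f, bytes_per_elem=2):
    """FLOPs per byte moved for X (b x d) @ W (d x f).

    Assumes each input is read once and the output is written once,
    with bytes_per_elem bytes per value (2 for bf16).
    """
    flops = 2 * b * d * f  # one multiply and one add per (b, d, f) triple
    bytes_moved = bytes_per_elem * (b * d + d * f + b * f)
    return flops / bytes_moved


def is_compute_bound(intensity, peak_flops, mem_bandwidth):
    """Compute-bound when the op's intensity meets or exceeds the hardware's."""
    return intensity >= peak_flops / mem_bandwidth


# Assumed, illustrative hardware numbers (not from the lecture).
PEAK_FLOPS = 989e12  # FLOP/s
MEM_BW = 3.35e12     # bytes/s

if __name__ == "__main__":
    for b in (1, 64, 512, 4096):
        ai = matmul_arithmetic_intensity(b, 4096, 4096)
        regime = "compute" if is_compute_bound(ai, PEAK_FLOPS, MEM_BW) else "memory"
        print(f"B={b:5d}  intensity={ai:8.2f}  {regime}-bound")
```

With these assumed numbers the hardware threshold is about 295 FLOPs per byte. At B=1 (which is what a single decode step looks like) the intensity comes out just under 1, so it's deep in memory-bound territory, while larger batches, as in prefill, push it over to compute-bound.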
159/365 of GPU Programming
Doing a regular review today: going over what I've been studying the past few weeks and the questions/confusions I wrote down along the way.
Will focus on parallelism/sharding strategies and FlashAttention v1-v4, and take another look at scaling laws tomorrow.