Day 3/30 of LLM Inference
Today's focus: KV Cache - why it exists and what it actually does
Yesterday we covered prefill vs decoding. Today is about the thing that makes decoding not completely terrible.
The problem without KV Cache:
Every time the model generates a new token, it would have to recompute the Keys and Values for every single previous token from scratch. For a 2000-token context, that means pushing all 2000 tokens back through the model just to produce one new token. Pure waste.
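To put a number on that waste, here is a rough sketch that counts how many tokens get pushed through the model over a whole generation. The function name `tokens_processed` and the 100-token generation length are made up for illustration; it only counts tokens, not real FLOPs.

```python
def tokens_processed(prompt_len: int, new_tokens: int, use_cache: bool) -> int:
    """Count tokens run through the model while generating `new_tokens` tokens."""
    total = 0
    for step in range(new_tokens):
        context_len = prompt_len + step  # prompt plus tokens generated so far
        if use_cache and step > 0:
            total += 1            # only the newest token needs a forward pass
        else:
            total += context_len  # full context (this is prefill when caching)
    return total


no_cache = tokens_processed(2000, 100, use_cache=False)
with_cache = tokens_processed(2000, 100, use_cache=True)
print(no_cache, with_cache)
```

With a 2000-token prompt and 100 generated tokens, the uncached count comes to 204,950 against 2,099 with caching, which is about a 100x gap in this toy accounting.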
What KV Cache does:
During prefill, the model computes Key and Value vectors for every token in your prompt, at every layer. Instead of throwing them away, we store them in GPU memory.
During decoding, each new token only computes its own K and V, then reuses everything already cached.
No recomputing old Keys and Values. The new token's Query just attends over what's already in the cache.
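Here is a minimal sketch of that idea for a single attention head, in plain Python so it runs anywhere. Everything in it is a toy assumption: the dimension `D = 4`, the random weights, and the names `KVCache`, `decode_with_cache` and `decode_without_cache` are invented for illustration and are not from any real library. Real implementations do this with batched tensors on the GPU, per layer and per head.

```python
import math
import random

random.seed(0)
D = 4  # toy model dimension (assumption)


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


def project(x, W):
    """Multiply a row vector x (length D) by a D x D weight matrix W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


W_q, W_k, W_v = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)


def attend(q, keys, values):
    """Softmax attention of one query over a list of keys and values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(D) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values)) for j in range(D)]


def decode_without_cache(tokens):
    # Recompute K and V for every token in the context, every single step.
    keys = [project(t, W_k) for t in tokens]
    values = [project(t, W_v) for t in tokens]
    return attend(project(tokens[-1], W_q), keys, values)


class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, token):
        self.keys.append(project(token, W_k))
        self.values.append(project(token, W_v))


def decode_with_cache(cache, new_token):
    # Compute K and V for the new token only, then reuse everything cached.
    cache.append(new_token)
    return attend(project(new_token, W_q), cache.keys, cache.values)


prompt = rand_matrix(5, D)        # 5 toy "token embeddings"
new_token = rand_matrix(1, D)[0]

cache = KVCache()
for tok in prompt:                # prefill: fill the cache once
    cache.append(tok)

out_cached = decode_with_cache(cache, new_token)
out_full = decode_without_cache(prompt + [new_token])
print(all(math.isclose(a, b) for a, b in zip(out_cached, out_full)))
```

Both paths give the same attention output for the new token. The cached path just skips re-projecting the five prompt tokens.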
The tradeoff:
KV Cache trades compute for memory. The cache grows linearly with every token in the sequence, and that growth is multiplied by the number of layers and the batch size. For long contexts and large batches this blows up fast.
This is exactly why:
→ Context length is expensive
→ Batch size has limits
→ vLLM's PagedAttention was a big deal (Day 5)
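To make "blows up fast" concrete, here is a back-of-the-envelope size estimate. The model shape below (32 layers, 32 KV heads, head dimension 128, fp16) is an assumed example roughly in the range of a 7B-class model, not a measurement of any specific one, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size: one K and one V vector per token, per head, per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Assumed model shape, fp16 values (2 bytes each).
shape = dict(layers=32, kv_heads=32, head_dim=128)

per_token = kv_cache_bytes(**shape, seq_len=1, batch_size=1)
one_request = kv_cache_bytes(**shape, seq_len=2000, batch_size=1)
batch_of_16 = kv_cache_bytes(**shape, seq_len=2000, batch_size=16)

print(f"per token:    {per_token / 2**20:.2f} MiB")
print(f"2000 tokens:  {one_request / 2**30:.2f} GiB")
print(f"batch of 16:  {batch_of_16 / 2**30:.2f} GiB")
```

Under these assumptions that is 0.5 MiB per token, about 0.98 GiB for one 2000-token request, and about 15.6 GiB for a batch of 16. That is before counting the model weights themselves.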
KV Cache is the single most important optimization in LLM inference today. Everything else builds on top of it.
Day 4 tomorrow: Batching and throughput 📦
#LLMInference #KVCache #MachineLearning #GenAI #GPU #LLM #AIEngineering #MLOps #Transformers