KV cache is a nice example of how much of LLM engineering is just avoiding repeated work.
During generation, each new token needs to attend to all the old tokens. In a causal model, the old tokens have already produced their keys and values, and those don't depend on anything that comes later, so they are not changing. Instead of recomputing them every step, the model stores them and reuses them. That stored state is the KV cache.
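To make that concrete, here is a minimal sketch of one attention head decoding with and without a cache. Everything in it is an assumption for illustration: the tiny head dimension `D`, the random weights `W_Q`, `W_K`, `W_V`, and the helper names are all hypothetical, and a real model adds multiple heads, layers, and batching on top.

```python
import math
import random

random.seed(0)
D = 4  # hypothetical head dimension, kept tiny for readability

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical projection weights for a single attention head.
W_Q, W_K, W_V = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)

def project(x, W):
    # Row vector times matrix: x (D,) @ W (D, D) -> (D,)
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def attend(q, keys, values):
    # Scaled dot-product attention for one query over all keys seen so far.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(D) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(D)]

def decode_without_cache(xs):
    outputs = []
    for t in range(1, len(xs) + 1):
        # Keys and values for every earlier token are recomputed at each step.
        keys = [project(x, W_K) for x in xs[:t]]
        values = [project(x, W_V) for x in xs[:t]]
        outputs.append(attend(project(xs[t - 1], W_Q), keys, values))
    return outputs

def decode_with_cache(xs):
    k_cache, v_cache, outputs = [], [], []
    for x in xs:
        # Only the new token's key and value are computed, then appended.
        k_cache.append(project(x, W_K))
        v_cache.append(project(x, W_V))
        outputs.append(attend(project(x, W_Q), k_cache, v_cache))
    return outputs

xs = rand_matrix(5, D)  # five hypothetical token embeddings
slow = decode_without_cache(xs)
fast = decode_with_cache(xs)
```

Both functions produce the same outputs. The cached version does one key projection and one value projection per step, while the uncached version redoes them for every token seen so far.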
This makes decoding much faster, but it moves the pressure somewhere else: memory. The cache grows with every token, so longer context means a larger cache. More layers, more heads, and a bigger batch of concurrent requests each multiply that further.
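A rough back-of-the-envelope estimate shows how quickly this adds up. The sketch below assumes plain multi-head attention, one key and one value vector per token per head per layer, and 2 bytes per value (fp16/bf16). The function name and the example configuration are hypothetical, not taken from any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size, bytes_per_value=2):
    # The leading 2 accounts for storing both keys and values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical configuration: 32 layers, 32 KV heads of size 128, 4096 tokens.
single = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                        seq_len=4096, batch_size=1)
print(single / 2**30)  # 2.0 (GiB) for one request

# Eight concurrent requests at the same length need eight times as much.
batched = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                         seq_len=4096, batch_size=8)
print(batched / 2**30)  # 16.0 (GiB)
```

Every factor in the product scales the cache linearly, which is why context length and concurrency are usually the first things to hit the memory limit.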
So, I made a video explaining the KV cache in detail 👇.