One underrated thing about using Claude:
If you reply within a few minutes, Claude often reuses the modelβs KV cache instead of recomputing the entire conversation again.
During the βprefillβ stage, transformer attention states for every token are generated and stored temporarily in memory. Recomputing them is expensive.
So if you reply before the cache expires (~5 min), you get:
β’ lower latency
β’ fewer tokens reprocessed
β’ lower inference cost
Once the cache expires, those attention states may need to be rebuilt from scratch again.
Thatβs actually one reason I built Claude Pulse.
It shows a live cache countdown directly above the Claude chat box so you can literally see when your context cache is about to expire.
Tiny UI detail. Huge difference once you understand how LLM infra works.
Extension link in comments π