Just like we saw with NVIDIA, TPUs are moving many phases of inference off HBM (TPU 8i)
The clearest case is on the 8i (serving/inference) side. It carries 3x more on-chip SRAM than the previous generation, and the stated purpose is that TPU 8i can host a larger KV Cache entirely on silicon, significantly reducing the idle time of the cores during long-context decoding.
That's a direct HBM offload. In autoregressive decoding the KV cache is normally the dominant bandwidth consumer, streamed out of HBM for every token, so keeping it in Vmem (384 MB vs. 128 MB on the 8t, per the spec table) takes that traffic off HBM whenever the cache fits on chip.
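To make the "streamed out of HBM for every token" point concrete, here is a rough sizing sketch. The model shape (32 layers, 8 KV heads, head dimension 128, 8-bit cache) is hypothetical and chosen only for illustration. Decimal megabytes, and the idea that all of Vmem is available to the cache, are also assumptions; only the 128 MB and 384 MB figures come from the text.

```python
from dataclasses import dataclass

MB = 10**6  # decimal megabytes; an assumption about how the spec counts


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical decoder shape, for illustration only."""
    layers: int = 32
    kv_heads: int = 8
    head_dim: int = 128
    bytes_per_elem: int = 1  # e.g. an 8-bit KV cache


def kv_bytes_per_token(m: ModelShape) -> int:
    # Keys and values (factor 2) for every layer and KV head.
    return 2 * m.layers * m.kv_heads * m.head_dim * m.bytes_per_elem


def kv_bytes_read_per_step(m: ModelShape, context_len: int, batch: int = 1) -> int:
    # Each decode step attends over the whole cached context,
    # so the full cache is read once per generated token.
    return kv_bytes_per_token(m) * context_len * batch


def max_resident_tokens(m: ModelShape, vmem_bytes: int) -> int:
    # Upper bound: assumes the whole Vmem is free for the KV cache.
    return vmem_bytes // kv_bytes_per_token(m)


shape = ModelShape()
for name, vmem_mb in (("TPU 8t", 128), ("TPU 8i", 384)):
    tokens = max_resident_tokens(shape, vmem_mb * MB)
    print(f"{name}: about {tokens} cached tokens fit in {vmem_mb} MB")
```

Under these assumptions each cached token costs 64 KiB, so the per-step read grows linearly with context length, and tripling Vmem triples the context that can stay resident. Real deployments would reserve part of Vmem for weights and activations, so the resident context would be smaller.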
| Chip | HBM | On-chip SRAM (Vmem) | HBM:SRAM ratio |
|---|---|---|---|
| TPU 8t | 216 GB | 128 MB | ~1,700:1 |
| TPU 8i | 288 GB | 384 MB | ~750:1 |
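The ratio column can be checked with a few lines of arithmetic. This sketch assumes the larger figure on each row is HBM capacity and the smaller one is on-chip SRAM, both in decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes). The dictionary and function names are illustrative.

```python
GB = 10**9  # decimal units, an assumption about the spec table
MB = 10**6

# (HBM bytes, on-chip SRAM bytes) per chip, taken from the rows above.
chips = {
    "TPU 8t": (216 * GB, 128 * MB),
    "TPU 8i": (288 * GB, 384 * MB),
}


def hbm_to_sram_ratio(hbm_bytes: int, sram_bytes: int) -> float:
    # How many bytes of HBM exist per byte of on-chip SRAM.
    return hbm_bytes / sram_bytes


for name, (hbm, sram) in chips.items():
    print(f"{name}: {hbm_to_sram_ratio(hbm, sram):.1f}:1")
```

This gives 1687.5:1 for the 8t and 750.0:1 for the 8i, consistent with the rounded ~1,700:1 and ~750:1 figures. The 8i holds more than twice as much SRAM per byte of HBM as the 8t.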
As models get larger and inference demands get tighter, SRAM capacity and KV-cache residency matter more.
Demand for HBM is still increasing, but the necessity of HBM in many phases is gradually decreasing. This allows us to focus on the long-term operating margin rather than simply volume sold.