There is a 1,500x bandwidth gap hiding in the inference memory hierarchy. And it is about to become everyone's problem.
NVIDIA's Vera Rubin platform now treats the NVMe SSD as part of the inference memory hierarchy. Their CMX architecture (launched at GTC last month with BlueField-4 STX) offloads KV cache from GPU memory to flash so that evicted context survives instead of being recomputed. The claimed gain is up to 5x more tokens per second. Dell, HPE, VAST Data, WEKA, and a dozen other vendors are building products around it. CoreWeave, Lambda, and Oracle are early adopters.
That means the SSD is no longer background storage. Its read latency directly affects time to first token. On a deployment serving tens of thousands of requests per hour, even a small increase in drive latency compounds across every single one.
The technical detail worth understanding: KV cache stores the attention state for every token in context. As context windows grow into the hundreds of thousands of tokens, the cache overflows GPU HBM (22 TB/s on Rubin) into host DRAM (~300 GB/s) and then onto NVMe (7-14 GB/s on PCIe Gen5). Even against the fastest Gen5 drives, that is roughly a 1,500x bandwidth drop from HBM to SSD (22 TB/s versus 14 GB/s); at 7 GB/s it is closer to 3,000x. The system works when the drive is healthy. When it is not, every cache read slows down, and the latency shows up in token generation without any GPU-side signal that something is wrong.
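To make the gap concrete, here is a minimal back-of-the-envelope sketch using only the bandwidth figures quoted above. The 4 GB KV cache segment is a hypothetical size chosen for illustration, and the model ignores access latency, queueing, and protocol overhead, so treat the output as an order-of-magnitude guide.

```python
# Peak bandwidth per tier in GB/s, taken from the figures in the text.
TIERS_GBPS = {
    "HBM (Rubin)": 22_000.0,
    "host DRAM": 300.0,
    "NVMe Gen5 (fast)": 14.0,
    "NVMe Gen5 (slow)": 7.0,
}

HBM_GBPS = TIERS_GBPS["HBM (Rubin)"]


def slowdown_vs_hbm(tier_gbps: float) -> float:
    """How many times slower a tier is than HBM, by peak bandwidth alone."""
    return HBM_GBPS / tier_gbps


def transfer_ms(size_gb: float, tier_gbps: float) -> float:
    """Idealized time in milliseconds to stream size_gb at a tier's peak bandwidth."""
    return size_gb / tier_gbps * 1000.0


# Hypothetical KV cache segment size, for illustration only.
KV_SEGMENT_GB = 4.0

for name, gbps in TIERS_GBPS.items():
    print(
        f"{name:18s} {slowdown_vs_hbm(gbps):8.0f}x slower than HBM, "
        f"{transfer_ms(KV_SEGMENT_GB, gbps):8.2f} ms to read {KV_SEGMENT_GB} GB"
    )
```

At 14 GB/s the ratio works out to about 1,571x, and at 7 GB/s to about 3,143x. Any drop in effective drive throughput stretches the NVMe rows proportionally, which is why drive health feeds straight into time to first token.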
KV cache recycling is also a write-heavy workload. Every evict-and-reload cycle writes to flash and consumes the drive's finite program/erase cycles, and the drive's internal garbage collection multiplies those writes further (write amplification), depending on drive quality and whether you are running enterprise TLC or cheaper QLC flash. As the NAND wears, read latency creeps up gradually. The degradation looks like a model problem or a network issue long before anyone checks the drive.
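The wear arithmetic can be sketched in a few lines. Write amplification is the ratio of bytes physically written to NAND to bytes the host asked to write. The lifetime estimate below is a deliberately simplified model that assumes even wear and a constant workload. All input values are hypothetical placeholders, not vendor specifications; substitute the rated program/erase cycles and measured write volume for your own drives.

```python
def write_amplification(nand_bytes_written: float, host_bytes_written: float) -> float:
    """Write amplification factor: physical NAND writes per host write."""
    if host_bytes_written <= 0:
        raise ValueError("host_bytes_written must be positive")
    return nand_bytes_written / host_bytes_written


def estimated_lifetime_days(
    capacity_tb: float,
    pe_cycles: float,
    host_tb_per_day: float,
    waf: float,
) -> float:
    """Rough days until the NAND write budget is exhausted.

    Simplified model: budget = capacity * rated P/E cycles, consumed at
    host_tb_per_day * waf. Ignores over-provisioning and uneven wear.
    """
    if host_tb_per_day <= 0 or waf <= 0:
        raise ValueError("host_tb_per_day and waf must be positive")
    nand_budget_tb = capacity_tb * pe_cycles
    return nand_budget_tb / (host_tb_per_day * waf)


# Hypothetical example: the same drive and workload at two amplification levels.
for waf in (1.5, 4.0):
    days = estimated_lifetime_days(
        capacity_tb=4.0,       # assumed drive capacity
        pe_cycles=3000,        # assumed rated P/E cycles (placeholder)
        host_tb_per_day=10.0,  # assumed KV cache eviction volume
        waf=waf,
    )
    print(f"WAF {waf}: roughly {days:.0f} days of write budget")
```

The point of the comparison is that the same eviction traffic can wear a drive several times faster when amplification is high, so the host-side write rate alone understates how quickly the flash is aging.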
Most inference monitoring tracks GPU utilization, temperature, and token throughput. SSD wear, write amplification, and drive-level latency are almost never in the stack.
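As a starting point, drive wear can be pulled from the NVMe SMART / Health log. The sketch below assumes nvme-cli is installed and that its JSON output uses the field names shown (`critical_warning`, `percent_used`, `media_errors`, `avail_spare`, `spare_thresh`, `data_units_written`); these names can vary between nvme-cli versions, so check them against your own output. The 80% wear threshold and the function names are illustrative choices, not a standard.

```python
import json
import subprocess

# NVMe reports data units in thousands of 512-byte blocks.
BYTES_PER_DATA_UNIT = 512_000


def read_smart_log(device: str) -> dict:
    """Run nvme-cli and return the SMART log as a dict (usually needs root)."""
    result = subprocess.run(
        ["nvme", "smart-log", device, "--output-format=json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def host_tb_written(smart: dict) -> float:
    """Total host writes in TB, derived from data_units_written."""
    return smart.get("data_units_written", 0) * BYTES_PER_DATA_UNIT / 1e12


def drive_alerts(smart: dict, max_percent_used: int = 80) -> list:
    """Return human-readable warnings for a SMART log dict.

    Field names are assumptions about nvme-cli JSON output; missing
    fields are treated as healthy rather than raising.
    """
    alerts = []
    if smart.get("critical_warning", 0) != 0:
        alerts.append("controller reports a critical warning")
    if smart.get("percent_used", 0) >= max_percent_used:
        alerts.append(f"wear at {smart['percent_used']}% of rated endurance")
    if smart.get("media_errors", 0) > 0:
        alerts.append(f"{smart['media_errors']} media errors logged")
    if smart.get("avail_spare", 100) <= smart.get("spare_thresh", 0):
        alerts.append("available spare at or below threshold")
    return alerts
```

A call such as `drive_alerts(read_smart_log("/dev/nvme0"))` could run on a schedule and export its results next to the GPU metrics. SMART data covers wear and errors but not per-request read latency, which would have to be measured separately at the I/O layer.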
Is drive health part of your inference observability, or is the SSD still invisible?