Producing just a single Claude Fable 5 or GPT 5.5 token moves hundreds of GB of data.
Indeed, as large models continue to scale active parameters and context sizes, an increasing share of time goes to moving data to the processors rather than actually processing it. The result is longer serving latency and expensive, idle compute. This is called the memory wall, and the approaches to it, along with their tradeoffs, are explored below.
For the full interactive essay, check out the link in the comments.
-----
Modern GPUs are increasingly fast at performing the arithmetic behind frontier AI. The issues arise because the data for the computation has to be physically present on the compute die before the arithmetic can start. For large models, this can be a lot of data. Specifically:
- The model weights need to be available. A good heuristic is that 1B parameters equates to approximately 1 GB of weights (at 8-bit precision). So a Mythos-class model with 10T parameters requires 10 TB of memory just to store its weights.
-- Mixture-of-experts architectures enable inference on a subset of these weights, so for a given computation perhaps “only” a few hundred GB is actually needed.
- The user context also needs to be available. For frontier models, this context can be up to 1M tokens, each of which attends to every other token. The resulting key-value (KV) cache can easily be tens of GB per user.
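To make the two bullets above concrete, here is a rough Python sketch of the sizing arithmetic. The weight heuristic comes straight from the text (8-bit precision, so 1 byte per parameter). The KV cache formula is the standard one for attention (one key and one value vector per token, per layer, per KV head), but the model shape used here (32 layers, 8 KV heads, head dimension 128, 8-bit cache) is purely hypothetical and not taken from any real model.

```python
GB = 1e9   # decimal units, matching the round numbers in the text
TB = 1e12


def weight_bytes(params, bytes_per_param=1):
    """Bytes needed to hold the weights (1 byte per parameter at 8-bit)."""
    return params * bytes_per_param


def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=1):
    """Bytes of KV cache for one user's context.

    The factor of 2 covers one key vector and one value vector
    per token, per layer, per KV head.
    """
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value


# 10T parameters at 8-bit precision
print(weight_bytes(10e12) / TB)  # 10.0 (TB)

# Hypothetical shape: 32 layers, 8 KV heads, head dim 128, 8-bit cache
print(kv_cache_bytes(1_000_000, 32, 8, 128) / GB)  # 65.536 (GB)
```

Under these assumed dimensions, a 1M-token context lands in the tens of GB per user. Doubling the layer count or storing the cache at 16-bit would double that figure.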
The problem arises as the ratio of data needed to on-die memory available continues to climb. As an example, consider the new NVIDIA Blackwell Ultra, which has 160 register files of 256 KB each that can be processed in parallel. That comes to roughly 40 MB (160 × 256 KB) of working memory, so for 100 GB of weights, producing just one token means streaming those weights through in 100 GB / 40 MB = 2,500 fills of the combined register files. The same 2,500 fills repeat for every token, before even factoring in user context. Modern architectures tier the memory so that the most commonly needed values are stored closest to the compute dies, so these fills are often blindingly fast. But across sequences that run to millions of tokens, served in parallel to hundreds of millions of users, the time and idle compute quickly add up.
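The fill count above can be reproduced in a few lines of Python. This is a back-of-the-envelope sketch using decimal units and only the figures quoted in the text; the function names are illustrative. Note that the exact product 160 × 256 KB is 40.96 MB, which gives about 2,442 fills, while rounding to 40 MB gives the 2,500 quoted.

```python
KB = 1_000          # decimal units, matching the round numbers in the text
MB = 1_000_000
GB = 1_000_000_000


def on_die_bytes(num_register_files, bytes_each):
    """Combined capacity of all register files on the chip."""
    return num_register_files * bytes_each


def fills_per_token(active_weight_bytes, working_bytes):
    """How many times working memory must be refilled to stream all
    active weights through once (integer ceiling division)."""
    return -(-active_weight_bytes // working_bytes)


working = on_die_bytes(160, 256 * KB)
print(working)                               # 40960000 bytes, i.e. 40.96 MB
print(fills_per_token(100 * GB, working))    # 2442
print(fills_per_token(100 * GB, 40 * MB))    # 2500, using the rounded 40 MB
```

Either way the order of magnitude is the same: thousands of refills per token, repeated for every token generated.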
The link below illustrates two different breakdowns of the problem. The first looks at how real chip architectures make tradeoffs to tackle it, using real model scenarios (The Cost to Produce a Token). These chips are then compared on the axes of token throughput and configurability, with different model selections again highlighting the tradeoffs made (The Architecture Tradeoff Matrix). The second breakdown explores how fast each successive tier of memory is in this shuttling sequence, and how storing values at each tier affects token throughput for the same real model scenarios (Memory Tiering Impact on Token Throughput).