What is disaggregated inference?
What does it do?
When does it matter?
Who is it built for?
What Is Disaggregated Inference?
In the AI world, “training” is how AI is made and “inference” is how AI is used.
“Inference disaggregation” is a technique to divide and conquer inference compute.
Disaggregation separates inference into two stages: prompt processing, called “prefill,” and output generation, called “decode.”
Prefill - where the model processes your prompt. This is the text you type into ChatGPT, for example.
Decode - where the model generates new tokens one at a time to create the response that you read. This is the answer you get back from ChatGPT.
Why Disaggregation Matters
These two stages have very different computational characteristics.
Prefill is naturally parallel: every token in the prompt can be processed at once, so it is limited mainly by compute and needs relatively little memory bandwidth.
Decode is inherently serial and memory-bandwidth intensive.
Prefill can be done quickly, while decode accounts for the majority of the time between hitting send and getting your full answer. This is because decode is a sequential process: each output token (roughly, a word or piece of a word) must be generated before the next can begin.
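To make the difference concrete, here is a back-of-envelope sketch in Python. It is a rough estimate under stated assumptions, not a benchmark: it assumes about 2 FLOPs per parameter per token, assumes the weights are read from memory once per forward pass, serves a single request at a time, and ignores attention and KV cache traffic. The `Accelerator` class, the function names, and every number in it are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    """Hypothetical chip; both figures are illustrative, not vendor specs."""
    peak_flops: float     # peak compute, FLOP/s
    mem_bandwidth: float  # memory bandwidth, bytes/s


def pass_time(tokens_in_pass, n_params, bytes_per_param, hw):
    """Estimate one forward pass over `tokens_in_pass` tokens.

    Assumes ~2 FLOPs per parameter per token and that the weights are read
    from memory once per pass. Returns (seconds, is_compute_bound).
    """
    compute_s = 2.0 * n_params * tokens_in_pass / hw.peak_flops
    memory_s = n_params * bytes_per_param / hw.mem_bandwidth
    return max(compute_s, memory_s), compute_s >= memory_s


def request_latency(prompt_tokens, output_tokens, n_params, bytes_per_param, hw):
    """Split one request's estimated latency into prefill and decode."""
    # Prefill: the whole prompt goes through the model in one parallel pass.
    prefill_s, prefill_compute_bound = pass_time(
        prompt_tokens, n_params, bytes_per_param, hw)
    # Decode: one pass per output token, each waiting on the previous one.
    step_s, decode_compute_bound = pass_time(1, n_params, bytes_per_param, hw)
    return {
        "prefill_s": prefill_s,
        "decode_s": output_tokens * step_s,
        "prefill_compute_bound": prefill_compute_bound,
        "decode_compute_bound": decode_compute_bound,
    }


# Illustrative inputs only: a 70B-parameter model with 16-bit weights on a
# chip with 1 PFLOP/s of compute and 3 TB/s of memory bandwidth.
hw = Accelerator(peak_flops=1e15, mem_bandwidth=3e12)
result = request_latency(prompt_tokens=2000, output_tokens=500,
                         n_params=70e9, bytes_per_param=2, hw=hw)
print(result)
```

With these made-up inputs, prefill comes out compute-bound and finishes in a fraction of a second, while decode comes out bandwidth-bound and takes tens of seconds. Real systems batch many requests together during decode, which changes the numbers considerably, but the basic asymmetry between the two stages remains.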
Because the stages are so different, there's an opportunity to specialize, that is, to divide and conquer.
Rather than one processor doing both jobs, you can use two different processors, each with an architecture suited to its task. The result of this specialization is higher throughput and lower power consumption.
The Tradeoff
In computer architecture, there is no free lunch. The cost of specialization is lost flexibility.
Deploying separate hardware for prefill and decode locks in the ratio between them.
For example, out of every 100 racks, you might allocate 30 to prefill and 70 to decode. That ratio is fixed at deployment time.
When you can predict key workload characteristics (input/output ratio, KV cache size, cache hit rate), specialization delivers exceptional value. But it's fragile.
If workload characteristics shift, you end up with the wrong balance of prefill and decode hardware.
The result: stranded capacity, lower utilization, higher power draw, and higher costs.
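A toy model shows how quickly a fixed split can strand capacity. The sketch below is a simplification under loud assumptions: work is measured in abstract "rack-equivalents", each pool scales linearly, and `prefill_share` (the fraction of total work that is prefill) is a hypothetical knob standing in for all of the workload characteristics above. The function name and the numbers are illustrative only.

```python
def fleet_utilization(prefill_racks, decode_racks, prefill_share):
    """Estimate how much of a fixed prefill/decode split a workload can use.

    `prefill_share` is the fraction of total work (in rack-equivalents) that
    is prefill; the rest is decode. It must be strictly between 0 and 1.
    """
    if not 0.0 < prefill_share < 1.0:
        raise ValueError("prefill_share must be strictly between 0 and 1")
    decode_share = 1.0 - prefill_share
    # Traffic can grow only until one pool is full; that pool is the bottleneck.
    servable_work = min(prefill_racks / prefill_share,
                        decode_racks / decode_share)
    busy_prefill = servable_work * prefill_share
    busy_decode = servable_work * decode_share
    total_racks = prefill_racks + decode_racks
    return {
        "stranded_prefill": prefill_racks - busy_prefill,
        "stranded_decode": decode_racks - busy_decode,
        "utilization": (busy_prefill + busy_decode) / total_racks,
    }


# The 30/70 split from the example, first with the workload it was sized for,
# then after prompts grow and prefill becomes half of the work.
as_planned = fleet_utilization(30, 70, prefill_share=0.30)
after_shift = fleet_utilization(30, 70, prefill_share=0.50)
print(as_planned)
print(after_shift)
```

In this toy model the planned workload keeps essentially all 100 racks busy. Once prefill rises to half of the work, the 30 prefill racks become the bottleneck, about 40 of the 70 decode racks sit idle, and overall utilization falls to roughly 60 percent.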
The challenge, of course, is that hardware deployments are meant to last for five or six years, and data centers are physically configured for the hardware deployed. Change is expensive.
When you can’t predict workload characteristics with high accuracy, then specialization through disaggregation will cost more and consume more power.
Who Benefits, Who Doesn't
Hyperscalers, who have fleets of different processors and who can move workloads across their fleet, will easily overcome the lack of flexibility in disaggregated solutions.
And they will benefit enormously from it. If the workload changes, they can direct that traffic to different processors in their massive fleets.
However, for enterprises and neoclouds, which have long depreciation schedules and are locked into a specific vendor’s processor architecture, the rapidly changing AI landscape will be a real challenge for disaggregated solutions.
The Bottom Line
If you know your workload well and are confident it won't change much, or if you have a large pool of diverse hardware to absorb shifts, disaggregation is a good choice.
If you can't predict your traffic or lack a flexible hardware fleet, a more general-purpose approach that handles prefill and decode on the same hardware is probably the safer bet.
Final Thoughts
Disaggregated inference is still a new technology. I'm often asked what percentage of AI data centers will be built this way.
The honest answer is that no one knows yet. Battles between specialized solutions and more general ones are always interesting and difficult to predict.
But overall, with AI inference growing so quickly, I expect disaggregation will add to, rather than replace, the way we do inference today.