I’m writing a full piece about my insights from Nebius Inflection, but one topic I found particularly interesting was disaggregated inference.
$NBIS is already planning for it as a way to extend the useful life of GPUs.
Let me put it simply...
LLM inference has two main phases:
Prefill, when the model processes the user’s prompt/context.
Decode, when the model generates the answer token by token.
The important point is that these two phases stress infrastructure differently.
Prefill is more compute-intensive and benefits from processing many tokens in parallel, while decode is more latency-sensitive and often more memory-bandwidth constrained.
Disaggregated inference separates these workloads, allowing different infrastructure pools to be optimized for each stage.
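To make the compute vs. memory-bandwidth split concrete, here is a rough back-of-the-envelope sketch in Python. It is a simplification under my own assumptions, not how any particular provider models this: it counts only weight traffic for a single dense layer (ignoring the KV cache, attention and activations), and the accelerator figures are hypothetical round numbers, not a real GPU spec.

```python
# Roofline-style sketch: how many FLOPs does each phase do per byte of weights read?
# Assumption: one dense d x d layer, weights-only memory traffic, fp16 weights.

def arithmetic_intensity(tokens_in_flight: int, bytes_per_param: int = 2) -> float:
    """FLOPs per byte of weights read when `tokens_in_flight` tokens share one pass."""
    # A d x d matmul over n tokens costs ~2*n*d*d FLOPs and reads d*d*bytes_per_param
    # bytes of weights, so d*d cancels out of the ratio.
    return 2 * tokens_in_flight / bytes_per_param


def bound_by(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a workload against the hardware's ridge point (FLOPs per byte)."""
    ridge_point = peak_flops / mem_bandwidth
    return "compute" if intensity >= ridge_point else "memory-bandwidth"


# Hypothetical accelerator, chosen only for illustration.
PEAK_FLOPS = 400e12      # FLOP/s
MEM_BANDWIDTH = 2e12     # bytes/s

prefill = arithmetic_intensity(tokens_in_flight=2048)  # whole prompt in one pass
decode = arithmetic_intensity(tokens_in_flight=1)      # one new token per step

print("prefill:", prefill, "FLOPs/byte ->", bound_by(prefill, PEAK_FLOPS, MEM_BANDWIDTH))
print("decode: ", decode, "FLOPs/byte ->", bound_by(decode, PEAK_FLOPS, MEM_BANDWIDTH))
```

In practice, batching many users' decode steps together raises the decode figure, which is one reason the two phases are tuned differently.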
Why does this matter?
Because it can improve GPU utilization, reduce cost per token, lower latency, and make inference more efficient at scale.
All of this can also help extend each GPU’s economic useful life.
Same old story...
As new GPU generations come out, older GPUs may become less attractive for cutting-edge training. But that doesn’t mean they become useless.
If $NBIS can intelligently allocate different parts of inference workloads across different types of hardware, older GPUs can remain productive for longer.
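As a toy illustration of what that allocation could look like, the Python sketch below picks the cheapest pool for each phase based on the resource that phase leans on most. Everything in it is hypothetical: the pool names, specs, prices and the selection rule are mine, made up for illustration, and it is not a description of how Nebius actually schedules workloads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    name: str
    peak_tflops: float    # compute throughput (hypothetical figure)
    mem_bw_tbps: float    # memory bandwidth in TB/s (hypothetical figure)
    cost_per_hour: float  # hourly cost in dollars (hypothetical figure)


def cheapest_pool(pools: list[Pool], phase: str) -> Pool:
    """Pick the pool with the lowest cost per unit of the phase's bottleneck resource."""
    if phase == "prefill":
        # Prefill leans on compute: minimize dollars per TFLOP.
        return min(pools, key=lambda p: p.cost_per_hour / p.peak_tflops)
    if phase == "decode":
        # Decode leans on memory bandwidth: minimize dollars per TB/s.
        return min(pools, key=lambda p: p.cost_per_hour / p.mem_bw_tbps)
    raise ValueError(f"unknown phase: {phase}")


# Made-up pools: the older generation has far less compute but comparable bandwidth.
POOLS = [
    Pool("newer_gen", peak_tflops=1000, mem_bw_tbps=3.0, cost_per_hour=4.0),
    Pool("older_gen", peak_tflops=300, mem_bw_tbps=2.0, cost_per_hour=1.5),
]

for phase in ("prefill", "decode"):
    print(phase, "->", cheapest_pool(POOLS, phase).name)
```

With these invented numbers the older pool wins decode on cost per unit of bandwidth while the newer pool wins prefill. Which generation actually suits which phase depends on real specs and prices.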
That has direct implications for ROIC: the longer a GPU keeps earning revenue, the more return the same upfront investment generates.
In my view, this is exactly the kind of infrastructure-level optimization that separates a serious AI cloud platform from a simple GPU capacity reseller.
As I’ve been saying since the beginning, building a sustainable AI cloud isn’t just about plugging GPUs into electricity, and this is a good example of that.
$NBIS' engineering advantage is what can make this type of optimization possible.