Omni Model Inference: How We Move Tensors Between Stages
A friend once asked me: what's the fundamental difference between serving Omni multimodal models and serving plain LLMs? I thought about it and the simplest way to put it is this — a regular LLM handles a request with a single model in a single process; an Omni model handles a request by relaying it across multiple models.
Take Qwen3 Omni as an example. The lifecycle of a voice conversation request looks roughly like this: the user sends an audio clip, which first passes through an audio encoder that converts the waveform into embeddings, then feeds into the Thinker (a large language model) for inference to generate text tokens, and finally those tokens stream into the Talker (a speech synthesis model) that progressively produces audio waveforms to return to the user. TTS models with a Dual-AR architecture follow a similar pattern — a large AR model generates coarse-grained tokens, a small AR model fills in the fine-grained tokens, and a vocoder synthesizes the final audio.
These models have inherent data dependencies: if the Thinker doesn't produce tokens, the Talker has nothing to consume; if the large AR doesn't emit coarse tokens, the small AR can't fill in fine tokens. But at the same time, they must run in parallel — the Talker neither needs to nor can afford to wait for the Thinker to finish generating everything before it starts. Otherwise, the user would wait several seconds before hearing the first syllable, and the experience would completely fall apart. The dependency between them is streaming: as soon as the upstream produces a small chunk of data, the downstream must consume it immediately.
This is why we split the entire inference pipeline into multiple stages, each running a component model, with intermediate results passed between stages via inter-process communication. Each stage is an independent process with its own GPU/hardware management, its own scheduling loop, and its own batch management. The upstream stage produces tensors, the downstream stage consumes tensors — a textbook producer-consumer relationship.
But "passing tensors between processes" actually breaks down into two fundamentally different concerns. The first is signaling — telling the downstream that data is ready. A few dozen bytes, demanding low latency at the microsecond level. The second is data transfer — moving tens of megabytes of tensor data from one process to another, demanding throughput, ideally with zero copy.
ZMQ is naturally suited for signaling, being lightweight and low-latency, but asking it to transfer a 64MB tensor means serialization overhead that blows up latency. Shared memory and CUDA IPC are naturally suited for moving large blocks of data with near-zero copy, but they have no built-in event notification mechanism; you'd have to resort to polling or bolt on external signaling to notify the downstream.
So our design is straightforward: separate Control Plane from Data Plane. ZMQ handles only notifications (lightweight messages like DataReadyMessage), while the Relay handles only data (tensor transfer via shared memory / NCCL / CUDA IPC), each doing what it does best. Once this separation was established, many downstream architectural decisions fell into place naturally.
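To make "a few dozen bytes" concrete, here is a minimal sketch of what a control-plane message could look like. Only the name DataReadyMessage comes from our design as described above; the fields (request_id, slot_index, nbytes, seq) and the wire layout are assumptions chosen for illustration, not the actual format.

```python
import struct
from dataclasses import dataclass

# Hypothetical wire layout (an assumption, not the real format):
# little-endian u64 request id, u32 slot index, u64 payload size, u32 chunk sequence number.
_FORMAT = "<QIQI"
MESSAGE_SIZE = struct.calcsize(_FORMAT)  # fixed size, no padding with "<"


@dataclass(frozen=True)
class DataReadyMessage:
    request_id: int  # which request this chunk belongs to
    slot_index: int  # which data-plane slot holds the tensor
    nbytes: int      # how many bytes of the slot are valid
    seq: int         # position of this chunk within the stream

    def pack(self) -> bytes:
        """Encode the notification; these bytes are all the control plane carries."""
        return struct.pack(_FORMAT, self.request_id, self.slot_index, self.nbytes, self.seq)

    @classmethod
    def unpack(cls, raw: bytes) -> "DataReadyMessage":
        """Decode a notification received from the upstream stage."""
        return cls(*struct.unpack(_FORMAT, raw))
```

The point of the sketch is the size asymmetry: the notification describes where a 64MB tensor lives without carrying any of it, so the signaling path stays small no matter how large the payload is.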
With the Control Plane and Data Plane separated, the next natural question is: how exactly does the Data Plane move data?
The most intuitive approach is serialization — the upstream serializes the tensor into a byte stream, sends it to the downstream via socket, and the downstream deserializes it back. Logically clean, but the cost is prohibitive: a 64MB tensor going through serialization, memory copy, the network stack, and deserialization every time — the latency and CPU overhead are simply unacceptable in a streaming inference scenario.
Since upstream and downstream stages run in different processes on the same machine, a more natural approach is shared memory: the upstream writes the tensor directly into a memory region accessible to both processes, and the downstream reads from the same address. No serialization needed, no copy needed, and with CUDA IPC, even GPU tensors can be accessed directly across processes — zero copy in the truest sense.
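The host-memory half of this idea can be sketched with Python's standard library alone. This is an illustration under simplifying assumptions: both handles live in one process for brevity, the payload is plain bytes rather than a tensor, and CUDA IPC for GPU memory is not shown. The function name is hypothetical.

```python
from multiprocessing import shared_memory


def roundtrip_through_shared_memory(payload: bytes) -> bytes:
    """Write a non-empty payload into a named segment, then read it via a second handle."""
    # "Upstream": create a named segment and write the payload in place.
    producer = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        producer.buf[: len(payload)] = payload
        # "Downstream": attach to the same segment by name. In a real pipeline
        # only the name would travel over the control plane, never the data.
        consumer = shared_memory.SharedMemory(name=producer.name)
        try:
            # Copying out to bytes is only for this demo; a real consumer
            # would work on consumer.buf directly.
            return bytes(consumer.buf[: len(payload)])
        finally:
            consumer.close()
    finally:
        producer.close()
        producer.unlink()  # free the segment once both sides are done
```

In a real deployment the two handles would be opened by different processes, and the only thing exchanged between them would be the segment name.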
But shared memory is no free lunch. The biggest question is: who manages the read-write cadence of this memory? Upstream and downstream speeds don't necessarily match — the Thinker might suddenly slow down due to a long context, or the Talker might fall behind because of heavy vocoder computation. If the upstream writes faster than the downstream can consume, the shared memory will eventually be exhausted. This calls for a flow control mechanism.
We chose a credit mechanism, which is essentially a classic semaphore. A fixed number of shared memory slots are pre-allocated between upstream and downstream (say 10 slots, each 64MB), and the credit represents the number of currently available empty slots. Before writing data, the upstream acquires one credit; after writing, it sends a notification via ZMQ. Once the downstream finishes reading, it releases the credit, and the upstream can reuse that slot. When credits are exhausted, the upstream blocks — naturally forming backpressure. The pipeline's throughput automatically degrades to the speed of the slowest stage rather than blowing up memory. This is also why the downstream must consume as quickly as possible after receiving a notification: releasing credits lets the upstream keep pushing forward; otherwise, the entire pipeline stalls.
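The acquire, write, notify, read, release cycle can be sketched as follows. This is a simplified model, not our implementation: threads stand in for stage processes, an in-process queue stands in for the ZMQ notification channel, and a Python list stands in for the pre-allocated shared memory slots. The names CreditedSlots and run_pipeline are hypothetical.

```python
import queue
import threading


class CreditedSlots:
    """Fixed pool of slots guarded by a counting semaphore (the credits)."""

    def __init__(self, num_slots: int) -> None:
        self._credits = threading.Semaphore(num_slots)
        self._free = queue.SimpleQueue()  # indices of currently empty slots
        for i in range(num_slots):
            self._free.put(i)
        self.slots = [None] * num_slots  # stand-in for pre-allocated buffers

    def acquire(self, timeout=None):
        """Upstream: take one credit; return None if none frees up in time."""
        if not self._credits.acquire(timeout=timeout):
            return None
        return self._free.get()

    def release(self, slot: int) -> None:
        """Downstream: hand the slot back after reading it."""
        self.slots[slot] = None
        self._free.put(slot)
        self._credits.release()


def run_pipeline(num_chunks: int, num_slots: int = 3) -> list:
    pool = CreditedSlots(num_slots)
    ready = queue.SimpleQueue()  # stand-in for the notification channel
    received = []

    def upstream():
        for i in range(num_chunks):
            slot = pool.acquire()  # blocks when the downstream is behind
            pool.slots[slot] = f"chunk-{i}"
            ready.put(slot)  # "data ready" notification
        ready.put(None)  # end-of-stream marker

    def downstream():
        while (slot := ready.get()) is not None:
            received.append(pool.slots[slot])
            pool.release(slot)  # returning the credit unblocks the upstream

    threads = [threading.Thread(target=upstream), threading.Thread(target=downstream)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

Calling run_pipeline(20, num_slots=3) pushes 20 chunks through only 3 slots, so the upstream can never be more than 3 chunks ahead. With the numbers from the text, 10 slots of 64MB each cap the memory of one stage link at 640MB regardless of how long the service runs.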
This approach looks simple at first glance, but it's worth comparing against several common alternatives:
Ring buffer — a fixed-size circular buffer maintaining read and write pointers, blocking when write catches up to read. However, our stages run on different GPUs, and cross-GPU tensor transfer goes through CUDA IPC. CUDA IPC is per-allocation: each slot is an independent cudaMalloc, corresponding to an independent IPC handle, and the downstream's mapped address is determined by the driver — slots are not contiguous in address space. The "single contiguous memory block" assumption that ring buffers rely on simply doesn't hold here. If you force it, you're just rotating a slot index with modulo, which is logically equivalent to credit counting but adds an unnecessary layer of abstraction.
Dynamic allocation — no pre-allocated fixed slots; malloc new memory each time and free it when done. Maximum flexibility, but in a shared memory context, cross-process shm allocation and deallocation is inherently heavy, and fragmentation accumulates relentlessly in long-running inference services. For a scenario where slot sizes are fixed and quantities are bounded, dynamic allocation is using a cannon to kill a mosquito.
Unbounded queue — unlimited capacity, upstream writes freely. Simplest to implement but provides zero flow control. If the downstream can't keep up, it's OOM. Unacceptable in production.
Drop without backpressure — when the upstream fills up, discard or overwrite. Works for real-time streaming media scenarios where dropping a few video frames goes unnoticed, but in an inference pipeline every token carries semantic meaning — dropping one means getting it wrong.
Comparing these options side by side, for the specific set of constraints we face (cross-process shared memory for large tensors, mismatched upstream/downstream speeds, and long-running operation), pre-allocating fixed slots and counting them with a semaphore is close to the most natural choice: zero fragmentation, bounded memory, built-in backpressure.
The more I work on systems design, the more I feel this: the hard part isn't coming up with a clever solution; it's recognizing, among a pile of solutions that all "seem to work," the one that fits the constraints most tightly. The credit mechanism is exactly that. At first glance it seems too textbook, but a textbook solution running stably in production means the problem was modeled correctly in the first place.