Running NVIDIA Nemotron-3-Super-120B-A12B (NVFP4 mixed) on DGX Spark /
GB10 (Grace-Blackwell) via vLLM 0.19.2rc1. Architecture resolves as
NemotronHMTPModel — hybrid Mamba Transformer with MTP draft head.
PROBLEM: prefix caching is non-functional on this model.
With --enable-prefix-caching set explicitly, vLLM emits this warning on boot:
"Prefix caching in Mamba cache 'all' mode is currently enabled.
Its support for Mamba layers is experimental."
Measured behavior over a real production run:
queries: 6,211
hits: 0
hit rate: 0.00%
Confirmed across thousands of requests with shared system prompts that
should be deduplicating cleanly. They aren't.
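
For anyone who wants to check their own box: a minimal sketch of pulling the hit rate out of the server's Prometheus metrics text. The counter names below are an assumption (they may differ by vLLM version, so check your own /metrics output), and the model_name label in the sample is hypothetical.

```python
import re

# ASSUMPTION: counter names as exposed by recent vLLM builds; verify against /metrics.
QUERIES = "vllm:prefix_cache_queries_total"
HITS = "vllm:prefix_cache_hits_total"


def read_counter(metrics_text: str, name: str) -> float:
    """Sum every sample of one counter in Prometheus text exposition format."""
    total = 0.0
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP / TYPE comment lines
        # Sample lines look like `name{label="x"} 123.0` or `name 123.0`.
        m = re.match(r"^([^\s{]+)(\{[^}]*\})?\s+(\S+)", line)
        if m and m.group(1) == name:
            total += float(m.group(3))
    return total


def prefix_cache_hit_rate(metrics_text: str) -> float:
    """Return hits / queries, or 0.0 when nothing has been queried yet."""
    queries = read_counter(metrics_text, QUERIES)
    hits = read_counter(metrics_text, HITS)
    return hits / queries if queries else 0.0


# Sample text mirroring the numbers above (label value is hypothetical).
sample = '''# TYPE vllm:prefix_cache_queries_total counter
vllm:prefix_cache_queries_total{model_name="nemotron"} 6211.0
vllm:prefix_cache_hits_total{model_name="nemotron"} 0.0
'''
print(f"{prefix_cache_hit_rate(sample):.2%}")  # prints 0.00%
```

In practice the text would come from the server's /metrics endpoint rather than a literal string.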
This isn't a misconfiguration: NVIDIA's own Nemotron-3-Super deployment
guide (<github.com/NVIDIA-NeMo/Nemot…> SparkDeploymentGuide) OMITS
--enable-prefix-caching, which suggests upstream already knows it
doesn't work on NemotronH.
WHY IT MATTERS: We run a 4-way parallel fan-out pattern — brain
decomposes a brief, dispatches 4 concurrent section-writes against vLLM
with an identical 163-token shared system prompt, stitches results.
Measured throughput:
c=1: 15.0 tok/s
c=2: 25.4 tok/s
c=3: 30.1 tok/s
c=4: 41.4 tok/s (← --max-num-seqs cap; 4th slot effectively free)
That 2.76× user-facing speedup (41.4 / 15.0) is real. But each of those
4 calls re-prefills the same 163-token system message independently.
With a working prefix cache, the shared prompt would be prefilled once
instead of 4 times, a ~4× cut in that cost. For larger shared prompts
(RAG context, long instruction blocks, few-shot exemplars), the savings
grow with the length of the shared prefix.
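
A back-of-envelope sketch of that prefill arithmetic. It assumes block-granular reuse (only full blocks of the shared prefix are reusable), a block size of 16 tokens, and that the first request's blocks are already cached when the other three get scheduled. None of that is confirmed for this model: hybrid models may run with a different block size, and the function name is made up for illustration.

```python
def shared_prefill_tokens(n_requests, prefix_len, block_size=16, cached=True):
    """Count shared-prefix tokens prefilled across n_requests identical prefixes."""
    if not cached:
        # Every request recomputes the whole shared prefix.
        return n_requests * prefix_len
    # ASSUMPTION: only complete blocks of the prefix can be reused.
    reusable = (prefix_len // block_size) * block_size
    leftover = prefix_len - reusable
    # First request computes everything; the rest recompute only the partial tail block.
    return prefix_len + (n_requests - 1) * leftover


without = shared_prefill_tokens(4, 163, cached=False)  # 4 * 163 = 652
with_cache = shared_prefill_tokens(4, 163)             # 163 + 3 * 3 = 172
print(without, with_cache, round(without / with_cache, 2))  # 652 172 3.79

# If the block were larger than the shared prompt, nothing would be reusable:
print(shared_prefill_tokens(4, 163, block_size=256))  # 652
```

The second print is worth keeping in mind: under these assumptions a short system prompt only benefits if it spans at least one full cache block.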
THE STRUCTURAL QUESTION: Is this a fundamental limit? Mamba's selective-
scan state can't be paged like transformer KV — it's a fixed-size
recurrent state, not a sequence of key/value vectors. So for the Mamba
LAYERS of a hybrid model, prefix cache semantics are genuinely unclear.
But for the TRANSFORMER LAYERS interleaved with them, the KV reuse
should work fine.
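
A toy sketch of that difference. This is not Mamba (no selective parameters, no conv state), just a scalar linear recurrence next to a per-token list, with hypothetical names throughout. It shows that a recurrent state can be resumed from a snapshot taken exactly at the prefix boundary, but cannot be sliced to recover a shorter prefix the way a KV list can.

```python
def scan(xs, state=0.0, decay=0.9):
    """Toy recurrence h_t = decay * h_{t-1} + x_t; returns the final fixed-size state."""
    for x in xs:
        state = decay * state + x
    return state


def kv_append(xs, cache=None):
    """Toy KV cache: one entry per token, so it grows with sequence length."""
    cache = list(cache or [])
    cache.extend(xs)
    return cache


prefix = [1.0, 2.0, 3.0, 4.0]
suffix = [5.0, 6.0]

# Recurrent state: a snapshot at the prefix boundary can be resumed exactly.
full = scan(prefix + suffix)
snapshot = scan(prefix)
resumed = scan(suffix, state=snapshot)
assert resumed == full

# KV cache: any shorter prefix is just a slice of the longer one.
kv = kv_append(prefix)
assert kv[:2] == kv_append(prefix[:2])

# The snapshot is a single number; it holds no per-token entries to slice,
# so the state for prefix[:2] has to be stored separately or recomputed.
assert scan(prefix[:2]) != snapshot
```

So reuse for the recurrent layers means storing state snapshots at chosen token boundaries, rather than paging per-token entries.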
Has anyone at @NVIDIAAIDev or @NVIDIAAI considered a "hybrid prefix
cache" mode that caches the transformer-layer KV pages for a matched
prefix while re-running the Mamba state forward? The Mamba layers would
still be recomputed, so it wouldn't remove the whole prefill cost, but
even a partial fix could recover a useful share of it on this
architecture.
Or — is there a known issue / planned vLLM PR I should be watching?
Happy to share the bench harness if it's useful for repro.
cc @vllm_project: same question on your side.
Hardware / config: DGX Spark, GB10, 121 GiB unified memory,
--max-num-seqs 4, --quantization fp4, MARLIN MoE backend, async
scheduling, MTP disabled (it breaks structured tool-call emission;
separate issue).