Your LLM API is leaking its stack: not the model, the stack underneath.
@mlsec at @bifoldberlin / @TUBerlin (arXiv:2605.29979, May 28 2026): floating-point non-associativity is a side channel. Same model, same prompt, different engine / attention / GPU → different tokens. From chat output only, no logits.
Example. Qwen2.5-7B, greedy, "Did England lose a 1966 World Cup game?":
• SGLang, FlashAttention, A100 → P(Yes)=0.378
• SGLang, FlashInfer, H100 → P(Yes)=0.500
• TensorRT-LLM, FlashInfer, H100 → P(Yes)=0.562
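Why the numbers move: float addition is not associative, so two kernels that sum the same terms in a different order can land on different results. A toy sketch, not taken from the paper: `sum_sequential`, `sum_pairwise`, `greedy_token` and the logit values are made up for illustration, and real attention kernels reduce thousands of terms, not three.

```python
def sum_sequential(terms):
    """Accumulate left to right, like a simple serial loop."""
    total = 0.0
    for t in terms:
        total += t
    return total


def sum_pairwise(terms):
    """Accumulate as a binary tree, like a parallel reduction."""
    if len(terms) == 1:
        return terms[0]
    mid = len(terms) // 2
    return sum_pairwise(terms[:mid]) + sum_pairwise(terms[mid:])


def greedy_token(yes_logit, no_logit):
    """Greedy decoding over two candidate tokens (toy stand-in)."""
    return "Yes" if yes_logit > no_logit else "No"


# Hypothetical contributions to the "Yes" logit; the "No" logit is fixed.
terms = [0.1, 0.2, 0.3]
no_logit = 0.6

seq = sum_sequential(terms)   # computes (0.1 + 0.2) + 0.3
pair = sum_pairwise(terms)    # computes 0.1 + (0.2 + 0.3)

print(seq == pair)                   # the two orders disagree in the last bit
print(greedy_token(seq, no_logit))   # token picked under the serial order
print(greedy_token(pair, no_logit))  # token picked under the tree order
```

Same inputs, same math on paper, and the greedy token flips once the logits are close enough. That last-bit disagreement is the signal the attack reads.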
The attack: four prompt sets, each probing one stage:
1) Rare-token recall → autoregressive decode
2) Yes/No → prefill
3) Numeric ID in long context → chunked prefill
4) "Repeat N times" → KV-cache reuse
Score responses, train a random forest.
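A rough sketch of the "score, then classify" idea, under loud assumptions: this is not the paper's pipeline. The scoring rule (`score_yes_rate`), the sampled-response setup and the nearest-match lookup (`nearest_stack`) are hypothetical stand-ins, and the lookup replaces the random forest to keep the example dependency-free. The only real numbers are the three P(Yes) values quoted in this thread.

```python
# Reference P(Yes) values quoted in this thread for one Yes/No prompt.
REFERENCE = {
    "SGLang / FlashAttention / A100": 0.378,
    "SGLang / FlashInfer / H100": 0.500,
    "TensorRT-LLM / FlashInfer / H100": 0.562,
}


def score_yes_rate(responses):
    """Score chat output only: fraction of responses that start with 'yes'.

    Hypothetical scoring rule; assumes the same prompt was sampled
    repeatedly so the Yes frequency approximates P(Yes).
    """
    if not responses:
        raise ValueError("need at least one response")
    hits = sum(1 for r in responses if r.strip().lower().startswith("yes"))
    return hits / len(responses)


def nearest_stack(yes_rate, reference=REFERENCE):
    """Return the reference stack whose P(Yes) is closest to the score.

    Stand-in for a trained classifier: one feature, nearest match.
    """
    return min(reference, key=lambda name: abs(reference[name] - yes_rate))


# Synthetic responses, made up for illustration: 9 Yes out of 16.
observed = ["Yes."] * 9 + ["No."] * 7
rate = score_yes_rate(observed)
print(rate, nearest_stack(rate))
```

One prompt gives one feature. The attack described here scores many prompts across four probe sets, so the classifier sees a long feature vector, which is why a random forest fits better than a one-dimensional lookup.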
Across 4 engines (@vllm_project, @sgl_project, TensorRT-LLM, LMDeploy), 6 attention backends, 3 GPUs (@nvidia H100/A100/L4) on Llama-3.2 (@AIatMeta) and Qwen (@Alibaba_Qwen):
• T=0: 100% ID on engine, attention, GPU
• T=0.6: 76% engine / 58% attention / 59% GPU
• 113 prompts are enough
• Survives unseen batch sizes and app prompts (99.7%)
Why prod leaders should care: once an attacker pins down your engine, they can aim CVE-2026-22778 (vLLM RCE) or CVE-2026-5760 (SGLang RCE) at it. "Just unify the kernels" isn't a defense: unified stacks add 100% latency.
Real mitigations cost utility: noise breaks deterministic decode, rate-limits are Sybil-evadable.
Determinism is now a security property.
arxiv.org/abs/2605.29979