Researchers found a way to make LLMs 8.5x faster!
(without compromising accuracy)
Speculative decoding is an effective way around the one-token-per-forward-pass bottleneck in standard LLM inference.
A small "draft" model first generates the next several tokens, then the large model verifies all of them at once in a single forward pass.
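If a drafted token is wrong, you keep everything before it, take the large model's own token at that position, and restart from there. Every verification pass yields at least one token, so you never get fewer tokens per large-model pass than normal decoding, and the output quality is unchanged.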
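Here's a minimal sketch of that draft-then-verify loop for greedy decoding. The two "models" are hypothetical toy functions that map a token prefix to the next token (they are not a real library API), and the toy verifies position by position where a real target model would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List

# Hypothetical stand-in: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """One draft-then-verify round; returns the tokens accepted this round."""
    # 1) Draft k tokens one at a time with the cheap model.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft(prefix + drafted))

    # 2) Verify. A real target scores all k positions in a single forward
    #    pass; this toy calls it once per position for readability.
    accepted: List[int] = []
    for tok in drafted:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep what matched, plus the target's own token.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # All k drafts matched, so the target's next token comes along as a bonus.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Repeat draft-then-verify rounds until n_new tokens are produced."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + n_new]


def plain_greedy(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one token per target call."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


# Toy models (made up for illustration): the drafter disagrees with the
# target whenever the last token is 5.
def target_model(seq: List[int]) -> int:
    return (seq[-1] * 3 + 1) % 7


def draft_model(seq: List[int]) -> int:
    return 0 if seq[-1] == 5 else target_model(seq)


if __name__ == "__main__":
    spec = generate(target_model, draft_model, [1], 20)
    base = plain_greedy(target_model, [1], 20)
    print(spec == base)
```

Even when the drafter's very first guess is wrong, the round still returns the target's own token, which is why each round makes progress and the final sequence is the same as plain greedy decoding.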
But current drafters in speculative decoding still guess one token at a time. That makes the drafting step itself a bottleneck, capping real-world speedups at around 2-3x.
DFlash is a new technique that replaces the autoregressive drafter with a lightweight block diffusion model that guesses all tokens in one parallel shot.
Drafting cost stays flat no matter how many tokens you speculate.
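To see why that matters, here's a rough back-of-the-envelope latency model. It assumes, as described above, that a parallel drafter proposes the whole block in one pass while an autoregressive drafter needs one pass per token. The function name and the millisecond timings are made-up illustrative values, not measurements from DFlash.

```python
def round_latency_ms(k: int, t_draft_ms: float, t_target_ms: float,
                     parallel_draft: bool) -> float:
    """Latency of one draft-then-verify round under a simple additive model."""
    # Autoregressive drafter: k sequential passes. Parallel drafter: one pass.
    draft_passes = 1 if parallel_draft else k
    return draft_passes * t_draft_ms + t_target_ms


if __name__ == "__main__":
    # Hypothetical timings: 2 ms per drafter pass, 30 ms per target pass.
    for k in (4, 8, 16):
        ar = round_latency_ms(k, 2.0, 30.0, parallel_draft=False)
        par = round_latency_ms(k, 2.0, 30.0, parallel_draft=True)
        print(f"k={k}: autoregressive {ar} ms, parallel {par} ms")
```

Under these assumptions the autoregressive round gets slower as k grows, while the parallel round stays the same length, so speculating more tokens per round no longer costs extra drafting time.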
On top of that, the drafter is conditioned on hidden features pulled from multiple layers of the target model and injected into every draft layer, so it makes significantly better guesses than a drafter working from scratch.
In the side-by-side demo below, vanilla decoding runs at 48.5 tokens/sec. DFlash hits 415 tokens/sec on the same model, with zero quality loss.
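Those two demo numbers are where the 8.5x headline comes from. A quick check, where the 1000-token response length is just an illustrative assumption:

```python
vanilla_tps = 48.5   # tokens/sec, vanilla decoding (from the demo)
dflash_tps = 415.0   # tokens/sec, DFlash (from the demo)

speedup = dflash_tps / vanilla_tps
print(f"speedup: {speedup:.2f}x")  # → speedup: 8.56x

# Hypothetical response length, to put the rates in wall-clock terms.
n_tokens = 1000
print(f"vanilla: {n_tokens / vanilla_tps:.1f} s, DFlash: {n_tokens / dflash_tps:.1f} s")
```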
It's already integrated with vLLM, SGLang, and Transformers, with draft models on Hugging Face for Qwen3, Qwen3.5, Llama 3.1, Kimi-K2.5, gpt-oss, and more.
I have shared the GitHub repo in the replies!
KV caching is another must-know technique for speeding up LLM inference. I recently wrote an article about it. Read it below.
I'll soon publish another article on speculative decoding.
Stay tuned!!