Google did it again.
1,000 tokens per second on a smart local model.
That speed normally needs a datacenter. DiffusionGemma hits it on a single H100, and still clears 700 tokens per second on a consumer RTX 5090.
What changed is how the model writes.
Autoregressive LLMs write one token at a time, left to right. On a local GPU that wastes the chip, since producing a single token for a single user leaves most of its compute idle. Cloud servers hide the waste by batching many users together.
Diffusion works differently. It borrows the idea behind image generators: start with noise, then refine it into something clean.
For text, it drafts a whole block of 256 tokens at once, then makes a few passes to lock in the good ones and fix the rest.
That gives the GPU real work to do, so it runs near full speed.
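To make the draft-then-refine loop concrete, here is a toy sketch in plain Python. It is not DiffusionGemma's decoder: `toy_denoiser` is a hypothetical stand-in for the model, the confidences are random, and the 4 passes are an assumed value for "a few". It also only shows the "lock in the confident ones" half; a real decoder could revise tokens it already committed.

```python
import random

MASK = None  # marks a position that has not been committed yet


def toy_denoiser(block, target, rng):
    """Hypothetical stand-in for the model.

    For every masked slot it proposes a token and a confidence score.
    A real model would predict both from the whole block at once.
    """
    proposals = {}
    for i, tok in enumerate(block):
        if tok is MASK:
            proposals[i] = (target[i], rng.random())
    return proposals


def decode_block(target, passes=4, seed=0):
    """Fill a fully masked block in a fixed number of parallel passes."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    history = []  # committed-token count after each pass
    for step in range(passes):
        proposals = toy_denoiser(block, target, rng)
        if not proposals:
            break
        remaining_passes = passes - step
        # Commit an even share of what is still masked (ceiling division).
        quota = -(-len(proposals) // remaining_passes)
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in best[:quota]:
            block[i] = proposals[i][0]
        history.append(sum(tok is not MASK for tok in block))
    return block, history


tokens, committed = decode_block(list(range(256)), passes=4)
print(committed)  # [64, 128, 192, 256]
```

Under these assumptions the block needs 4 model calls instead of the 256 sequential calls a token-by-token decoder would make, and each call works on all 256 positions at once.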
Speed is only half of it. Because the model sees the whole block, every token can look at every other token, including the ones ahead.
Autoregressive models can't look forward, and they can never take a token back.
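The difference shows up in the attention mask. A minimal illustration, assuming a block where row i lists the positions token i is allowed to see (the function names are made up for this sketch, not taken from any library):

```python
def causal_mask(n):
    """Autoregressive: token i sees only positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def full_mask(n):
    """Block diffusion, as described above: every token sees every position."""
    return [[True] * n for _ in range(n)]


def visible_pairs(mask):
    """Count how many (token, position) pairs are visible."""
    return sum(sum(row) for row in mask)


n = 4
print(causal_mask(n)[0])  # [True, False, False, False]
print(full_mask(n)[0])    # [True, True, True, True]
print(visible_pairs(causal_mask(n)), visible_pairs(full_mask(n)))  # 10 16
```

In the causal mask the first token sees nothing ahead of it; in the full mask it sees the whole block, which is what lets early tokens depend on later ones.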
That matters more than it sounds. Take Sudoku: every cell depends on the others in both directions, so a left-to-right model paints itself into corners. A diffusion model fills the grid at once and refines until it fits.
The same trick powers code editing and structured formatting, where seeing the end helps you write the start.
DiffusionGemma is open under Apache 2.0 and fits in 18 GB of GPU memory once compressed.
One honest catch: quality sits below standard Gemma 4, and the speed win only applies to local, single-user use, since batched cloud serving already keeps the GPU busy.
Still, a model that drafts whole blocks and fixes its own work changes what one machine can do.
Google's official blog post:
deepmind.google/models/gemma…
If you want to learn more about building your own diffusion LLMs, I'm sharing an open-source library I posted about a few days back. The tweet is quoted below.
Turn any autoregressive LLM into a diffusion LM.
dLLM is a Python library that unifies the training & evaluation of diffusion language models.
You can also use it to turn ANY autoregressive LM into a diffusion LM with minimal compute.
100% open-source.