DiffusionGemma: How Google's New Open LLM Hits 1,000 Tokens/sec and Changes Inference Economics
DiffusionGemma generates text up to 4x faster than autoregressive LLMs, hits 1,000 tokens/sec on a single H100, and runs on a consumer RTX 4090. Here is what changed, what the trade-offs are, and...
dev.to