BLOCK DIFFUSION FOR THE WIN!
This is what DFlash was built for. ⚡ Our block-diffusion drafter KV injection, now running at frontier scale — thanks to @modal and @sgl_project for the engine integration support!
karan sharma retweeted
Google did it again. 1,000 tokens per second on a smart local model. That speed normally needs a datacenter. DiffusionGemma hits it on a single H100, and still clears 700 per second on a consumer RTX 5090.

What changed is how the model writes. Autoregressive LLMs write one token at a time, left to right. On a local GPU that wastes the chip, since each token barely uses it. Cloud servers hide the waste by batching many users together.

Diffusion works differently. It borrows the idea behind image generators: start with noise, then refine it into something clean. For text, it drafts a whole block of 256 tokens at once, then makes a few passes to lock in the good ones and fix the rest. That gives the GPU real work to do, so it runs near full speed.

Speed is only half of it. Because the model sees the whole block, every token can look at every other token, including the ones ahead. Autoregressive models can't look forward, and they can never take a token back.

That matters more than it sounds. Take Sudoku: every cell depends on the others in both directions, so a left-to-right model paints itself into corners. A diffusion model fills the grid at once and refines until it fits. The same trick powers code editing and structured formatting, where seeing the end helps you write the start.

DiffusionGemma is open under Apache 2.0 and fits in 18GB of GPU memory once compressed. One honest catch: quality sits below standard Gemma 4, and the speed win is a local, single-user thing. Still, a model that drafts whole blocks and fixes its own work changes what one machine can do.

Google official blog post: deepmind.google/models/gemma…

If you want to learn more about building your own diffusion LLMs, I'm sharing an open-source library I posted about a few days back. The tweet is quoted below.
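For anyone who wants the "draft a block, then lock in the good tokens over a few passes" idea in code, here is a toy Python sketch. It is not DiffusionGemma's algorithm: the "model" is a random confidence score, the known answer is copied in when a position is committed, and nothing is ever revised. The function name `toy_denoise` and its parameters are made up for illustration. It only shows the schedule, where many positions are filled per pass instead of one token per step.

```python
import random


def toy_denoise(target, steps=4, seed=0):
    """Toy block-denoising schedule (illustrative only, not a real model).

    Starts from a fully masked block and, on each pass, commits the
    highest-"confidence" masked positions in parallel. Returns the block
    as a string after every pass, with "_" marking masked positions.
    """
    rng = random.Random(seed)
    block = [None] * len(target)  # None means "still masked"

    def render():
        return "".join("_" if t is None else t for t in block)

    history = [render()]
    for step in range(steps):
        masked = [i for i, t in enumerate(block) if t is None]
        if not masked:
            break
        # Hypothetical stand-in for model confidence: one random score
        # per masked position.
        scores = {i: rng.random() for i in masked}
        # Spread the remaining positions evenly over the remaining passes
        # (ceiling division), so the last pass always finishes the block.
        passes_left = steps - step
        k = -(-len(masked) // passes_left)
        for i in sorted(masked, key=scores.get, reverse=True)[:k]:
            block[i] = target[i]  # a real model would sample a token here
        history.append(render())
    return history


if __name__ == "__main__":
    for line in toy_denoise("hello diffusion!", steps=4):
        print(line)
```

With a 16-character target and 4 passes, each pass fills 4 positions, so the block is done in 4 steps rather than 16. A real diffusion LM would also be able to re-mask and rewrite earlier choices, which this sketch leaves out.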
Turn any Autoregressive LLM into a Diffusion LM. dLLM is a Python library that unifies the training & evaluation of diffusion language models. You can also use it to turn ANY autoregressive LM into a diffusion LM with minimal compute. 100% open-source.
Alok retweeted
Jun 11
Autoregressive LLMs are officially on notice. Run the Gemma 4 26B diffusion GGUF with llama.cpp.

Google just dropped DiffusionGemma-26B, and it completely flips how we generate text. Instead of predicting words one by one, it generates 256 tokens in parallel using bidirectional attention. It's like Stable Diffusion, but for language. The model starts with random text "noise" and iteratively refines and self-corrects the entire block in real time to fix formatting and reasoning errors on the fly.

Since it's a Mixture of Experts (MoE) that only activates 3.8B parameters during inference, it fits perfectly on consumer hardware. You can run the Q4_K_M quant with an 18GB VRAM budget on a single RTX 3090 or RTX 4090 with exceptional throughput.

Tested on Ubuntu 22 with CUDA 13.1 using the cutting-edge experimental llama.cpp branch. Here is how to compile and run it with the live terminal denoising visualizer:

# 1. Clone & check out the experimental PR (#24423)
git clone github.com/ggml-org/llama.cp… && cd llama.cpp
git fetch origin pull/24423/head:diffusiongemma
git checkout diffusiongemma

# 2. Build with CUDA support
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native
cmake --build build -j $(nproc) --config Release --target llama-diffusion-cli

# 3. Run with live visual denoising (llama.cpp flags)
./build/bin/llama-diffusion-cli \
  -m /path/to/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -ngl 99 -cnv -n 2048 --diffusion-visual

Watch the video below to see the live --diffusion-visual canvas iteratively denoising the prompt output in real time.

The guide and Unsloth's Hugging Face GGUF model links are in the comments below!

Is autoregressive generation officially legacy tech? Let me know what you think.
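A rough back-of-envelope check on the 18GB figure, as a small Python sketch. The 4.85 bits-per-weight value is an assumption (a commonly quoted ballpark for Q4_K_M, not a number from this thread or from the model card), and `weight_gib` is a made-up helper. Note that with an MoE all experts normally have to sit in memory, so the 26B total drives the footprint while the 3.8B active parameters mostly drive per-token compute.

```python
def weight_gib(params_billion, bits_per_weight):
    """Approximate size of the model weights in GiB.

    Ignores the KV cache, activations, and runtime buffers, which need
    extra memory on top of the weights.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # Assumed: ~4.85 bits per weight for a Q4_K_M quant (ballpark only).
    print(f"26B @ ~4.85 bpw: {weight_gib(26, 4.85):.1f} GiB")
    # For comparison, unquantized 16-bit weights.
    print(f"26B @ 16 bpw:    {weight_gib(26, 16):.1f} GiB")
```

Under that assumption the weights alone land somewhere in the mid-teens of GiB, which would leave a few GB of an 18GB budget for cache and buffers. Actual file sizes depend on how each tensor is quantized, so treat this as an estimate only.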
Meet DiffusionGemma! An experimental open model that explores a fast approach to text generation, released under an Apache 2.0 license. It moves beyond sequential, token-by-token processes to generate entire blocks of text simultaneously. Here’s what’s new with DiffusionGemma: 👇
Syntax Diffusion retweeted
I created a quick demo project of the new Gemma Diffusion model! It's really interesting to watch and play with diffusion LLMs. I personally think it's going to be an awesome class of its own. All quants available via the app inside. Link below