Wrote about MiniMax Sparse Attention. The TL;DR: the headline numbers are the least interesting part of the paper.
28.4x attention FLOP reduction, 14.2x inference speedup, a million tokens of context. Real numbers. But they follow almost mechanically once you commit to blockwise selection and a 2,048-token budget per query. At a million tokens, that's attending to ~0.2% of the sequence. The speedup is large precisely because the budget is small. The whole thing is a bet that relevant information lives in a tiny selectable subset of the past. Correct for retrieval and agentic histories. Weaker, honestly, the more diffuse your dependencies get. "Broadly maintained" is load-bearing exactly there.
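For anyone who wants to check the ~0.2% figure, here is a minimal sketch of the arithmetic. The function name and the extra context lengths are mine, purely for illustration; only the 2,048-token budget and the million-token context come from the paper's headline setup.

```python
def attended_fraction(context_len: int, budget: int = 2048) -> float:
    """Fraction of the sequence one query attends to under a fixed token budget."""
    # A query can never attend to more tokens than exist, hence the min().
    return min(budget, context_len) / context_len


# Illustrative context lengths (assumed, not from the paper) plus the 1M case.
for n in (8_192, 131_072, 1_000_000):
    print(f"{n:>9,} tokens -> {attended_fraction(n):.4%} attended")
```

The budget is constant, so the attended share shrinks as context grows. That is the whole bet in one line of division.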
The actual contribution is the training recipe. Sparse attention isn't hard to define; it's hard to train, because the indexer and the backbone are coupled and the gradients want to do unpleasant things. Three pathologies, each with a fix that's more interesting than the failure:
The indexer needs a teacher. They distill it against the Main Branch attention distribution, so the cheap selector only has to rank blocks, not reproduce attention from scratch. Much easier learning problem.
The auxiliary loss eats the backbone. Let the distillation KL gradient flow into the model and you've quietly changed its objective. Gradient spikes, general-ability degradation. Detach it. The before/after is the part a parity table would hide.
The indexer is a moving target. Early in training the Main Branch entropy collapses, so the student is chasing a teacher mid-seizure. Brief full-attention warmup fixes it. With the caveat the authors keep: "within the reported training range." That's the honest version, and it's the one I'd repeat.
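A toy sketch of what "distill, but detach" means, in plain Python with no autograd. Everything here is an assumption for illustration: the function names, the three-block example, and the learning rate are mine, not the paper's code. The teacher distribution is treated as a constant, so the only thing the KL loss can move is the indexer's own logits; in an autograd framework that would correspond to a stop-gradient on the teacher side.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def indexer_distill_step(teacher_block_probs, indexer_logits, lr=0.5):
    """One gradient step on the indexer only.

    teacher_block_probs is read but never updated: that is the "detach".
    For q = softmax(s), the gradient of KL(p || q) with respect to s is q - p.
    """
    q = softmax(indexer_logits)
    loss = kl_divergence(teacher_block_probs, q)
    grad = [qi - pi for qi, pi in zip(q, teacher_block_probs)]
    new_logits = [s - lr * g for s, g in zip(indexer_logits, grad)]
    return loss, new_logits


# Hypothetical teacher: per-block attention mass from the full-attention branch.
teacher = [0.7, 0.2, 0.1]
logits = [0.0, 0.0, 0.0]
for _ in range(200):
    loss, logits = indexer_distill_step(teacher, logits)
```

After the loop the indexer ranks the blocks the way the teacher does, and the teacher itself has not been touched. The student only has to learn the ranking, which is the easier problem the recipe is built around.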
Two things I respect. They kept a negative result, a learnable sink that partially worked and didn't ship. And the key comparison is FLOP-matched against a sliding window, not dense, which is the only comparison that isolates whether the selection machinery earns its keep. It does. 🧮
The speedup is what gets you to read it. The detach-and-warmup recipe is what makes it worth having read.
Full piece →
michaelchiesa.substack.com/
#AI #MachineLearning #LLMs