MiniMax-M1 hits 56% on SWE-bench Verified at a reported 30% of DeepSeek R1's compute cost.
The architecture: a hybrid Mixture-of-Experts model with Lightning Attention.
This isn't an isolated result. Mixtral 8x7B (about 47B total, 13B active), Qwen3.5-397B-A17B (397B total, 17B active), Mistral Large 3 (675B total, 41B active), GLM-5 (744B total, 40B active): every serious frontier model is now sparse by default.
The pattern is the same across all of them: massive total parameter counts for capacity, small active parameter counts to keep per-token inference cost down. You get the knowledge surface of a 700B model at roughly the per-token compute cost of a 40B one.
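To make the total-versus-active arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the parameter counts quoted above. The "2 FLOPs per active parameter per token" rule is a common approximation for a forward pass, not a measured figure, and the function names are illustrative.

```python
# Total and active parameter counts (in billions) as quoted in the text.
MODELS = {
    "Qwen3.5-397B-A17B": (397, 17),
    "Mistral Large 3": (675, 41),
    "GLM-5": (744, 40),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters that take part in one token's forward pass."""
    return active_b / total_b


def flops_per_token(params_b: float) -> float:
    """Rule-of-thumb forward cost: about 2 FLOPs per parameter used (an assumption)."""
    return 2 * params_b * 1e9


for name, (total_b, active_b) in MODELS.items():
    frac = active_fraction(total_b, active_b)
    # Ratio of a hypothetical dense model of the same total size to the sparse one.
    saving = flops_per_token(total_b) / flops_per_token(active_b)
    print(f"{name}: {frac:.1%} of weights active, ~{saving:.0f}x fewer FLOPs per token")
```

The saving applies to compute only. All of the total parameters still have to sit in accelerator memory (or be fetched on demand), which is part of why the serving side is harder than the FLOP count suggests.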
Dense decoder-only transformers aren't dead. They're just no longer the frontier architecture. That transition happened quietly, without a press release.
The unsolved problem is infrastructure. vLLM and SGLang were built around dense activation patterns. Sparse, conditional routing creates load imbalance across experts, unpredictable memory access, and expert parallelism overhead that can erase the theoretical compute savings in production.
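A toy simulation shows why load imbalance matters. The sketch below is not how vLLM or SGLang route tokens. It assumes a made-up gate where each expert has a fixed popularity bias plus per-token noise, sends every token to its top-k experts, and reports the busiest expert's load relative to the average. Under expert parallelism, the busiest expert tends to set the step time, so a ratio of 2.0 means roughly half the expert compute sits idle. All names and parameters here are hypothetical.

```python
import random


def route_tokens(num_tokens, num_experts, top_k, skew, seed=0):
    """Send each token to its top_k experts and return tokens-per-expert counts.

    Scores are a fixed per-expert bias (scaled by `skew`) plus per-token
    Gaussian noise: a toy stand-in for a learned router, not a real one.
    """
    rng = random.Random(seed)
    bias = [skew * rng.gauss(0, 1) for _ in range(num_experts)]
    load = [0] * num_experts
    for _ in range(num_tokens):
        scores = [b + rng.gauss(0, 1) for b in bias]
        # Indices of the top_k highest-scoring experts for this token.
        ranked = sorted(range(num_experts), key=scores.__getitem__, reverse=True)
        for expert in ranked[:top_k]:
            load[expert] += 1
    return load


def imbalance(load):
    """Busiest expert's load divided by the mean load; 1.0 is perfectly balanced."""
    return max(load) / (sum(load) / len(load))


if __name__ == "__main__":
    for skew in (0.0, 1.0, 3.0):
        load = route_tokens(num_tokens=2000, num_experts=8, top_k=2, skew=skew)
        print(f"skew={skew}: load={load}, imbalance={imbalance(load):.2f}")
```

With zero skew the ratio stays close to 1.0; as the bias grows, a few experts absorb most of the traffic and the ratio climbs toward its ceiling of num_experts / top_k.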
So the research frontier has settled. The next 12 months aren't about whether MoE wins; it already did. They're about whether inference infrastructure can actually exploit sparse activation at scale without giving back the efficiency gains on real hardware.
Training compute optimization is largely a solved problem at this point. The frontier has moved to inference-time scaling. Same architectural pressure, different phase.