Congrats to @MiniMax_AI on releasing MiniMax M3! Frontier coding and agentic capabilities, native image and video input, computer use, and a 1M-token context window, all in a single open model.
At the heart of M3 is MSA, a new sparse attention architecture: instead of attending densely over the full KV cache, each query scores 128-token KV blocks and runs attention only over the top blocks. That is what makes 1M-token context practical to serve.
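For intuition, here is a toy single-query sketch of block-sparse attention in plain Python. It is not the M3 or vLLM implementation. The 128-token block size comes from the post; the scoring rule (query dotted with the mean key of each block), the `top_k` value, and the function name `block_sparse_attention` are assumptions made for illustration only.

```python
import math
import random

BLOCK_SIZE = 128  # block size stated in the post


def block_sparse_attention(q, keys, values, top_k, block_size=BLOCK_SIZE):
    """Attend for one query over only the top_k highest-scoring KV blocks.

    Assumption: a block's score is the dot product of the query with the
    mean of that block's keys. The post does not say how M3 scores blocks.
    """
    dim = len(q)
    n_blocks = math.ceil(len(keys) / block_size)

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # 1. Score every block using a cheap summary of its keys.
    block_scores = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        block_scores.append(dot(q, mean_key))

    # 2. Keep only the top_k blocks (indices returned in ascending order).
    selected = sorted(range(n_blocks), key=lambda b: block_scores[b],
                      reverse=True)[:top_k]
    selected.sort()

    # 3. Run ordinary softmax attention over tokens in the selected blocks.
    token_ids = [t for b in selected
                 for t in range(b * block_size,
                                min((b + 1) * block_size, len(keys)))]
    logits = [dot(q, keys[t]) / math.sqrt(dim) for t in token_ids]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    out = [sum(w * values[t][d] for w, t in zip(weights, token_ids)) / total
           for d in range(dim)]
    return out, selected


random.seed(0)
dim, n_tokens = 8, 1024
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
q = [random.gauss(0, 1) for _ in range(dim)]

out, selected = block_sparse_attention(q, keys, values, top_k=2)
print(len(selected), "of", math.ceil(n_tokens / BLOCK_SIZE), "blocks attended")
# prints "2 of 8 blocks attended"
```

The per-query cost of step 3 depends on `top_k * block_size` rather than on the full cache length. At 1M tokens and 128-token blocks there are roughly 8,000 blocks to score, but attention itself touches only the selected few.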
M3 runs in vLLM with day-0 support, verified on NVIDIA and AMD hardware:
✨ MSA sparse attention with dedicated prefill and decode kernels
✨ 1M-token context serving with prefix caching and chunked prefill
✨ BF16 and MXFP8 checkpoints, with MoE backends for both Hopper and Blackwell
✨ Native multimodal input (image and video)
✨ Tool calling, reasoning parsing, and thinking-mode control for agent workloads
Day-0 support like this is a true team effort. Grateful to the teams at @MiniMax_AI, @NVIDIAAI, @AIatAMD, and @inferact, and to the vLLM community for making it happen.
Deep dive into the implementation, kernel work, and deployment recipes:
vllm.ai/blog/2026-06-12-mini…