MiniMax-M3 combines 1M context, native multimodality, and MiniMax Sparse Attention.
The next layer is serving it efficiently: KV-block-major sparse attention, paged MSA decode, optimized index scoring, and multimodal preprocessing before the GPU worker.
Together’s Inference and Kernel teams improved throughput by 81–125% across common agentic traffic shapes.
The full deep dive comes from @ywangfirstlean, @zhyncs42, @realDanFu, and the team.