MiniMax Sparse Attention (MSA)
The attention architecture behind MiniMax-M3. MSA is a learned blockwise sparse attention built on GQA. A lightweight Index Branch scores KV blocks and selects a top-k subset per GQA group; a Main Branch then runs exact block-sparse attention over only the selected blocks. Both branches are backed by a co-designed GPU kernel (exp-free top-k, KV-outer sparse attention).
On a 109B multimodal model, the paper reports MSA matching dense GQA quality while cutting per-token attention compute by ~28.4× at 1M context, with wall-clock speedups of 14.2× for prefill and 7.6× for decode on H800. This is the efficiency basis for shipping a 1M-context production model.
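
To make the score-select-attend flow concrete, here is a minimal NumPy sketch of a single decode step. It is an illustration under stated assumptions, not the paper's implementation: the block scoring rule (mean-pooled keys dotted with the group-mean query), the function names `sparse_decode_step` and `dense_decode_step`, and the tensor layout are all hypothetical, and nothing here models the GPU kernel.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dense_decode_step(q, k, v):
    """Reference dense GQA attention for one query token.

    q: (G, R, d) -- G KV groups, R query heads per group.
    k, v: (G, T, d) -- one KV head per group, T cached tokens.
    """
    d = q.shape[-1]
    logits = np.einsum("grd,gtd->grt", q, k) / np.sqrt(d)
    return np.einsum("grt,gtd->grd", softmax(logits), v)


def sparse_decode_step(q, k, v, block_size, top_k):
    """Blockwise sparse attention for one query token (illustrative only)."""
    G, R, d = q.shape
    T = k.shape[1]
    assert T % block_size == 0, "sketch assumes T is a multiple of block_size"
    n_blocks = T // block_size

    k_blocks = k.reshape(G, n_blocks, block_size, d)
    v_blocks = v.reshape(G, n_blocks, block_size, d)

    # Index branch (ASSUMED scoring rule): one summary key per block,
    # scored against the mean query of each GQA group.
    block_keys = k_blocks.mean(axis=2)                      # (G, n_blocks, d)
    q_group = q.mean(axis=1)                                # (G, d)
    scores = np.einsum("gd,gnd->gn", q_group, block_keys)   # (G, n_blocks)

    # Per-group top-k block ids, taken on raw scores and kept in
    # positional order.
    top = np.sort(np.argsort(-scores, axis=-1)[:, :top_k], axis=-1)

    # Main branch: exact softmax attention over the selected blocks only.
    g_idx = np.arange(G)[:, None]
    k_sel = k_blocks[g_idx, top].reshape(G, top_k * block_size, d)
    v_sel = v_blocks[g_idx, top].reshape(G, top_k * block_size, d)
    logits = np.einsum("grd,gtd->grt", q, k_sel) / np.sqrt(d)
    out = np.einsum("grt,gtd->grd", softmax(logits), v_sel)
    return out, top
```

In this sketch all query heads in a group share one block selection, so the main branch touches `top_k * block_size` keys per group instead of all `T`. Setting `top_k` to the total number of blocks reduces it to dense attention, which makes a convenient sanity check.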