MiniMax Sparse Attention (MSA)

MSA is the attention architecture behind MiniMax-M3. It is a learned blockwise sparse attention built on GQA (grouped-query attention): a lightweight Index Branch scores KV blocks and selects a top-k subset per GQA group, and a Main Branch then runs exact block-sparse attention over only the selected blocks. The design is paired with a co-designed GPU kernel (exp-free top-k, KV-outer sparse attention).
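The two-branch flow can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the scoring rule (mean query of the group dotted with the mean key of each block), the function names `select_blocks` and `sparse_attention`, and the single-group, non-causal setup are all assumptions made here for clarity. The summary above does not say how the learned Index Branch actually computes its scores.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_blocks(group_queries, keys, block_size, top_k):
    """Index branch (assumed form): score each KV block and keep the top_k.

    The score here is a stand-in: mean query of the group dotted with the
    mean key of the block. The real index branch is learned.
    """
    dim = len(keys[0])
    q_mean = [sum(q[i] for q in group_queries) / len(group_queries)
              for i in range(dim)]
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        k_mean = [sum(k[i] for k in block) / len(block) for i in range(dim)]
        scores.append(dot(q_mean, k_mean))
    # Ranking uses raw scores only; no exp/softmax is needed to pick a top-k.
    ranked = sorted(range(len(scores)), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:top_k])


def sparse_attention(group_queries, keys, values, block_size, top_k):
    """Main branch: exact softmax attention over the selected blocks only.

    All query heads in the group share one block selection.
    """
    blocks = select_blocks(group_queries, keys, block_size, top_k)
    idx = [i for b in blocks
           for i in range(b * block_size,
                          min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in group_queries:
        weights = softmax([dot(q, keys[i]) * scale for i in idx])
        outputs.append([sum(w * values[i][d] for w, i in zip(weights, idx))
                        for d in range(len(values[0]))])
    return blocks, outputs


if __name__ == "__main__":
    queries = [[1.0, 0.0]]                      # one query head in the group
    keys = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
    values = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
    print(sparse_attention(queries, keys, values, block_size=2, top_k=1))
```

With `top_k` equal to the number of blocks, the sketch reduces to ordinary dense attention over all keys, which is a convenient sanity check for any block-sparse variant.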

On a 109B multimodal model, the paper reports that MSA matches dense GQA quality while cutting per-token attention compute by about 28.4× at 1M context, with wall-clock speedups of 14.2× for prefill and 7.6× for decode on H800. This is the efficiency basis for shipping a 1M-context production model.
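To see why the saving grows with context length, here is a rough per-query-token cost model. It is a back-of-the-envelope sketch, not the paper's accounting: the function name `attention_cost_ratio`, the `index_dim` parameter, and the assumption that the index branch costs one `index_dim`-wide dot product per block are all hypothetical, and the example values are arbitrary rather than MiniMax-M3's configuration.

```python
import math


def attention_cost_ratio(context_len, block_size, top_k, head_dim, index_dim):
    """Dense cost divided by sparse cost for one query token (assumed model).

    Dense: one head_dim-wide dot product per key in the context.
    Sparse: one per key in the selected blocks, plus one index_dim-wide
    dot product per block for the index branch.
    """
    num_blocks = math.ceil(context_len / block_size)
    dense = context_len * head_dim
    selected_keys = min(top_k * block_size, context_len)
    sparse = selected_keys * head_dim + num_blocks * index_dim
    return dense / sparse


if __name__ == "__main__":
    # Arbitrary illustrative values, not taken from the paper.
    for n in (1024, 2048, 4096):
        print(n, attention_cost_ratio(n, block_size=64, top_k=4,
                                      head_dim=128, index_dim=32))
```

Under this model the main-branch cost is fixed by `top_k * block_size`, so the ratio rises with context length until the per-block index cost starts to dominate.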

Paper (arXiv)

arXiv HTML

architecture, efficiency, research
