A 456B-parameter MoE model series featuring a 4-million-token context window and the Lightning Attention architecture. The work is highly influential for its Batch Size Scaling (BSS) strategy, which aligns the training batch size with the Critical Batch Size (CBS), modeled as a power law in training loss. By starting small (16M tokens) and scaling to 128M tokens in discrete steps, MiniMax-01 improves Model FLOPs Utilization (MFU) while preserving training stability.
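The BSS schedule described above can be sketched as follows. The power-law coefficient and exponent here are illustrative assumptions, not the values fitted in the MiniMax-01 paper; only the shape of the rule (CBS grows as loss falls, batch size quantized to discrete steps from 16M to 128M tokens) comes from the description.

```python
# Hedged sketch of Batch Size Scaling (BSS): the batch size tracks the
# Critical Batch Size, modeled as a power law in training loss, quantized
# to discrete steps between 16M and 128M tokens.

def critical_batch_size(loss, coeff=2.0e8, alpha=3.0):
    # CBS ~ coeff * loss^(-alpha): lower loss -> larger critical batch.
    # coeff and alpha are illustrative, not the paper's fitted constants.
    return coeff * loss ** (-alpha)

def scheduled_batch_size(loss, steps=(16e6, 32e6, 64e6, 128e6)):
    # Pick the largest discrete step not exceeding the current CBS,
    # never dropping below the smallest step (16M tokens).
    cbs = critical_batch_size(loss)
    chosen = steps[0]
    for step in steps:
        if step <= cbs:
            chosen = step
    return chosen

for loss in (3.0, 2.2, 1.8, 1.4, 1.1):
    print(f"loss={loss}: batch={int(scheduled_batch_size(loss)):,} tokens")
```

As training loss falls, the schedule steps the batch size up, keeping it near (but below) the critical batch size where gradient noise stops paying for itself.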

This "fast catch-up" approach has become a 2026 industry standard, adopted by models like Arcee Trinity Large and mathematically validated in Fast Catch-Up, Late Switching.

Outputs (3)

MiniMax-Text-01

model

456B parameter MoE model with a 4-million-token context window.

Architecture: MoE
Parameters: 456B
Context window: 4,000,000 tokens

MiniMax-VL-01

model

Vision-language model using a ViT-MLP-LLM framework, with a 303M-parameter ViT encoder and MiniMax-Text-01 as the base LLM. Trained on 512B vision-language tokens; matches GPT-4o and Claude-3.5-Sonnet performance.

Architecture: MoE
Parameters: 456B
Context window: 4,000,000 tokens

MiniMax-01: Scaling Foundation Models with Lightning Attention

paper

Foundational paper for the linear-attention architecture used across MiniMax models.
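Lightning Attention is an I/O-aware implementation of linear attention. A minimal NumPy sketch of the underlying linear-attention identity, softmax(QKᵀ)V replaced by φ(Q)(φ(K)ᵀV), which cuts complexity from O(n²d) to O(nd²); the feature map φ(x) = elu(x)+1 and the shapes are illustrative assumptions, not the paper's tiled kernel:

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    # Linear attention: out = phi(Q) @ (phi(K).T @ V) / normalizer.
    # Because phi(K).T @ V is a (d, d_v) matrix built once, cost is
    # linear in sequence length n instead of quadratic.
    # phi(x) = elu(x) + 1 is a common positive feature map (an assumption;
    # Lightning Attention's contribution is its I/O-aware kernel design).
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                 # (d, d_v): summarizes all keys/values
    Z = Qf @ Kf.sum(axis=0)       # per-query normalizer
    return (Qf @ KV) / (Z[:, None] + eps)
```

The result matches explicitly row-normalizing φ(Q)φ(K)ᵀ and multiplying by V, but never materializes the n×n attention matrix, which is what makes multi-million-token contexts tractable.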

arXiv: 2501.08313

moe, scaling, open-weight, architecture, attention, efficiency, multimodal
