A 456B-parameter MoE model series featuring a 4-million-token context window and the Lightning Attention architecture. The work is highly influential for its Batch Size Scaling (BSS) strategy, which aligns the training batch size with the Critical Batch Size (CBS), modeled as a power law in training loss. By starting small (16M tokens) and scaling to 128M tokens in discrete steps, MiniMax-01 improves Model FLOPs Utilization (MFU) while preserving training stability.
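The BSS schedule described above can be sketched as follows. The power-law coefficient and exponent here are illustrative assumptions, not the values fitted in the MiniMax-01 paper; only the shape of the rule (CBS grows as loss falls, batch size quantized to discrete steps from 16M to 128M tokens) comes from the description.

```python
# Hedged sketch of Batch Size Scaling (BSS): the batch size tracks the
# Critical Batch Size, modeled as a power law in training loss, quantized
# to discrete steps between 16M and 128M tokens.

def critical_batch_size(loss, coeff=2.0e8, alpha=3.0):
    # CBS ~ coeff * loss^(-alpha): lower loss -> larger critical batch.
    # coeff and alpha are illustrative, not the paper's fitted constants.
    return coeff * loss ** (-alpha)

def scheduled_batch_size(loss, steps=(16e6, 32e6, 64e6, 128e6)):
    # Pick the largest discrete step not exceeding the current CBS,
    # never dropping below the smallest step (16M tokens).
    cbs = critical_batch_size(loss)
    chosen = steps[0]
    for step in steps:
        if step <= cbs:
            chosen = step
    return chosen

for loss in (3.0, 2.2, 1.8, 1.4, 1.1):
    print(f"loss={loss}: batch={int(scheduled_batch_size(loss)):,} tokens")
```

As training loss falls, the schedule steps the batch size up, keeping it near (but below) the critical batch size where gradient noise stops paying for itself.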

This "fast catch-up" approach has become a 2026 industry standard, adopted by models like Arcee Trinity Large and mathematically validated in Fast Catch-Up, Late Switching.

Outputs (3)

MiniMax-Text-01

model

456B parameter MoE model with a 4-million-token context window.

Architecture: MoE
Parameters: 456B
Context window: 4,000,000 tokens

MiniMax-VL-01

model

Vision-language model using a ViT-MLP-LLM framework, with a 303M-parameter ViT encoder and MiniMax-Text-01 as the base LLM. Trained on 512B vision-language tokens; matches GPT-4o and Claude-3.5-Sonnet performance.

Architecture: MoE
Parameters: 456B
Context window: 4,000,000 tokens

MiniMax-01: Scaling Foundation Models with Lightning Attention

paper

Foundational paper for the linear-attention architecture used across MiniMax models.
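Lightning Attention is an I/O-aware implementation of linear attention. A minimal NumPy sketch of the underlying linear-attention identity, softmax(QKᵀ)V replaced by φ(Q)(φ(K)ᵀV), which cuts complexity from O(n²d) to O(nd²); the feature map φ(x) = elu(x)+1 and the shapes are illustrative assumptions, not the paper's tiled kernel:

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    # Linear attention: out = phi(Q) @ (phi(K).T @ V) / normalizer.
    # Because phi(K).T @ V is a (d, d_v) matrix built once, cost is
    # linear in sequence length n instead of quadratic.
    # phi(x) = elu(x) + 1 is a common positive feature map (an assumption;
    # Lightning Attention's contribution is its I/O-aware kernel design).
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                 # (d, d_v): summarizes all keys/values
    Z = Qf @ Kf.sum(axis=0)       # per-query normalizer
    return (Qf @ KV) / (Z[:, None] + eps)
```

The result matches explicitly row-normalizing φ(Q)φ(K)ᵀ and multiplying by V, but never materializes the n×n attention matrix, which is what makes multi-million-token contexts tractable.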

arXiv: 2501.08313

moe, scaling, open-weight, architecture, attention, efficiency, multimodal
