MARS is a unified optimization framework that brings variance reduction back into contention for large-model training, where Adam-family optimizers such as AdamW have dominated for roughly a decade. It reconciles preconditioned gradient methods with variance reduction through a scaled stochastic recursive momentum technique, and instantiates the framework as three concrete optimizers built on AdamW, Lion, and Shampoo updates respectively.
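To make the scaled stochastic recursive momentum idea concrete, here is a minimal sketch of what one AdamW-style step with a variance-reduced gradient could look like. It is an illustration under assumptions, not the authors' reference implementation: the function names `mars_adamw_step` and `demo`, the `gamma * beta1 / (1 - beta1)` scaling of the gradient difference, the unit-norm clipping, and all default hyperparameter values are not stated in this summary and should be checked against the paper.

```python
import math
import random


def mars_adamw_step(x, m, v, grad, grad_prev, t, lr=0.1, beta1=0.9,
                    beta2=0.99, gamma=0.025, eps=1e-8, weight_decay=0.0):
    """One AdamW-style step driven by a variance-reduced gradient (sketch).

    grad      : stochastic gradient at the current point, on the current batch
    grad_prev : stochastic gradient at the previous point, on the SAME batch
    t         : 1-based step counter, used for Adam bias correction
    """
    # Assumed correction: add a scaled difference of same-batch gradients.
    scale = gamma * beta1 / (1.0 - beta1)
    c = [g + scale * (g - gp) for g, gp in zip(grad, grad_prev)]

    # Assumed stabilizer: clip the corrected gradient to unit L2 norm.
    norm = math.sqrt(sum(ci * ci for ci in c))
    if norm > 1.0:
        c = [ci / norm for ci in c]

    # Adam moment estimates, fed with c instead of the raw gradient.
    m = [beta1 * mi + (1 - beta1) * ci for mi, ci in zip(m, c)]
    v = [beta2 * vi + (1 - beta2) * ci * ci for vi, ci in zip(v, c)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]

    # Preconditioned update with decoupled weight decay, as in AdamW.
    x = [xi - lr * (mh / (math.sqrt(vh) + eps) + weight_decay * xi)
         for xi, mh, vh in zip(x, m_hat, v_hat)]
    return x, m, v


def demo(steps=200, seed=0):
    """Toy problem: minimize E[0.5 * ||x - noise||^2], optimum near 0."""
    rng = random.Random(seed)
    x = [5.0, -3.0]
    x_prev = list(x)
    m, v = [0.0, 0.0], [0.0, 0.0]
    for t in range(1, steps + 1):
        noise = [rng.gauss(0.0, 0.1) for _ in x]
        # Both gradients use the same noise sample (the same "batch").
        grad = [xi - ni for xi, ni in zip(x, noise)]
        grad_prev = [xi - ni for xi, ni in zip(x_prev, noise)]
        x_prev = x
        x, m, v = mars_adamw_step(x, m, v, grad, grad_prev, t)
    return x
```

Calling `demo()` runs the step on a noisy quadratic. Because both gradients are evaluated on the same sample, the noise cancels in their difference, which is the property the correction term relies on. The practical cost is a second gradient evaluation per step (or an approximation of it), since the previous iterate must be evaluated on the current batch.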

Variance reduction methods (SAG, SVRG, STORM) have historically failed to outperform AdamW in deep learning. The authors attribute this partly to techniques such as batch normalization and dropout, which break the finite-sum structure that variance reduction relies on. Because these techniques are now rarely used in LLM training, the door is open again. In GPT-2 pretraining experiments, MARS consistently outperforms AdamW by a large margin.

Collaboration between UCLA (Quanquan Gu's group) and ByteDance Seed (San Jose + Beijing); work led by Huizhuo Yuan and Yifeng Liu during Liu's ByteDance internship. ICML 2025 (PMLR 267); latest v4 published 2025-09-04.

Paper

optimization, training, research