Simplified Mixture-of-Experts routing for Transformers. Routes each token to a single expert (versus the top-k, k ≥ 2, routing of earlier MoE work), achieving up to 7x pre-training speedup over T5-Base at equal computational cost. First to train large sparse models in bfloat16. Scaled to 1.6 trillion parameters.
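
A minimal sketch of the top-1 ("switch") routing idea, written in JAX. Names and shapes here (switch_route, switch_layer, expert_ffns) are illustrative, not the paper's Mesh-TensorFlow implementation; the float32 router and the load-balancing auxiliary loss follow the paper's description, while the dense "run every expert, then select" step is a simplification of the real capacity-limited dispatch.

```python
import jax
import jax.numpy as jnp

def switch_route(tokens, w_router, num_experts):
    """tokens: [num_tokens, d_model], w_router: [d_model, num_experts]."""
    # Router probabilities; the paper computes these in float32 for stability
    # even when the rest of the model runs in bfloat16.
    logits = jnp.dot(tokens.astype(jnp.float32), w_router.astype(jnp.float32))
    probs = jax.nn.softmax(logits, axis=-1)      # [num_tokens, num_experts]

    # Top-1 routing: each token is sent to exactly one expert.
    expert_index = jnp.argmax(probs, axis=-1)    # [num_tokens]
    gate = jnp.max(probs, axis=-1)               # probability of the chosen expert

    # Load-balancing auxiliary loss: num_experts * sum_i f_i * P_i, where f_i is
    # the fraction of tokens routed to expert i and P_i is the mean router
    # probability for expert i (scaled by a small coefficient alpha in training).
    f = jnp.mean(jax.nn.one_hot(expert_index, num_experts), axis=0)
    P = jnp.mean(probs, axis=0)
    aux_loss = num_experts * jnp.sum(f * P)
    return expert_index, gate, aux_loss

def switch_layer(tokens, w_router, expert_ffns):
    """One Switch FFN layer: route, apply the chosen expert, scale by the gate."""
    num_experts = len(expert_ffns)
    expert_index, gate, aux_loss = switch_route(tokens, w_router, num_experts)
    # For clarity, run every expert on all tokens and select per token; real
    # implementations dispatch tokens with a capacity-limited all-to-all instead.
    all_outputs = jnp.stack([ffn(tokens) for ffn in expert_ffns])   # [E, T, d]
    selected = jnp.take_along_axis(
        all_outputs, expert_index[None, :, None], axis=0)[0]        # [T, d]
    return gate[:, None] * selected, aux_loss
```

Because each token activates only one expert FFN, per-token compute stays constant as the number of experts (and hence total parameter count) grows, which is what lets the architecture scale to trillion-parameter sizes.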

Switch Transformer made MoE practical at scale and is foundational to modern sparse architectures used by Mixtral, DeepSeek, Gemini, and others. Published in JMLR 2022 by Fedus, Zoph, and Shazeer.

Paper

arXiv: 2101.03961

Venue: JMLR 2022

Tags: foundational, moe

Related