Auxiliary-Loss-Free Load Balancing Strategy
A foundational paper for modern Mixture-of-Experts (MoE) architectures that introduces the "Loss-Free Balancing" strategy. Instead of paying the traditional "auxiliary loss tax" of a static penalty term, it adds a dynamic, expert-wise bias to each expert's routing score before top-K selection and adjusts that bias after every training step according to the expert's recent load. Because the bias influences only routing decisions and never enters the gradient, it avoids injecting interference gradients into the primary language-modeling objective, enabling a higher performance ceiling without dropping tokens. This strategy is a core innovation in DeepSeek-V3, contributing to its frontier-level efficiency and training stability (see the sketch below).
Paper: arXiv: 2408.15664
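The mechanism is simple enough to sketch. Below is a minimal NumPy illustration of the idea as described in the paper: a per-expert bias shifts the scores used for top-K selection only, the combination weights still come from the unbiased scores, and the bias is nudged after each step toward balancing expert load. Class and parameter names (`LossFreeRouter`, `update_rate`, etc.) are illustrative assumptions, not from any released implementation.

```python
import numpy as np

class LossFreeRouter:
    """Minimal sketch of auxiliary-loss-free load balancing
    (arXiv: 2408.15664). Hyperparameter names are illustrative."""

    def __init__(self, num_experts: int, top_k: int, update_rate: float = 1e-3):
        self.top_k = top_k
        self.update_rate = update_rate
        # Expert-wise bias: lives entirely outside the gradient path.
        self.bias = np.zeros(num_experts)

    def route(self, scores: np.ndarray):
        """scores: (tokens, experts) gating scores, e.g. sigmoid outputs.
        The bias shifts only the top-k *selection*; the returned gate
        weights use the original scores, so routing stays decoupled
        from gradient updates."""
        biased = scores + self.bias
        topk_ids = np.argsort(-biased, axis=-1)[:, : self.top_k]
        gates = np.take_along_axis(scores, topk_ids, axis=-1)
        gates = gates / gates.sum(axis=-1, keepdims=True)
        return topk_ids, gates

    def update_bias(self, topk_ids: np.ndarray) -> None:
        """After each training step, nudge each expert's bias toward
        the mean load: underloaded experts get a higher bias (more
        tokens next step), overloaded experts a lower one."""
        load = np.bincount(topk_ids.ravel(), minlength=self.bias.size)
        error = load.mean() - load          # positive => underloaded
        self.bias += self.update_rate * np.sign(error)

# Toy usage: one routing step for a batch of 16 tokens over 8 experts.
router = LossFreeRouter(num_experts=8, top_k=2)
scores = np.random.rand(16, 8)
ids, gates = router.route(scores)
router.update_bias(ids)
```

The sign-based update, the rule used in the paper's main variant, keeps each adjustment small and stable regardless of how severe the imbalance is, which is what lets the biases converge without any loss term.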