Path-Constrained Mixture-of-Experts
Introduces a novel perspective on sparse MoE by examining expert paths: the sequence of expert selections a token makes across layers. Observes that tokens naturally cluster into a small set of paths aligned with linguistic function, despite a theoretically exponential path space.
Constrains the effective path space through parameter sharing across consecutive layer blocks, producing more concentrated path clusters, better cross-layer consistency, and greater routing robustness, all without auxiliary losses. Tested at 0.9B and 16B scales with consistent improvements on perplexity and downstream tasks. By Gu, Likhomanenko, Thilak, Ramapuram, and Jaitly.
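A minimal sketch (not the paper's implementation) of one way to realize block-wise parameter sharing: reuse a single MoE layer's parameters for each block of consecutive layers, so a token faces the same router repeatedly. The names (`MoELayer`, `BlockSharedMoEStack`), the block-repetition scheme, and the choice to share both router and expert weights within a block are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Standard top-k sparse MoE layer: a router picks k experts per token."""
    def __init__(self, d_model: int, n_experts: int, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); the chosen expert indices are one step of a path
        logits = self.router(x)                        # (tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)     # (tokens, k) each
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e               # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

class BlockSharedMoEStack(nn.Module):
    """n_layers of MoE, but all layers inside a block of `block` consecutive
    layers reuse one MoELayer's parameters (router and experts)."""
    def __init__(self, d_model: int, n_experts: int, n_layers: int, block: int):
        super().__init__()
        assert n_layers % block == 0
        self.shared = nn.ModuleList(
            MoELayer(d_model, n_experts) for _ in range(n_layers // block)
        )
        self.block = block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.shared:
            for _ in range(self.block):  # same parameters applied `block` times
                x = x + layer(x)         # residual connection
        return x

x = torch.randn(8, 64)  # 8 tokens, d_model = 64
model = BlockSharedMoEStack(d_model=64, n_experts=8, n_layers=6, block=2)
print(model(x).shape)   # torch.Size([8, 64])
```

The intuition, under these assumptions: because the router weights repeat within a block and the residual stream changes slowly between adjacent layers, consecutive routing decisions tend to agree, which concentrates tokens onto fewer distinct paths without requiring any auxiliary loss.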