Yuan 2.0-M32
MoE model with 32 experts (2 active per token), 40B total / 3.7B active parameters. Introduced an "Attention Router" for expert selection, achieving a 3.8% accuracy improvement over a classical router. Surpassed Llama3-70B on MATH and ARC-Challenge while requiring 1/19th the compute. Trained on 2T tokens.
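The core idea of an attention-based router, as opposed to a classical linear router, is to let correlations between experts influence routing rather than scoring each expert independently. The sketch below illustrates that general idea only; the matrix names (`Wq`, `Wk`, `Wv`), shapes, and the rescoring step are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, E, r = 16, 32, 8  # hidden dim, number of experts, router dim (all hypothetical)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical router parameters: one embedding per expert plus projections.
M  = rng.normal(size=(E, d))   # expert embeddings
Wq = rng.normal(size=(d, r))   # token -> query
Wk = rng.normal(size=(d, r))   # expert -> key
Wv = rng.normal(size=(d, r))   # expert -> value

def attention_router(x, top_k=2):
    """Sketch: attend from the token to all expert embeddings, build an
    expert-aware context, then score each expert against that context so
    inter-expert correlations shape the final routing weights."""
    q = x @ Wq                            # (r,) token query
    K = M @ Wk                            # (E, r) expert keys
    V = M @ Wv                            # (E, r) expert values
    attn = softmax(q @ K.T / np.sqrt(r))  # (E,) attention over experts
    ctx = attn @ V                        # (r,) context mixing all experts
    logits = V @ ctx                      # (E,) rescore experts vs. context
    idx = np.argsort(logits)[-top_k:]     # keep top-k (2 active in M32)
    weights = softmax(logits[idx])        # renormalize over selected experts
    return idx, weights

x = rng.normal(size=(d,))
idx, weights = attention_router(x)
```

A classical router would instead compute `softmax(x @ W)` with a single projection, scoring each expert independently of the others.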
Model Details
Architecture MoE
Parameters 40B
Active params 3.7B
Paper
arXiv: 2405.17976