Mixtral 8x7B
Popularized the sparse Mixture-of-Experts (MoE) architecture for open-weight models. 46.7B total / 12.9B active parameters per token, with 8 experts per layer and top-2 routing. 32K-token context window.
Outperformed or matched Llama 2 70B and GPT-3.5 on most standard benchmarks (MMLU: 70.6%), and vastly outperformed Llama 2 70B on math, code generation, and multilingual tasks. Released under Apache 2.0. Demonstrated that MoE could deliver frontier-class quality at a fraction of the inference cost of comparable dense models.
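The routing is what keeps only 12.9B of the 46.7B parameters active: for each token at each layer, a small router scores all 8 expert FFNs, keeps the top 2, and mixes their outputs with softmax weights. Below is a minimal sketch of that top-2 routing step in NumPy, using toy dimensions and a plain ReLU MLP in place of Mixtral's SwiGLU experts; the names and sizes are illustrative, not from the released code.

```python
# Minimal top-2 MoE routing sketch (toy dimensions, not Mixtral's real sizes).
import numpy as np

rng = np.random.default_rng(0)

HIDDEN, FFN, NUM_EXPERTS, TOP_K = 16, 64, 8, 2  # toy sizes for illustration

# Each expert is a small 2-layer MLP; the router is a single linear layer.
experts = [
    (rng.standard_normal((HIDDEN, FFN)) * 0.02,
     rng.standard_normal((FFN, HIDDEN)) * 0.02)
    for _ in range(NUM_EXPERTS)
]
router_w = rng.standard_normal((HIDDEN, NUM_EXPERTS)) * 0.02


def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-2 experts and mix their outputs."""
    out = np.zeros_like(x)
    logits = x @ router_w                              # (tokens, NUM_EXPERTS)
    top2 = np.argsort(logits, axis=-1)[:, -TOP_K:]     # best 2 experts per token
    for t in range(x.shape[0]):
        chosen = top2[t]
        # Softmax over only the selected experts' logits gives the mixing weights.
        w = np.exp(logits[t, chosen] - logits[t, chosen].max())
        w /= w.sum()
        for weight, e in zip(w, chosen):
            w_in, w_out = experts[e]
            h = np.maximum(x[t] @ w_in, 0.0)           # ReLU stand-in for SwiGLU
            out[t] += weight * (h @ w_out)
    return out


tokens = rng.standard_normal((4, HIDDEN))              # 4 toy tokens
print(moe_layer(tokens).shape)  # (4, 16): only 2 of 8 experts ran for each token
```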
Model Details
Architecture Sparse MoE
Parameters 46.7B
Active params 12.9B
Context window 32K tokens
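The total/active split above follows directly from the architecture: only 2 of the 8 expert FFNs per layer run for a given token, while attention, router, and embedding weights are always active. A back-of-the-envelope check, assuming the publicly released Mixtral 8x7B configuration (hidden size 4096, FFN size 14336, 32 layers, 32000-token vocabulary, grouped-query attention with 8 KV heads of dimension 128, untied embeddings); norm weights are omitted, so the figures are approximate.

```python
# Approximate parameter count for Mixtral 8x7B under the assumed config above.
HIDDEN, FFN, LAYERS, VOCAB = 4096, 14336, 32, 32000
EXPERTS, TOP_K = 8, 2
KV_HEADS, HEAD_DIM = 8, 128

expert_ffn = 3 * HIDDEN * FFN                       # SwiGLU: gate, up, down projections
attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_HEADS * HEAD_DIM  # Wq/Wo full, Wk/Wv grouped
router = HIDDEN * EXPERTS                           # one linear router per layer
embeddings = 2 * VOCAB * HIDDEN                     # input embeddings + untied LM head

shared = LAYERS * (attention + router) + embeddings # weights used by every token

total = shared + LAYERS * EXPERTS * expert_ffn      # all 8 experts counted
active = shared + LAYERS * TOP_K * expert_ffn       # only the top-2 experts run per token

print(f"total  ~ {total / 1e9:.1f}B")   # ~46.7B
print(f"active ~ {active / 1e9:.1f}B")  # ~12.9B
```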
Paper
arXiv: 2401.04088