Mixtral 8x7B
Popularized the sparse Mixture-of-Experts (MoE) architecture for open-weight models. 46.7B total / 12.9B active parameters per token, with 8 experts per layer and top-2 routing. 32K-token context window.
Outperformed or matched Llama 2 70B and GPT-3.5 on most standard benchmarks (MMLU: 70.6%), and vastly outperformed Llama 2 70B on math, code generation, and multilingual tasks. Released under Apache 2.0. Demonstrated that MoE could deliver frontier-class quality at a fraction of the inference cost of comparable dense models.
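The routing is what keeps only 12.9B of the 46.7B parameters active: for each token at each layer, a small router scores all 8 expert FFNs, keeps the top 2, and mixes their outputs with softmax weights. Below is a minimal sketch of that top-2 routing step in NumPy, using toy dimensions and a plain ReLU MLP in place of Mixtral's SwiGLU experts; the names and sizes are illustrative, not from the released code.

```python
# Minimal top-2 MoE routing sketch (toy dimensions, not Mixtral's real sizes).
import numpy as np

rng = np.random.default_rng(0)

HIDDEN, FFN, NUM_EXPERTS, TOP_K = 16, 64, 8, 2  # toy sizes for illustration

# Each expert is a small 2-layer MLP; the router is a single linear layer.
experts = [
    (rng.standard_normal((HIDDEN, FFN)) * 0.02,
     rng.standard_normal((FFN, HIDDEN)) * 0.02)
    for _ in range(NUM_EXPERTS)
]
router_w = rng.standard_normal((HIDDEN, NUM_EXPERTS)) * 0.02


def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-2 experts and mix their outputs."""
    out = np.zeros_like(x)
    logits = x @ router_w                              # (tokens, NUM_EXPERTS)
    top2 = np.argsort(logits, axis=-1)[:, -TOP_K:]     # best 2 experts per token
    for t in range(x.shape[0]):
        chosen = top2[t]
        # Softmax over only the selected experts' logits gives the mixing weights.
        w = np.exp(logits[t, chosen] - logits[t, chosen].max())
        w /= w.sum()
        for weight, e in zip(w, chosen):
            w_in, w_out = experts[e]
            h = np.maximum(x[t] @ w_in, 0.0)           # ReLU stand-in for SwiGLU
            out[t] += weight * (h @ w_out)
    return out


tokens = rng.standard_normal((4, HIDDEN))              # 4 toy tokens
print(moe_layer(tokens).shape)  # (4, 16): only 2 of 8 experts ran for each token
```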
Model Details
Architecture Sparse MoE
Parameters 46.7B
Active params 12.9B
Context window 32K tokens
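The total/active split above follows directly from the architecture: only 2 of the 8 expert FFNs per layer run for a given token, while attention, router, and embedding weights are always active. A back-of-the-envelope check, assuming the publicly released Mixtral 8x7B configuration (hidden size 4096, FFN size 14336, 32 layers, 32000-token vocabulary, grouped-query attention with 8 KV heads of dimension 128, untied embeddings); norm weights are omitted, so the figures are approximate.

```python
# Approximate parameter count for Mixtral 8x7B under the assumed config above.
HIDDEN, FFN, LAYERS, VOCAB = 4096, 14336, 32, 32000
EXPERTS, TOP_K = 8, 2
KV_HEADS, HEAD_DIM = 8, 128

expert_ffn = 3 * HIDDEN * FFN                       # SwiGLU: gate, up, down projections
attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_HEADS * HEAD_DIM  # Wq/Wo full, Wk/Wv grouped
router = HIDDEN * EXPERTS                           # one linear router per layer
embeddings = 2 * VOCAB * HIDDEN                     # input embeddings + untied LM head

shared = LAYERS * (attention + router) + embeddings # weights used by every token

total = shared + LAYERS * EXPERTS * expert_ffn      # all 8 experts counted
active = shared + LAYERS * TOP_K * expert_ffn       # only the top-2 experts run per token

print(f"total  ~ {total / 1e9:.1f}B")   # ~46.7B
print(f"active ~ {active / 1e9:.1f}B")  # ~12.9B
```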
Paper
arXiv: 2401.04088