Pioneering 16B-parameter Mixture-of-Experts model that activates only 2.8B parameters per token, setting the stage for DeepSeek's later focus on efficiency. Released alongside a foundational paper on expert specialization.
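The 16B-total versus 2.8B-active split comes from top-k expert routing: each token is dispatched to only a few experts, so most of the model's parameters sit idle on any given forward pass. Below is a minimal, illustrative sketch of such a routing layer in PyTorch. The layer sizes, expert count, and top-k value are assumptions chosen for readability, not the actual DeepSeekMoE configuration (which additionally uses shared experts and fine-grained expert segmentation).

```python
# Minimal sketch of top-k Mixture-of-Experts routing (not DeepSeek's actual code).
# Shows why only a small fraction of total parameters is active per token.
# d_model, d_ff, n_experts, and top_k below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_experts=64, top_k=6):
        super().__init__()
        self.top_k = top_k
        # Router scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Many small expert FFNs exist, but only top_k process any given token.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)            # (n_tokens, n_experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Each token's output is a weighted sum over its top_k selected experts only.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e                  # tokens routed to expert e
                if mask.any():
                    out[mask] += topk_scores[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = TopKMoELayer()
tokens = torch.randn(8, 512)
print(layer(tokens).shape)  # (8, 512): each token passes through 6 of 64 experts
```

With 6 of 64 experts active per token in this sketch, roughly 10% of the expert parameters do work on each forward pass; the same principle is what lets a 16B-parameter model run with about 2.8B active parameters per token.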

Outputs (2)

DeepSeek-MoE

model
Architecture: MoE
Parameters: 16B
Active params: 2.8B

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

paper

Foundational paper on expert specialization in MoE language models, introducing fine-grained expert segmentation and shared expert isolation.

arXiv: 2401.06066

Tags: moe, efficiency, open-weight, architecture