Pioneering 16B-parameter Mixture-of-Experts model that activates only 2.8B parameters per token, setting the stage for DeepSeek's later focus on efficiency. Released alongside a foundational paper on expert specialization.
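The 16B-total versus 2.8B-active split comes from top-k expert routing: each token is dispatched to only a few experts, so most of the model's parameters sit idle on any given forward pass. Below is a minimal, illustrative sketch of such a routing layer in PyTorch. The layer sizes, expert count, and top-k value are assumptions chosen for readability, not the actual DeepSeekMoE configuration (which additionally uses shared experts and fine-grained expert segmentation).

```python
# Minimal sketch of top-k Mixture-of-Experts routing (not DeepSeek's actual code).
# Shows why only a small fraction of total parameters is active per token.
# d_model, d_ff, n_experts, and top_k below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_experts=64, top_k=6):
        super().__init__()
        self.top_k = top_k
        # Router scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Many small expert FFNs exist, but only top_k process any given token.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)            # (n_tokens, n_experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Each token's output is a weighted sum over its top_k selected experts only.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e                  # tokens routed to expert e
                if mask.any():
                    out[mask] += topk_scores[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = TopKMoELayer()
tokens = torch.randn(8, 512)
print(layer(tokens).shape)  # (8, 512): each token passes through 6 of 64 experts
```

With 6 of 64 experts active per token in this sketch, roughly 10% of the expert parameters do work on each forward pass; the same principle is what lets a 16B-parameter model run with about 2.8B active parameters per token.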

Outputs (2)

DeepSeek-MoE

model
Architecture: MoE
Parameters: 16B
Active params: 2.8B

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

paper

Foundational paper on expert specialization in MoE language models, introducing fine-grained expert segmentation and shared expert isolation.

arXiv: 2401.06066

Tags: moe, efficiency, open-weight, architecture