DeepSeek-V2
model · paper — Massive 236B MoE model (21B activated per token) that introduced Multi-head Latent Attention (MLA). Accompanied by a technical report.
Outputs (2)
DeepSeek-V2 Technical Report
paper — Technical report detailing the Multi-head Latent Attention and DeepSeekMoE architecture innovations.
arXiv: 2405.04434
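
The core idea of Multi-head Latent Attention is to cache a single low-rank latent vector per token instead of full per-head keys and values, and to up-project keys and values from that latent at attention time. The PyTorch sketch below is only an illustration of that compression: all dimensions and layer names are made up here, and details of the real architecture (decoupled RoPE keys, query compression, causal masking) are omitted; see the technical report for the actual formulation.

import torch
import torch.nn as nn

class SimplifiedMLA(nn.Module):
    """Illustrative sketch of MLA-style key/value compression (not the real model)."""

    def __init__(self, d_model=512, n_heads=8, d_head=64, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.w_q = nn.Linear(d_model, n_heads * d_head, bias=False)
        # Down-project hidden states to a small shared latent; only this is cached.
        self.w_down_kv = nn.Linear(d_model, d_latent, bias=False)
        # Up-project the cached latent back to per-head keys and values.
        self.w_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.w_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.w_out = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, x, kv_cache=None):
        b, t, _ = x.shape
        q = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        latent = self.w_down_kv(x)                      # (b, t, d_latent)
        if kv_cache is not None:
            # Previously cached latents are concatenated along the sequence axis.
            latent = torch.cat([kv_cache, latent], dim=1)
        s = latent.shape[1]
        k = self.w_up_k(latent).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_up_v(latent).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.w_out(out), latent                  # latent serves as the new cache

Caching only a d_latent-wide vector per token, rather than n_heads × d_head keys plus values, is what shrinks the KV cache; the report pairs this compression with decoupled rotary position embeddings, which this sketch leaves out.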