DeepSeek-V4
model · paper

DeepSeek's most powerful model family and the first frontier-scale model trained entirely on Huawei Ascend 950PR chips, with zero NVIDIA/CUDA dependency anywhere in the stack. Two variants: V4-Pro (1.6T total / 49B active MoE) and V4-Flash (284B total / 13B active MoE). Both support 1M-token context. Pre-trained on 32T+ tokens with FP8 mixed precision (Pro); FP4+FP8 used in post-training.
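The MoE parameter counts above imply a very sparse forward pass. A minimal sketch of the arithmetic, using only the totals and active-per-token figures listed on this card (the variable and function names are illustrative, not from the report):

```python
# Parameter counts as listed on the card, in billions.
variants = {
    "V4-Pro":   {"total_b": 1600.0, "active_b": 49.0},
    "V4-Flash": {"total_b": 284.0,  "active_b": 13.0},
}

def active_fraction(total_b, active_b):
    # Fraction of weights touched per token in a sparse MoE forward pass.
    return active_b / total_b

for name, v in variants.items():
    frac = active_fraction(v["total_b"], v["active_b"])
    print(f"{name}: {frac:.1%} of parameters active per token")
```

So both variants route each token through only a few percent of their total weights, which is how a 1.6T-parameter model can be served at the cost of a ~49B dense forward pass.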
Key architectural innovations: Hybrid Attention combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), requiring only 27% of the inference FLOPs and 10% of the KV cache of V3.2. Manifold-Constrained Hyper-Connections (mHC) for stable deep signal propagation. Muon optimizer for training stability. Three reasoning modes: Non-think, Think High, and Think Max (optimal at 384K+ context).
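Muon's defining move is orthogonalizing the momentum matrix before applying it, typically via a Newton–Schulz iteration. A minimal sketch under stated assumptions: this uses the basic cubic iteration rather than the tuned quintic of production Muon implementations, the hyperparameters are illustrative, and the function names are ours, not DeepSeek's:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=10):
    """Approximate the nearest semi-orthogonal matrix to G.

    Cubic Newton-Schulz iteration X <- 1.5*X - 0.5*(X X^T)X; scaling by
    the Frobenius norm first puts all singular values inside the
    iteration's basin of convergence, so they are driven toward 1.
    """
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X

def muon_step(W, grad, buf, beta=0.95, lr=0.02):
    """One Muon-style update: accumulate momentum, step along its
    orthogonalized direction (illustrative hyperparameters)."""
    buf = beta * buf + grad
    W = W - lr * newton_schulz_orthogonalize(buf)
    return W, buf
```

The orthogonalization equalizes the update's singular values, which is the property usually credited for Muon's training stability on large matrix parameters.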
V4-Pro-Max benchmarks: LiveCodeBench 93.5% (#1), Codeforces rating 3206 (vs. GPT-5.4: 3168), IMOAnswerBench 89.8% (vs. Opus 4.6 Max: 75.3%), SWE-bench Verified 80.6%. Base model: MMLU 90.1%, C-Eval 93.1%, GSM8K 92.6%. Post-training uses a two-stage pipeline: domain-specific expert cultivation (SFT + GRPO) followed by unified on-policy distillation. Released under the MIT License.
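The GRPO stage mentioned above replaces a learned value function with group-relative reward normalization: each sampled completion's advantage is its reward z-scored against the other completions drawn for the same prompt. A minimal sketch of that normalization (the function name and epsilon are ours, not from the report):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO-style RL: z-score each
    completion's reward against its own sampling group, so no separate
    critic/value network is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

For example, a group of rewards [1.0, 2.0, 3.0] maps to advantages of roughly [-1.22, 0.0, 1.22]: above-average completions are reinforced and below-average ones penalized, relative only to their own group.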
Outputs (3):
- DeepSeek-V4-Pro (model)
- DeepSeek-V4-Flash (model)
- DeepSeek-V4 Technical Report (paper): "Towards Highly Efficient Million-Token Context Intelligence." Details the CSA+HCA hybrid attention, mHC residual connections, Muon optimizer, and the two-stage post-training pipeline.