DeepSeek's most powerful model family and the first frontier-scale model trained entirely on Huawei Ascend 950PR chips, with zero NVIDIA/CUDA dependency anywhere in the stack. Two variants: V4-Pro (1.6T total / 49B active MoE) and V4-Flash (284B total / 13B active MoE). Both support 1M-token context. Pre-trained on 32T+ tokens in FP8 mixed precision (Pro), with FP4+FP8 used in post-training.
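The "total vs. active" MoE split above implies only a small fraction of weights runs per token. A minimal sketch of that arithmetic, using only the parameter counts from this card (the function name is ours, for illustration):

```python
# Rough activation-ratio arithmetic for the two V4 variants listed above.
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of total parameters activated per token in an MoE forward pass."""
    return active_params_b / total_params_b

pro = active_fraction(1600, 49)    # V4-Pro: 1.6T total, 49B active
flash = active_fraction(284, 13)   # V4-Flash: 284B total, 13B active
print(f"V4-Pro activates {pro:.1%} of its weights per token")    # ~3.1%
print(f"V4-Flash activates {flash:.1%} of its weights per token")  # ~4.6%
```

So per-token compute scales with the ~49B/13B active slice, not the full parameter count, which is the core efficiency argument for MoE at this scale.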

Key architectural innovations: Hybrid Attention combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), requiring only 27% of the inference FLOPs and 10% of the KV cache of V3.2. Manifold-Constrained Hyper-Connections (mHC) for stable signal propagation through deep networks. Muon optimizer for training stability. Three reasoning modes: Non-think, Think High, and Think Max (optimal at 384K+ context).
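The 10%-of-KV-cache claim matters most at 1M-token context, where the cache, not the weights, dominates memory. A back-of-the-envelope estimator (the layer/head dimensions below are illustrative assumptions, not published V4 hyperparameters):

```python
def kv_cache_gib(tokens: int, layers: int, kv_heads: int, head_dim: int,
                 bytes_per_elem: int = 1) -> float:
    """K+V cache size in GiB: 2 tensors x tokens x layers x kv_heads x head_dim."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem / 2**30

# Hypothetical dense-attention baseline at 1M context, FP8 cache (1 byte/elem):
dense = kv_cache_gib(1_000_000, layers=61, kv_heads=128, head_dim=128)
hybrid = 0.10 * dense  # the card's claim: ~10% of the baseline KV cache
print(f"dense baseline ≈ {dense:.0f} GiB, hybrid attention ≈ {hybrid:.0f} GiB")
```

Whatever the true dimensions, a 10x cache reduction is the difference between a 1M-token session fitting on one accelerator versus being sharded across many.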

V4-Pro-Max benchmarks: LiveCodeBench 93.5% (#1), Codeforces rating 3206 (vs. GPT-5.4: 3168), IMOAnswerBench 89.8% (vs. Opus 4.6 Max: 75.3%), SWE-bench Verified 80.6%. Base model: MMLU 90.1%, C-Eval 93.1%, GSM8K 92.6%. Post-training uses a two-stage pipeline: domain-specific expert cultivation (SFT + GRPO) followed by unified on-policy distillation. MIT License.
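The GRPO stage mentioned above scores each policy update against other samples for the same prompt rather than against a learned value function. A minimal sketch of that group-relative advantage computation (the function name and the zero-std fallback are our choices for illustration):

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: standardize rewards across a group of
    sampled completions for the same prompt (mean 0, unit std)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # uniform group -> all-zero advantages
    return [(r - mean) / std for r in rewards]

# Four sampled answers to one prompt, scored 1.0 (correct) / 0.0 (incorrect):
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # → [1.0, -1.0, -1.0, 1.0]
```

Correct completions get positive advantage and incorrect ones negative, so the policy gradient pushes probability mass toward the better samples in each group without needing a critic model.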

Outputs (3)

DeepSeek-V4-Pro

model
Architecture MoE
Parameters 1.6T
Active params 49B
Context window 1,000,000
Training tokens 33T

DeepSeek-V4-Flash

model
Architecture MoE
Parameters 284B
Active params 13B
Context window 1,000,000
Training tokens 32T

DeepSeek-V4 Technical Report

paper

"Towards Highly Efficient Million-Token Context Intelligence." Details CSA+HCA hybrid attention, mHC residual connections, Muon optimizer, and two-stage post-training pipeline.

frontier · moe · open-weight · reasoning · coding · efficiency
