Nemotron-H
Hybrid Mamba-Transformer architecture. The 56B model has 54 Mamba-2 layers, 54 MLP layers, and 10 self-attention layers (118 total), with a hidden dimension of 8192, 64 query heads, and 8 KV heads. Largest public FP8 pre-training to date: the 56B model was trained on 20T tokens in FP8.
Delivers up to 3x faster inference than comparable Transformers (Qwen-2.5-72B, Llama-3.1-70B). The 47B variant, compressed via MiniPuzzle, is about 20% faster than the 56B model with minimal quality loss. MMLU: 84.21, ARC-C: 94.97.
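The layer mix can be summarized in a small configuration sketch (Python, illustrative only; the class and field names are assumptions, and the exact interleaving order of Mamba-2, MLP, and attention layers is specified in the paper, not here):

```python
from dataclasses import dataclass


@dataclass
class NemotronH56BConfig:
    """Layer counts and dimensions for the 56B model, as reported above.

    This sketch records totals only; it does not capture the actual
    ordering of layer types within the 118-layer stack.
    """
    num_mamba2_layers: int = 54
    num_mlp_layers: int = 54
    num_attention_layers: int = 10
    hidden_dim: int = 8192
    num_query_heads: int = 64
    num_kv_heads: int = 8  # grouped-query attention: 64 query heads share 8 KV heads

    @property
    def total_layers(self) -> int:
        return self.num_mamba2_layers + self.num_mlp_layers + self.num_attention_layers


cfg = NemotronH56BConfig()
assert cfg.total_layers == 118
assert cfg.num_query_heads % cfg.num_kv_heads == 0  # 8 query heads per KV head
```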
Model Details
Architecture: Dense
Parameters: 56B
Variants
| Name | Parameters | Notes |
|---|---|---|
| Nemotron-H-8B | 8B | — |
| Nemotron-H-47B | 47B | Compressed via MiniPuzzle |
| Nemotron-H-56B | 56B | — |
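A minimal loading sketch for one of the variants, assuming the checkpoints are published on the Hugging Face hub under a repository name like `nvidia/Nemotron-H-47B-Base-8K` (the repository name is an assumption; verify it on the hub) and that the hybrid architecture ships as custom model code requiring `trust_remote_code=True`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository name is an assumption; check the Hugging Face hub for the exact name.
repo = "nvidia/Nemotron-H-47B-Base-8K"

tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,  # hybrid Mamba-Transformer layers are custom model code
    torch_dtype="auto",
    device_map="auto",
)

inputs = tokenizer("The Nemotron-H architecture combines", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```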
Paper
arXiv: 2504.03624