OLMo 2
model"2 OLMo 2 Furious." Dense Transformers at 7B (4T tokens), 13B (5T), and 32B (6T tokens, 1.5 epochs). 4K context. Introduces Dolmino Mix late-stage curriculum training (specialized data during annealing) and model souping (merging 3 annealing runs). Training FLOPs: 1.3x10^24 for 32B.
First fully open model to outperform GPT-3.5 Turbo and GPT-4o mini (post-trained with SFT + DPO + PPO + RLVR). MMLU: 78.7 (32B base). Published at COLM 2025. Apache 2.0 license.
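The model souping step amounts to a uniform average of the parameters from the separate annealing runs. A minimal sketch of that idea is below; it assumes PyTorch-style checkpoints with identical state-dict keys, and the file names are hypothetical, not the authors' released tooling.

```python
# Minimal model-souping sketch: uniform averaging of checkpoints from
# separate annealing runs (assumption: PyTorch state dicts with matching keys).
import torch

def soup(checkpoint_paths):
    """Average parameter tensors elementwise across several checkpoints."""
    state_dicts = [torch.load(p, map_location="cpu") for p in checkpoint_paths]
    averaged = {}
    for key in state_dicts[0]:
        # Stack the same tensor from every run and take the mean over runs.
        averaged[key] = torch.stack(
            [sd[key].float() for sd in state_dicts]
        ).mean(dim=0)
    return averaged

# e.g. three annealing runs, as described above (hypothetical filenames)
souped = soup(["anneal_run1.pt", "anneal_run2.pt", "anneal_run3.pt"])
```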
Model Details
Architecture DENSE
Parameters 32B
Context window 4,096
Variants
| Name | Parameters | Notes |
|---|---|---|
| OLMo 2 7B | 7B | — |
| OLMo 2 13B | 13B | — |
| OLMo 2 32B | 32B | — |
Paper
arXiv: 2501.00656
Venue: COLM 2025