Molmo 2
Open VLM family that extends Molmo's image strengths to video and multi-image understanding. Three variants: 4B (Qwen 3 backbone), 8B (Qwen 3 backbone), and 7B-O (OLMo backbone, fully open end-to-end). 7 new video datasets and 2 multi-image datasets were collected without closed VLMs.
State-of-the-art open model for video tracking, reported to surpass Gemini 3 Pro on that task. The 8B variant surpasses the earlier 72B Molmo on image QA. Capabilities: video QA, video counting, video tracking, and point-driven grounding across single-image, multi-image, and video inputs.
Model Details
Architecture: Dense
Variants
| Name | Parameters | Notes |
|---|---|---|
| Molmo 2 4B | 4B | Qwen 3 backbone |
| Molmo 2 8B | 8B | Qwen 3 backbone |
| Molmo 2-O 7B | 7B | OLMo backbone, fully open |
Paper
arXiv: 2601.10611