"Multimodal Autoregressive Pre-training of Large Vision Encoders." Pairs a vision encoder with a multimodal decoder that generates both raw image patches and text tokens. AIMv2-3B achieves 89.5% on ImageNet-1k with frozen trunk.

Consistently outperforms CLIP and SigLIP on multimodal understanding benchmarks while scaling more efficiently, reaching state-of-the-art performance with fewer training samples. CVPR 2025 Highlight. By Fini, Shukor, Li, Susskind, El-Nouby et al. Apache 2.0.
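A minimal PyTorch sketch of the objective described above: an encoder embeds image patches, then a causal decoder autoregressively regresses the raw patches and predicts the caption tokens, with the two losses summed. All module names, sizes, and the single causal mask are illustrative assumptions, not the paper's implementation (the actual model's attention patterns and loss weighting differ in detail).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AIMv2Sketch(nn.Module):
    """Illustrative AIMv2-style pre-training: encode patches, then a causal
    decoder reconstructs raw patches and predicts caption tokens.
    Sizes and layer choices are placeholders, not the paper's config."""

    def __init__(self, patch_dim=768, d_model=512, vocab_size=32000, n_layers=4):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(make_layer(), num_layers=n_layers)
        self.decoder = nn.TransformerEncoder(make_layer(), num_layers=n_layers)
        self.patch_embed = nn.Linear(patch_dim, d_model)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.patch_head = nn.Linear(d_model, patch_dim)   # regress raw patches
        self.text_head = nn.Linear(d_model, vocab_size)   # classify tokens

    def forward(self, patches, text_ids):
        # patches: (B, N, patch_dim) raw pixel patches; text_ids: (B, T) caption.
        vis = self.encoder(self.patch_embed(patches))
        seq = torch.cat([vis, self.text_embed(text_ids)], dim=1)
        L = seq.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf"), device=seq.device), 1)
        hidden = self.decoder(seq, mask=causal)
        n = patches.size(1)
        # Next-patch regression: position t predicts raw patch t+1.
        patch_loss = F.mse_loss(self.patch_head(hidden[:, : n - 1]), patches[:, 1:])
        # Next-token prediction over the caption, conditioned on the image prefix.
        text_loss = F.cross_entropy(
            self.text_head(hidden[:, n - 1 : -1]).flatten(0, 1),
            text_ids.flatten(),
        )
        return patch_loss + text_loss

# Toy usage: 16 patches of dim 768 and a 12-token caption, batch of 2.
model = AIMv2Sketch()
loss = model(torch.randn(2, 16, 768), torch.randint(0, 32000, (2, 12)))
loss.backward()
```

Pairing a dense patch-regression loss with next-token prediction is what lets every patch and token supervise the encoder, rather than only a pooled contrastive signal as in CLIP-style training.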

Paper

arXiv: 2411.14402

Venue: CVPR 2025

vision · multimodal · open-source · foundational
