AIMv2
Paper: "Multimodal Autoregressive Pre-training of Large Vision Encoders." Pairs a large vision encoder with a multimodal decoder that autoregressively generates both raw image patches and text tokens. AIMv2-3B achieves 89.5% top-1 on ImageNet-1k with a frozen trunk.
Consistently outperforms CLIP and SigLIP on multimodal understanding benchmarks and scales more efficiently, matching or beating the state of the art with fewer training samples. CVPR 2025 Highlight. By Fini, Shukor, Li, Susskind, El-Nouby et al. Apache 2.0.
Paper
arXiv: 2411.14402
Venue: CVPR 2025
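The joint objective described above (the decoder regresses raw pixel patches and predicts text tokens autoregressively) can be sketched as a combined loss. This is a minimal illustrative NumPy sketch, not the paper's implementation: the function name, shapes, and the `alpha` weighting are assumptions; the paper's actual training uses learned networks and its own loss balancing.

```python
import numpy as np

def aimv2_style_loss(text_logits, text_targets, patch_preds, patch_targets,
                     alpha=1.0):
    """Illustrative sketch of an AIMv2-style joint objective:
    next-token cross-entropy on text plus next-patch pixel regression.
    `alpha` is a hypothetical weighting, not a value from the paper.

    text_logits:   (T, V) decoder logits over a vocabulary of size V
    text_targets:  (T,)   ground-truth next-token ids
    patch_preds:   (P, D) predicted pixel patches (flattened)
    patch_targets: (P, D) ground-truth pixel patches (flattened)
    """
    # Text branch: numerically stable softmax cross-entropy per position.
    logits = text_logits - text_logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    ce = -log_probs[np.arange(len(text_targets)), text_targets].mean()
    # Image branch: L2 regression of predicted patches against raw pixels.
    mse = ((patch_preds - patch_targets) ** 2).mean()
    return ce + alpha * mse
```

The key point the sketch mirrors is that both modalities are supervised by the same autoregressive decoder: text via cross-entropy, image patches via direct pixel regression, so the encoder gets a dense learning signal from every image.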