Open vision-language model (VLM) family with fully open training data (PixMo). Pioneered image pointing capabilities. Trained in two stages: dense-captioning pre-training, followed by supervised fine-tuning for QA, document reading, and pointing. Closes the gap between open and proprietary multimodal systems. Published at CVPR 2025.

Model Details

Architecture: Dense

Paper

arXiv: 2409.17146

Venue: CVPR 2025

Tags: multimodal, vision, open-source, open-weight

Related