First open-source multimodal LLM to surpass 70% on the MMMU benchmark. The paper investigates the scaling of vision encoders, language models, and datasets, as well as test-time Chain-of-Thought reasoning. It also retroactively documents InternVL 2.0.

Model Details

Architecture: Dense

Variants

Name             Parameters  Notes
InternVL2_5-1B   1B
InternVL2_5-8B   8B
InternVL2_5-38B  38B
InternVL2_5-78B  78B

Paper

arXiv: 2412.05271

multimodal, open-weight, vision, frontier, scaling

Related