Introduces dynamic high-resolution processing (inputs up to 4K resolution, split into 1 to 40 tiles of 448×448 pixels), a continuous learning strategy for the InternViT-6B vision encoder, and a high-quality bilingual training dataset. Reports state-of-the-art results on 8 of 18 benchmarks, narrowing the gap to GPT-4V.
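
The tiling idea behind dynamic high-resolution processing can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `choose_grid` and `tile_boxes` are hypothetical, and the tie-breaking rule (prefer the grid whose pixel area is closest to the image's) is an assumption made for illustration. The sketch only uses the figures stated above (1 to 40 tiles of 448×448) and assumes the image is resized to the chosen grid before cropping.

```python
TILE = 448       # tile side length in pixels, as stated in the summary
MAX_TILES = 40   # upper bound on the number of tiles, as stated in the summary


def choose_grid(width, height, max_tiles=MAX_TILES):
    """Pick (cols, rows) with cols * rows <= max_tiles whose aspect
    ratio is closest to the image's aspect ratio.

    Assumption: ties are broken by preferring the grid whose total
    pixel area is closest to the original image area.
    """
    aspect = width / height
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]

    def cost(grid):
        cols, rows = grid
        ratio_gap = abs(aspect - cols / rows)
        area_gap = abs(cols * rows * TILE * TILE - width * height)
        return (ratio_gap, area_gap)

    return min(candidates, key=cost)


def tile_boxes(width, height, max_tiles=MAX_TILES):
    """Return crop boxes (left, top, right, bottom) for each tile,
    assuming the image was first resized to (cols * TILE, rows * TILE)."""
    cols, rows = choose_grid(width, height, max_tiles)
    return [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    cols, rows = choose_grid(3840, 2160)
    print(f"grid: {cols} x {rows}, tiles: {cols * rows}")
```

Under these assumptions, a small square image maps to a single tile, while a wide 4K frame maps to a grid with many tiles, which is how the tile count adapts to resolution and aspect ratio.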

Model Details

Architecture: Dense
Parameters: 26B

Paper

arXiv: 2404.16821

Tags: multimodal, open-weight, vision

Related