InternVL: Scaling up Vision Foundation Models

Vision-language foundation model that scales the Vision Transformer (ViT) to 6B parameters (InternViT-6B) and aligns it with an LLM using web-scale image-text data. Achieves state-of-the-art results on 32 visual-linguistic benchmarks. Published at CVPR 2024 as an oral presentation.
