InternVL: Scaling up Vision Foundation Models
Foundational vision-language model that scaled the vision transformer (ViT) to 6B parameters (InternViT-6B) and aligned it with an LLM using web-scale image-text data. Achieved state-of-the-art results on 32 visual-linguistic benchmarks. Published at CVPR 2024 as an oral presentation.
Model Details
Architecture: Dense
Parameters: 6B
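As a rough sanity check on the 6B figure, the sketch below counts the weights in a standard dense transformer stack. The width, depth, and MLP size used here (3200, 48, 12800) are assumptions taken from the InternViT-6B configuration as reported in the paper, not values stated on this page, and the function names are hypothetical. The count covers only the attention and MLP weight matrices; biases, normalization layers, and patch/position embeddings are ignored.

```python
def vit_block_params(width: int, mlp_dim: int) -> int:
    """Weight-only parameter count for one standard transformer block."""
    attention = 4 * width * width   # Q, K, V and output projection matrices
    mlp = 2 * width * mlp_dim       # up-projection and down-projection matrices
    return attention + mlp


def vit_params(width: int, depth: int, mlp_dim: int) -> int:
    """Weight-only parameter count for a stack of `depth` identical blocks."""
    return depth * vit_block_params(width, mlp_dim)


# Assumed configuration (from the paper, not from this card).
ASSUMED_WIDTH = 3200
ASSUMED_DEPTH = 48
ASSUMED_MLP_DIM = 12800

total = vit_params(ASSUMED_WIDTH, ASSUMED_DEPTH, ASSUMED_MLP_DIM)
print(f"{total / 1e9:.2f}B")  # prints "5.90B"
```

Under these assumptions the block weights alone come to about 5.9B, which is consistent with the rounded 6B listed above.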
Paper
arXiv: 2312.14238
Venue: CVPR 2024