BLIP-2
model"Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models." Introduces Q-Former, a lightweight Transformer that bridges frozen image encoders and LLMs. Achieves SOTA on visual QA, captioning, and image-text retrieval.
BLIP-2 demonstrated efficient vision-language pre-training by leveraging frozen pre-trained models rather than training the full stack end to end. Published at ICML 2023 and later extended by xGen-MM (BLIP-3). By Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. A minimal inference sketch follows below.
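For reference, a minimal inference sketch using the Hugging Face `transformers` port of BLIP-2. The checkpoint name `Salesforce/blip2-opt-2.7b` and the local image path are assumptions not taken from this entry; treat this as an illustrative usage example, not the authors' original codebase.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# The checkpoint bundles the frozen ViT image encoder, the Q-Former bridge,
# and the frozen OPT language model (assumed checkpoint name).
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

image = Image.open("example.jpg").convert("RGB")  # hypothetical local image

# Image captioning: with no text prompt, the model generates a description.
inputs = processor(images=image, return_tensors="pt").to(device, dtype)
caption_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(caption_ids, skip_special_tokens=True)[0].strip())

# Visual QA: pose a question using the "Question: ... Answer:" prompt format.
prompt = "Question: what is shown in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, dtype)
answer_ids = model.generate(**inputs, max_new_tokens=10)
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0].strip())
```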
Paper
arXiv: 2301.12597
Venue: ICML 2023