BLIP-2
model"Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models." Introduces Q-Former, a lightweight Transformer that bridges frozen image encoders and LLMs. Achieves SOTA on visual QA, captioning, and image-text retrieval.
BLIP-2 demonstrated efficient vision-language pre-training by leveraging frozen pre-trained models rather than training the full stack end to end. Published at ICML 2023 and later extended by xGen-MM (BLIP-3). By Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. A minimal inference sketch follows below.
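For reference, a minimal inference sketch using the Hugging Face `transformers` port of BLIP-2. The checkpoint name `Salesforce/blip2-opt-2.7b` and the local image path are assumptions not taken from this entry; treat this as an illustrative usage example, not the authors' original codebase.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# The checkpoint bundles the frozen ViT image encoder, the Q-Former bridge,
# and the frozen OPT language model (assumed checkpoint name).
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

image = Image.open("example.jpg").convert("RGB")  # hypothetical local image

# Image captioning: with no text prompt, the model generates a description.
inputs = processor(images=image, return_tensors="pt").to(device, dtype)
caption_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(caption_ids, skip_special_tokens=True)[0].strip())

# Visual QA: pose a question using the "Question: ... Answer:" prompt format.
prompt = "Question: what is shown in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, dtype)
answer_ids = model.generate(**inputs, max_new_tokens=10)
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0].strip())
```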
Paper
arXiv: 2301.12597
Venue: ICML 2023