Large-scale multimodal instruction dataset with 40M+ samples for training vision-language models. Diversity and accuracy are ensured through quality filtering and deduplication. Includes a synthetic instruction generation method based on a tagging system and open-source VLMs. Used to train Aquila-VL-2B, which achieves state-of-the-art performance among models of the same scale. Paper updated January 6, 2025.
dataset · multimodal · training · open-source