Flexible open-source framework for comprehensive multimodal model evaluation across vision-language understanding and generation tasks (VQA, text-to-image/video, image-text retrieval). Key design: it decouples model inference from evaluation through an independent evaluation service, enabling flexible resource allocation and seamless integration of new tasks. It uses vLLM and SGLang to accelerate inference. Presented at the ACL 2025 demo track. Supports 15+ benchmarks, including MMMU, MathVision, and MMVET-v2.
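A minimal sketch of the decoupled design described above, assuming a simple JSON handoff between the two sides; all names here are illustrative, not the framework's actual API. The inference side only produces predictions, while an independent evaluation service consumes them and computes metrics, so the two can scale on separate hardware:

```python
import json

def run_inference(samples):
    """Stand-in for a vLLM/SGLang-backed model; returns raw predictions.
    (Here it just uppercases the question as a dummy 'answer'.)"""
    return [{"id": s["id"], "prediction": s["question"].upper()} for s in samples]

def evaluation_service(payload):
    """Independent scorer: consumes a serialized payload, returns metrics.
    In the real system this would run as a separate service/process."""
    records = json.loads(payload)
    correct = sum(r["prediction"] == r["answer"] for r in records)
    return {"accuracy": correct / len(records)}

samples = [{"id": 0, "question": "hi", "answer": "HI"},
           {"id": 1, "question": "no", "answer": "nope"}]

preds = run_inference(samples)
# Join predictions with references and ship them to the evaluation
# service; only this JSON boundary couples the two components.
merged = [{**p, "answer": s["answer"]} for p, s in zip(preds, samples)]
metrics = evaluation_service(json.dumps(merged))
print(metrics)  # {'accuracy': 0.5}
```

Because the evaluation side only sees serialized predictions, adding a new task means registering a new scorer in the service, with no change to the inference workers.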

Library

GitHub Repository

evaluation · multimodal · open-source · benchmark
