Flexible open-source framework for comprehensive multimodal model evaluation across vision-language understanding and generation tasks (VQA, text-to-image/video, image-text retrieval). Key design: it decouples model inference from evaluation through an independent evaluation service, enabling flexible resource allocation and seamless integration of new tasks. It uses vLLM and SGLang to accelerate inference. Presented at the ACL 2025 demo track. Supports 15+ benchmarks, including MMMU, MathVision, and MMVET-v2.
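A minimal sketch of the decoupled design described above, assuming a simple JSON handoff between the two sides; all names here are illustrative, not the framework's actual API. The inference side only produces predictions, while an independent evaluation service consumes them and computes metrics, so the two can scale on separate hardware:

```python
import json

def run_inference(samples):
    """Stand-in for a vLLM/SGLang-backed model; returns raw predictions.
    (Here it just uppercases the question as a dummy 'answer'.)"""
    return [{"id": s["id"], "prediction": s["question"].upper()} for s in samples]

def evaluation_service(payload):
    """Independent scorer: consumes a serialized payload, returns metrics.
    In the real system this would run as a separate service/process."""
    records = json.loads(payload)
    correct = sum(r["prediction"] == r["answer"] for r in records)
    return {"accuracy": correct / len(records)}

samples = [{"id": 0, "question": "hi", "answer": "HI"},
           {"id": 1, "question": "no", "answer": "nope"}]

preds = run_inference(samples)
# Join predictions with references and ship them to the evaluation
# service; only this JSON boundary couples the two components.
merged = [{**p, "answer": s["answer"]} for p, s in zip(preds, samples)]
metrics = evaluation_service(json.dumps(merged))
print(metrics)  # {'accuracy': 0.5}
```

Because the evaluation side only sees serialized predictions, adding a new task means registering a new scorer in the service, with no change to the inference workers.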

Library

GitHub Repository

evaluation · multimodal · open-source · benchmark
