Video-SafetyBench

The first comprehensive safety benchmark for video large vision-language models (LVLMs). It contains 2,264 video-text pairs covering 13 unsafe categories and 48 fine-grained subcategories; each pair combines a synthesized video of about 10 seconds with either a harmful or a benign query. The benchmark also introduces RJScore (RiskJudgeScore), an LLM-based metric that uses the judge model's token-level logit distributions to capture its confidence and align with human safety judgments.
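The idea of scoring from a judge's logit distribution, rather than from its single sampled answer, can be illustrated with a minimal sketch. It assumes the judge emits one score token on a 1 to 5 risk scale and that the final score is the probability-weighted mean over those score tokens. The scale, the function name `expected_risk_score`, and the example logits are illustrative assumptions, not details taken from the paper.

```python
import math


def expected_risk_score(score_logits):
    """Return the probability-weighted mean score.

    score_logits: dict mapping an integer score (assumed 1..5 here) to the
    judge model's logit for the token representing that score.
    """
    if not score_logits:
        raise ValueError("score_logits must not be empty")

    # Numerically stable softmax restricted to the candidate score tokens.
    max_logit = max(score_logits.values())
    weights = {s: math.exp(l - max_logit) for s, l in score_logits.items()}
    total = sum(weights.values())

    # Expectation of the score under the judge's distribution.
    return sum(s * w / total for s, w in weights.items())


# Hypothetical logits: the judge leans toward "4" but is not certain.
example_logits = {1: -2.0, 2: -1.0, 3: 0.5, 4: 2.0, 5: 1.0}
print(round(expected_risk_score(example_logits), 3))
```

Compared with taking only the most likely score, an expectation of this kind separates a judge that is confident in its answer from one that is nearly split between two adjacent scores.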

Joint work between BAAI FlagEval and Beijing University of Posts and Telecommunications. Accepted to NeurIPS 2025 Datasets & Benchmarks track.

Links: Paper (arXiv), GitHub, HuggingFace dataset, OpenReview

Paper

Venue: NeurIPS 2025 D&B


Evaluation Details

Questions: 2,264

Domains: 2 (video safety, video-text multimodal)

Scoring: RJScore (LLM-judge logit distributions)

Tags: evaluation, benchmark, safety, multimodal
