IFBench provides 58 diverse, programmatically verifiable output constraints for evaluating instruction-following generalization. Core finding: models strongly overfit to the 25 constraints in the older IFEval benchmark and fail on unseen constraint types. IFBench spans 7 constraint categories (count, ratio, words, sentence, format, custom, copy), combined with WildChat prompts to produce 294 evaluation tasks; every constraint is verified automatically by code.
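To illustrate what "verified automatically by code" means in practice, here is a minimal sketch of per-category verification functions; the function names and exact constraint wording are hypothetical, not IFBench's actual code:

```python
import re

# Hypothetical verifiers for two IFBench-style constraint categories.
# These are illustrative sketches, not the benchmark's real implementation.

def verify_max_words(response: str, limit: int) -> bool:
    """'words' category: response must use at most `limit` words."""
    return len(response.split()) <= limit

def verify_sentence_count(response: str, n: int) -> bool:
    """'sentence' category: response must contain exactly `n` sentences."""
    sentences = [s for s in re.split(r"[.!?]+", response) if s.strip()]
    return len(sentences) == n

print(verify_max_words("Short and compliant answer.", 10))  # True
print(verify_sentence_count("One. Two. Three.", 3))         # True
```

Because each check is deterministic code rather than a judge model, compliance can be scored exactly and reused as a training signal.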

Also demonstrates that RLVR (reinforcement learning with verifiable rewards) substantially improves constraint compliance. Includes 29 hand-annotated training constraints with verification functions and RLVR prompts. Used in the Artificial Analysis Intelligence Index v4.0 (6.25% weight in the General category). NeurIPS 2025 Datasets & Benchmarks. By Pyatkin, Malik, Graf, Ivison, Huang, Dasigi, Lambert, and Hajishirzi (Ai2).
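The RLVR setup can be sketched in a few lines: the verifier's boolean output becomes the scalar reward used for policy optimization. The helper below is an illustrative stand-in, not the paper's training code:

```python
# Hedged sketch of the RLVR reward: a verification function's pass/fail
# result is mapped to a scalar reward. `verify` is any per-constraint checker.

def constraint_reward(response: str, verify) -> float:
    """Binary verifiable reward: 1.0 if the constraint holds, else 0.0."""
    return 1.0 if verify(response) else 0.0

# Example: reward a response for a 'format'-style constraint (no commas).
no_commas = lambda text: "," not in text
print(constraint_reward("A clean answer without commas", no_commas))  # 1.0
print(constraint_reward("But this one, sadly, has them", no_commas))  # 0.0
```

Since the reward is computed by code, it is cheap, exact, and immune to reward-model drift, which is what makes the 29 annotated training constraints usable for RL.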

Paper

Venue: NeurIPS 2025
benchmark · evaluation · alignment