IFBench is a benchmark of 58 diverse, programmatically verifiable output constraints for evaluating instruction-following generalization. Core finding: models strongly overfit to the 25 constraints in the older IFEval benchmark and fail on unseen constraint types. IFBench introduces constraints across 7 categories (count, ratio, words, sentence, format, custom, copy) and combines them with WildChat prompts to produce 300 evaluation prompts (Artificial Analysis runs a 294-prompt subset in its harness). All constraints are verified automatically by code.
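To illustrate what "verified automatically by code" means, a verification function for a count-style constraint might look like the sketch below. The constraint (a word-count range), the function name `check_word_count_range`, and its parameters are hypothetical and are not taken from the IFBench code.

```python
import re


def check_word_count_range(response: str, min_words: int, max_words: int) -> bool:
    """Return True if the response has between min_words and max_words words.

    Hypothetical example of a "count" constraint; not an actual IFBench verifier.
    """
    # Treat each run of word characters as one word.
    words = re.findall(r"\b\w+\b", response)
    return min_words <= len(words) <= max_words


# Usage: the verifier returns a plain boolean, so no human or model judge is needed.
ok = check_word_count_range("Keep the answer short and direct.", 3, 10)
too_short = check_word_count_range("No.", 3, 10)
```

Because each check is a deterministic function of the response text, the same function could in principle serve both as an evaluation metric and as a reward signal.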

The paper also demonstrates that RLVR (reinforcement learning with verifiable rewards) substantially improves constraint compliance. It includes 29 hand-annotated training constraints with verification functions and RLVR prompts. IFBench was a component of the Artificial Analysis Intelligence Index v4.0 (6.25% weight in the General category) but was removed in v4.1, although AA continues to run it on new model releases. Published at NeurIPS 2025 Datasets & Benchmarks. By Pyatkin, Malik, Graf, Ivison, Huang, Dasigi, Lambert, and Hajishirzi (Ai2).

Paper

Venue NeurIPS 2025
Citations 1

Evaluation Details

Questions 300
Tasks 58
Domains 1
Scoring programmatic constraint verification functions; prompt-level strict and loose accuracy (paper reports prompt-level loose accuracy, temperature 0)
Used in: AA Intelligence Index v4.0 (removed in v4.1)
Domains: instruction following
Tags: benchmark, evaluation, alignment
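The prompt-level strict and loose accuracy named under Scoring can be sketched as follows. This is a minimal sketch that assumes IFBench follows the IFEval convention: a prompt counts as correct only if every one of its constraints passes, and loose scoring also accepts a constraint if it passes on a lightly edited copy of the response (first line removed, last line removed, or asterisks stripped). The names `loose_variants`, `prompt_passes`, and `prompt_level_accuracy` are illustrative, not the benchmark's API.

```python
from typing import Callable, List, Tuple

Checker = Callable[[str], bool]  # one verification function per constraint


def loose_variants(response: str) -> List[str]:
    """Build the relaxed copies of a response used for loose scoring (assumed, IFEval-style)."""
    lines = response.split("\n")
    base = [
        response,
        "\n".join(lines[1:]).strip(),    # drop first line (e.g. a preamble)
        "\n".join(lines[:-1]).strip(),   # drop last line (e.g. a sign-off)
        "\n".join(lines[1:-1]).strip(),  # drop both
    ]
    # Also try every variant with markdown asterisks removed.
    return base + [v.replace("*", "") for v in base]


def prompt_passes(response: str, checkers: List[Checker], loose: bool) -> bool:
    """A prompt passes only if every constraint is satisfied."""
    candidates = loose_variants(response) if loose else [response]
    return all(
        any(check(c) for c in candidates if c.strip())  # skip empty variants
        for check in checkers
    )


def prompt_level_accuracy(records: List[Tuple[str, List[Checker]]], loose: bool) -> float:
    """Fraction of prompts whose responses satisfy all of their constraints."""
    passed = sum(prompt_passes(resp, checks, loose) for resp, checks in records)
    return passed / len(records)


# Usage with a hypothetical constraint: the response must start with "hello".
starts_with_hello: Checker = lambda r: r.startswith("hello")
records = [
    ("hello world", [starts_with_hello]),
    ("Sure, here it is:\nhello world", [starts_with_hello]),
]
strict_acc = prompt_level_accuracy(records, loose=False)
loose_acc = prompt_level_accuracy(records, loose=True)
```

In this toy example the second response fails strict scoring because of its preamble line but passes loose scoring once that line is dropped, which is the kind of formatting noise loose accuracy is meant to forgive.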