IFBench
A benchmark of 58 diverse, programmatically verifiable output constraints for evaluating instruction-following generalization. Core finding: models strongly overfit to the 25 constraints in the older IFEval benchmark and fail on unseen constraint types. IFBench introduces constraints across 7 categories (count, ratio, words, sentence, format, custom, copy) and combines them with WildChat prompts to produce 300 evaluation prompts (Artificial Analysis runs a 294-prompt subset in its harness). All constraints are verified automatically by code.
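To make "verified automatically by code" concrete, here is a minimal sketch of what verifiers for three of the categories (count, sentence, copy) could look like. The function names, signatures, and the sentence-splitting rule are hypothetical illustrations, not the verification functions shipped with IFBench.

```python
import re


def check_word_count(response: str, min_words: int, max_words: int) -> bool:
    """Count-style constraint (hypothetical): word count lies in [min_words, max_words]."""
    n_words = len(response.split())
    return min_words <= n_words <= max_words


def check_sentence_count(response: str, expected: int) -> bool:
    """Sentence-style constraint (hypothetical): exactly `expected` sentences.

    Sentences are approximated by splitting on runs of '.', '!' or '?',
    which is a simplification and will miscount abbreviations.
    """
    sentences = [s for s in re.split(r"[.!?]+", response) if s.strip()]
    return len(sentences) == expected


def check_copy(response: str, span: str) -> bool:
    """Copy-style constraint (hypothetical): the response repeats `span` verbatim."""
    return span in response
```

Each check returns a plain boolean, so a prompt's score can be computed without a human or model judge.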
Also demonstrates that RLVR (reinforcement learning with verifiable rewards) substantially improves constraint compliance. Includes 29 hand-annotated training constraints with verification functions and RLVR prompts. Was a component of the Artificial Analysis (AA) Intelligence Index v4.0 (6.25% weight in the General category) and was removed in v4.1, though AA continues to run it on new model releases. NeurIPS 2025 Datasets & Benchmarks. By Pyatkin, Malik, Graf, Ivison, Huang, Dasigi, Lambert, and Hajishirzi (Ai2).
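As an illustration of how verification functions can act as an RLVR reward signal, the sketch below turns a list of boolean verifiers into a scalar reward. The entry does not state how the reward is computed, so both the all-or-nothing rule and the fractional alternative are assumptions, and all names here are hypothetical.

```python
from typing import Callable, Sequence

# A verifier takes a model response and reports whether one constraint holds.
Verifier = Callable[[str], bool]


def verifiable_reward(response: str, verifiers: Sequence[Verifier]) -> float:
    """Assumed all-or-nothing reward: 1.0 only if every constraint passes."""
    if not verifiers:
        raise ValueError("at least one verifier is required")
    return 1.0 if all(v(response) for v in verifiers) else 0.0


def fraction_satisfied(response: str, verifiers: Sequence[Verifier]) -> float:
    """Assumed softer alternative: share of constraints that pass."""
    if not verifiers:
        raise ValueError("at least one verifier is required")
    return sum(v(response) for v in verifiers) / len(verifiers)


# Two toy constraints: at most five words, and the response ends with "!".
example_verifiers = [
    lambda r: len(r.split()) <= 5,
    lambda r: r.endswith("!"),
]
```

Because the reward depends only on deterministic checks, it can be recomputed exactly for any sampled response during training.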