Demonstrated that process reward models (PRMs), which score each step of a reasoning chain, significantly outperform outcome reward models (ORMs), which score only the final answer. Released PRM800K, a dataset of 800K step-level human feedback labels on model-generated solutions to MATH problems.
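
A minimal sketch of the PRM-vs-ORM distinction, assuming generic scorer callables (all names below are hypothetical placeholders, not the paper's code):

```python
from typing import Callable, List

# Hypothetical scorer types; any model mapping text -> probability fits here.
FinalScorer = Callable[[str, str], float]        # (problem, final answer) -> P(correct)
StepScorer = Callable[[str, List[str]], float]   # (problem, steps so far) -> P(step correct)

def orm_score(problem: str, steps: List[str], score_answer: FinalScorer) -> float:
    """Outcome reward model: a single score for the final answer only."""
    return score_answer(problem, steps[-1])

def prm_score(problem: str, steps: List[str], score_step: StepScorer) -> float:
    """Process reward model: score every step, then aggregate.
    The paper aggregates by taking the product of per-step correctness
    probabilities, i.e. the probability that every step is correct."""
    p_all_correct = 1.0
    for i in range(len(steps)):
        p_all_correct *= score_step(problem, steps[: i + 1])
    return p_all_correct
```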

Process supervision achieved 78.2% on a representative subset of MATH (vs. 72.4% for outcome supervision), using best-of-N sampling (up to N = 1,860) over GPT-4-generated solutions. This work was a key precursor to the o1 and o3 reasoning models, establishing that rewarding correct reasoning steps, not just correct answers, is critical for reliable mathematical reasoning. By Lightman, Kosaraju, Burda et al.
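
Best-of-N selection with a reward model is simple to sketch; this is an illustrative outline in which `generate_solution` and `reward` are hypothetical placeholders, with a PRM-style scorer plugged in as `reward`:

```python
from typing import Callable, List

def best_of_n(problem: str,
              generate_solution: Callable[[str], List[str]],
              reward: Callable[[str, List[str]], float],
              n: int) -> List[str]:
    """Sample n candidate solutions and keep the one the reward model
    ranks highest. The paper evaluated this procedure at up to n = 1,860
    samples per problem."""
    candidates = [generate_solution(problem) for _ in range(n)]
    return max(candidates, key=lambda steps: reward(problem, steps))
```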

Paper: arXiv:2305.20050

Tags: reasoning, alignment, foundational