Proposes CodePRM, a process reward model that uses code execution feedback to score individual reasoning steps during code generation. It collects thought traces in which each step is labeled with the pass rate of the code derived from it, together with the code snippets themselves, then trains a PRM that takes both the reasoning process and the execution feedback as input.
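A rough sketch of what one labeled training example might look like under this scheme. The class name, fields, and data format are my illustrative assumptions, not the paper's actual schema; the point is only that each reasoning step carries both its derived code and that code's pass rate as the supervision signal.

```python
from dataclasses import dataclass

@dataclass
class StepExample:
    thought: str      # one step of the reasoning trace
    code: str         # code snippet derived from this step
    pass_rate: float  # fraction of unit tests the snippet passes (label)

# A toy trace: the second step's code is wrong, so it gets a low label.
trace = [
    StepExample("Parse the input into ints", "xs = list(map(int, s.split()))", 1.0),
    StepExample("Sort ascending", "xs.reverse()", 0.0),
]
```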

Introduces the Generate-Verify-Refine (GVR) inference pipeline, in which CodePRM serves as a process verifier that dynamically identifies and corrects errors in the reasoning chain during code search. Outperforms strong baselines on code generation benchmarks. ACL 2025 Findings. By Li, Dai, Li, Zhang, Wang, Tang, Yu (SJTU + Huawei Noah's Ark Lab).
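The GVR control flow can be sketched as a loop that scores each step with the PRM and re-generates low-scoring steps before moving on. Everything here is a stand-in: `prm_score`, `run_tests`, and `refine` are stub functions I made up to illustrate the loop, not the paper's models (the real verifier is a trained PRM and the refiner re-prompts the LLM with the verdict).

```python
def run_tests(code: str) -> float:
    # Stub executor: pretend pass rate of the snippet on unit tests.
    return 1.0 if "sorted" in code else 0.0

def prm_score(thought: str, pass_rate: float) -> float:
    # Stub PRM: the real model scores (reasoning step, execution feedback).
    return pass_rate

def refine(thought: str, code: str) -> str:
    # Stub refiner: the real system re-prompts the LLM using the PRM verdict.
    return code.replace("xs[::-1]", "sorted(xs, reverse=True)")

def generate_verify_refine(steps, threshold=0.6, max_refines=2):
    """Walk the reasoning chain; refine any step the PRM scores below threshold."""
    verified = []
    for thought, code in steps:
        score = prm_score(thought, run_tests(code))
        for _ in range(max_refines):
            if score >= threshold:
                break
            code = refine(thought, code)
            score = prm_score(thought, run_tests(code))
        verified.append((thought, code, score))
    return verified

steps = [
    ("sort ascending", "result = sorted(xs)"),
    ("reverse the list", "result = xs[::-1]"),  # flagged and refined
]
out = generate_verify_refine(steps)
```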

Paper

Venue: ACL 2025 Findings
coding, reasoning, reinforcement-learning