Studies how LLMs can use self-generated tests to debug their own code, comparing two paradigms: post-execution self-debugging (generate code, run tests, fix) and in-execution self-debugging (use intermediate execution states to guide correction). Finds that post-execution debugging actually hurts performance on basic problems due to test generation bias (models generate unreliable tests ~56-59% of the time), but in-execution debugging mitigates this by grounding corrections in actual runtime state.
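The post-execution loop the paper critiques can be sketched as below. This is a minimal, hypothetical illustration, not the paper's implementation: `generate_code`, `generate_tests`, and `revise_code` stand in for LLM calls, and the candidate solution is assumed to expose a `solve` function. The key hazard the paper identifies lives in `generate_tests`: if the self-generated tests are unreliable, the revision step can be triggered even on correct code.

```python
def run_tests(code: str, tests: list) -> list:
    """Execute each (args, expected) test against the candidate's `solve` function."""
    namespace = {}
    exec(code, namespace)
    func = namespace["solve"]
    failures = []
    for args, expected in tests:
        if func(*args) != expected:
            failures.append((args, expected))
    return failures


def post_execution_self_debug(generate_code, generate_tests, revise_code,
                              max_rounds: int = 3) -> str:
    """Generate code, run self-generated tests, and revise on failures.

    The paper's finding: because the tests themselves are model-generated
    (and unreliable ~56-59% of the time), this loop can degrade correct
    solutions on basic problems.
    """
    code = generate_code()
    tests = generate_tests()  # self-generated, possibly wrong
    for _ in range(max_rounds):
        failures = run_tests(code, tests)
        if not failures:
            return code
        code = revise_code(code, failures)
    return code
```

In-execution debugging, by contrast, would inspect intermediate runtime state (e.g. variable values mid-trace) rather than relying solely on pass/fail signals from self-generated tests.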

Evaluated on GPT-4o, Claude-3.5-Sonnet, Llama-3-70B, and Qwen2.5-Coder-7B across HumanEval, MBPP, and LiveCodeBench. In-execution debugging yields consistent gains (+1.2% HumanEval, +1.4% MBPP on GPT-4o) while post-execution debugging often degrades performance. Practical finding: self-debugging only reliably helps on competitive-level problems, not basic ones. By Chen, Tao, Zhang, Zhou, Gu, He, Zhang, Cai, Zhao, Jin (Meituan + Peking University + Beijing Institute of Technology).

Paper

codingresearch