The Illusion of Thinking
paper
Systematic study of Large Reasoning Models (LRMs) using controllable puzzle environments with tunable complexity. Reveals that LRMs suffer complete accuracy collapse beyond certain complexity thresholds and exhibit a counterintuitive pattern: reasoning effort initially increases with problem difficulty, then declines despite ample remaining compute budget; the model "gives up" before exhausting its thinking tokens.
Identifies three performance regimes: (1) at low complexity, standard models suffice; (2) at medium complexity, LRMs show a clear advantage; (3) at high complexity, both collapse with no meaningful gap between them. Also finds that LRMs struggle with exact computation and reason inconsistently across puzzle scales, suggesting current chain-of-thought reasoning is more brittle than benchmark scores imply. NeurIPS 2025. By Shojaee, Mirzadeh, Alizadeh, Horton, Bengio, and Farajtabar (Apple).
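The specific puzzles are not named in this note; as an illustrative assumption, a "controllable puzzle environment with tunable complexity" can be sketched as Tower of Hanoi, where a single knob (disk count n) scales the optimal solution length as 2^n − 1 and a verifier checks any candidate move sequence, e.g. one parsed from a model's output:

```python
# Hypothetical sketch of a tunable-complexity puzzle environment in the
# paper's spirit (Tower of Hanoi is an assumption, not named in this note).

def hanoi_moves(n, src=0, dst=2, aux=1):
    """Return the optimal move sequence for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, aux, dst)
            + [(src, dst)]
            + hanoi_moves(n - 1, aux, dst, src))

def verify(n, moves):
    """Check a candidate move list against the rules and the goal state."""
    pegs = [list(range(n, 0, -1)), [], []]  # largest disk at bottom, top is last
    for frm, to in moves:
        # Illegal: moving from an empty peg, or onto a smaller disk.
        if not pegs[frm] or (pegs[to] and pegs[to][-1] < pegs[frm][-1]):
            return False
        pegs[to].append(pegs[frm].pop())
    return pegs[2] == list(range(n, 0, -1))  # all disks on the target peg

# Complexity knob: optimal length doubles with each extra disk (2**n - 1).
for n in range(1, 6):
    m = hanoi_moves(n)
    assert len(m) == 2**n - 1 and verify(n, m)
```

This is only a minimal stand-in for the paper's setup, but it shows why such environments are attractive: difficulty is tunable along one axis, and correctness is mechanically checkable rather than judged against benchmark answer keys.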