Introduces Think-Anywhere, enabling LLMs to invoke reasoning blocks at arbitrary token positions during code generation, rather than only before implementation begins. Unlike "upfront thinking" (reasoning before code) or fixed interleaving (reasoning at each step), Think-Anywhere adaptively allocates computational resources based on the immediate complexity of the code being generated.

Uses a two-stage training pipeline: cold-start SFT on ~5,000 samples constructed via Gemini 2.5 Flash, followed by reinforcement learning with verifiable rewards (RLVR) using GRPO with hierarchical rewards measuring both reasoning structure adherence and code correctness. Introduces semantic-aware trigger token initialization combining conceptual meaning with structural delimiter roles.
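The hierarchical reward can be sketched as a gated combination, assuming (my notation, not the paper's) that structure adherence must hold before the correctness signal counts:

```python
def hierarchical_reward(output: str, tests_passed: float) -> float:
    """Toy gated RLVR reward: structure adherence first, then code correctness.
    The gate, weights, and penalty value are illustrative assumptions."""
    # Structure check: reasoning blocks must be present and balanced.
    opens, closes = output.count("<think>"), output.count("</think>")
    if opens != closes or opens == 0:
        return -1.0                      # malformed traces penalized outright
    # Correctness: fraction of verifiable unit tests passed by extracted code.
    return 0.2 + 0.8 * tests_passed     # small bonus for valid structure

print(hierarchical_reward("<think>plan</think> code", 1.0))  # -> 1.0
print(hierarchical_reward("no reasoning blocks", 0.5))       # -> -1.0
```

In GRPO, rewards like this would be computed per rollout and advantages normalized within each sampled group.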

Results on Qwen3-Coder-30B-A3B: LeetCode 69.4% (+18.8pp over base), HumanEval 91.5% (+3.1pp), MBPP 82.9% (+12.2pp), LiveCodeBench 37.2% (+2.9pp). Also transfers to mathematical reasoning: AIME 2024 17.3% (vs 5.3% baseline). Joint work between Peking University and Alibaba Tongyi Lab.

Paper

arXiv: 2603.29957

coding · reasoning · research