Two-stage reinforcement learning framework for LLMs that surpasses DeepSeek-R1-Zero-32B on AIME24 and LiveCodeBench while using only 1/10 of that model's training steps.

Paper

arXiv: 2504.14286

reasoning, training, efficiency

Related