TTRL: Test-Time Reinforcement Learning

A method for reinforcement learning on unlabeled data: it trains LLM reasoning at test time with no ground-truth rewards. The core insight is that common test-time-scaling practices, especially majority voting (maj@n), yield surprisingly effective reward estimates that can drive RL, letting a model self-evolve from its pretrained priors on the very (unlabeled) test data it faces.
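The majority-voting idea can be illustrated with a minimal sketch. It assumes that final answers have already been extracted from n sampled responses to one prompt, and that each response gets a binary reward of 1 if it agrees with the most common answer and 0 otherwise. The function name `majority_vote_rewards` and the 0/1 reward shape are illustrative assumptions, not taken from the released code.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Estimate a pseudo-label and per-sample rewards for one prompt.

    `answers` holds the final answers extracted from n sampled responses.
    The binary 0/1 reward is an assumption made for this sketch.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common(1) returns the most frequent answer; ties are broken
    # in favor of the answer that was seen first.
    pseudo_label, _ = counts.most_common(1)[0]
    # Reward each sample by agreement with the majority-voted answer.
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Five hypothetical sampled answers for a single unlabeled question.
samples = ["42", "17", "42", "42", "9"]
label, rewards = majority_vote_rewards(samples)
print(label, rewards)
```

In an RL loop, rewards of this kind would stand in for the ground-truth check that is unavailable on unlabeled test data; how they feed into the policy update is outside the scope of this sketch.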

TTRL boosts Qwen-2.5-Math-7B pass@1 by ~211% on AIME 2024 using only unlabeled test data, with consistent gains across tasks and models. Remarkably, although supervised only by the maj@n signal, it approaches the performance of models trained directly on ground-truth labels. Published at NeurIPS 2025 by authors from Tsinghua University and Shanghai AI Lab (equal-contribution leads Kaiyan Zhang and Ganqu Cui; senior author Bowen Zhou); code is released as PRIME-RL/TTRL.

Paper (arXiv) · GitHub

Paper

Venue: NeurIPS 2025

Authors: Yuxin Zuo · Kaiyan Zhang · Ganqu Cui · Ning Ding · Bowen Zhou

reinforcement-learning · post-training · reasoning · research
