Policy Discriminative Learning (POLAR) is a pre-training approach that frames reward modeling as distinguishing between policies. It reports significant gains in preference accuracy across tasks, with performance that scales predictably with compute.

Paper

arXiv: 2507.05197

Library

GitHub Repository

reinforcement-learning, reasoning, research