Lightning OPD: Efficient Post-Training for Large Reasoning Models
Proposes offline on-policy distillation (OPD), which eliminates the need for live teacher inference during LLM post-training. Identifies the "teacher consistency" condition: using different teacher models for SFT and OPD introduces an irreducible gradient bias that precomputation alone cannot correct. The fix is simple (use the same teacher for both stages), but the failure mode was previously undiagnosed.
Lightning OPD precomputes teacher log-probabilities over the training data, achieving a 4× speedup over standard OPD. Trains Qwen3-8B-Base to 69.9% on AIME 2024 in 30 GPU hours. Substantially lowers the compute barrier for academic-scale reasoning-model post-training. By Yecheng Wu, Song Han, and Han Cai (NVIDIA).
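To make the precompute-then-train pattern concrete, here is a minimal PyTorch sketch of the general idea: score the corpus with the teacher once, cache a compact per-token representation of its distribution, then train the student against the cache with no teacher in the training loop. The top-k cache format, the loss, and all function names are illustrative assumptions, not the paper's published implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def precompute_teacher_logprobs(teacher, input_ids, k=32):
    """One-time offline pass: cache the teacher's top-k per-position
    log-probs so training never needs live teacher inference.
    (Hypothetical cache format; shifting/masking omitted for brevity.)"""
    logits = teacher(input_ids).logits            # (B, T, V)
    logp = F.log_softmax(logits, dim=-1)
    topk_logp, topk_ids = logp.topk(k, dim=-1)    # compact (B, T, k) cache
    return topk_logp, topk_ids

def distill_loss(student, input_ids, topk_logp, topk_ids):
    """Distillation step against the cache: KL(teacher || student),
    restricted to the cached top-k token support."""
    logits = student(input_ids).logits            # (B, T, V)
    student_logp = F.log_softmax(logits, dim=-1)
    student_topk = student_logp.gather(-1, topk_ids)   # align supports
    teacher_p = topk_logp.exp()
    return (teacher_p * (topk_logp - student_topk)).sum(-1).mean()
```

The design point this illustrates: once the cache exists, distillation costs one student forward/backward per batch, which is where a speedup over running the teacher alongside the student would come from.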