Tülu 3 is a post-training framework that introduces RLVR (Reinforcement Learning with Verifiable Rewards). It is applied to Llama 3.1 at 8B, 70B, and 405B parameters. The resulting models surpass the instruct versions of Llama 3.1, Qwen 2.5, and Mistral, as well as the closed models GPT-4o mini and Claude 3.5 Haiku.
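To illustrate the idea behind verifiable rewards, here is a minimal sketch, assuming a math-style task where the correct answer is a known number. The function names (`extract_final_answer`, `verifiable_reward`), the last-number extraction rule, and the reward value of 1.0 are all hypothetical choices for illustration, not the framework's actual code or settings.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last number appearing in the completion, or None.

    Assumption: the final number in the text is the model's answer.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else None


def verifiable_reward(completion: str, ground_truth: str,
                      reward_value: float = 1.0) -> float:
    """Give a fixed reward only when the answer can be verified as correct.

    There is no learned reward model here: the reward comes from a
    deterministic check against a known ground-truth answer.
    `reward_value` is an illustrative constant.
    """
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return reward_value if float(answer) == float(ground_truth) else 0.0


if __name__ == "__main__":
    print(verifiable_reward("2 + 2 = 4, so the answer is 4", "4"))
    print(verifiable_reward("I think the answer is 5.", "4"))
```

In a reinforcement learning loop, a score like this would stand in for the output of a reward model, so the policy is rewarded only for completions whose correctness can be checked programmatically.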

At release, no model in the LMSYS Chatbot Arena top 50 had published its post-training data. Tülu 3 releases all datasets, training code, and recipes, and includes comprehensive decontamination of open datasets. Licensed under Apache 2.0.

Paper

arXiv: 2411.15124

open-source, training, alignment, research

Related