InstructGPT (RLHF)
paper"Training language models to follow instructions with human feedback" — introduced the three-stage RLHF pipeline that became industry standard: (1) supervised fine-tuning (SFT) on human demonstrations, (2) training a reward model on human preference comparisons, (3) optimizing the policy with PPO against the reward model. Applied to GPT-3 (1.3B and 175B variants).
InstructGPT showed that a 1.3B model aligned with RLHF could be preferred by human evaluators over the unaligned 175B GPT-3, demonstrating that alignment is not just a matter of scale. This pipeline (SFT → RM → PPO) was the direct precursor to ChatGPT and has since been adopted by virtually every major aligned LLM. By Ouyang, Wu, et al.
Paper
arXiv: 2203.02155
Venue: NeurIPS 2022