InstructGPT (RLHF)
paper"Training language models to follow instructions with human feedback" — introduced the three-stage RLHF pipeline that became industry standard: (1) supervised fine-tuning (SFT) on human demonstrations, (2) training a reward model on human preference comparisons, (3) optimizing the policy with PPO against the reward model. Applied to GPT-3 (1.3B and 175B variants).
InstructGPT showed that a 1.3B model aligned with RLHF could be preferred by human evaluators over the unaligned 175B GPT-3, demonstrating that alignment is not just a matter of scale. This pipeline (SFT → RM → PPO) was the direct precursor to ChatGPT and has since been adopted by virtually every major aligned LLM. By Ouyang, Wu, et al.
Paper
arXiv: 2203.02155
Venue: NeurIPS 2022