Proximal Policy Optimization (PPO)
Introduced Proximal Policy Optimization, a family of policy gradient methods that alternate between sampling data through interaction with the environment and optimizing a "surrogate" objective function using stochastic gradient ascent. PPO uses a clipped surrogate objective that prevents destructively large policy updates while remaining simple to implement.
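The clipped surrogate objective can be sketched in a few lines. This is a minimal, framework-free illustration (the function name and plain-list inputs are hypothetical, not from the paper): it computes the probability ratio r_t(θ) = π_new(a_t|s_t) / π_old(a_t|s_t) from log-probabilities, clips it to [1−ε, 1+ε], and takes the pessimistic minimum of the clipped and unclipped terms, averaged over samples.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized).

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and data-collecting policies; advantages: estimates of A_t.
    A standalone sketch, assuming per-sample scalar inputs.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                      # r_t(theta)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped * adv)               # pessimistic bound
    return total / len(advantages)
```

With an unchanged policy the ratio is 1 and the objective reduces to the mean advantage; when the ratio drifts beyond 1±ε, clipping removes the incentive to move further, which is what keeps individual updates small.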
PPO became the default reinforcement learning algorithm for RLHF in language model alignment, used in InstructGPT, ChatGPT, and many other aligned LLMs. One of the most widely adopted RL algorithms ever created, with applications spanning robotics, game playing, and LLM post-training. By Schulman, Wolski, Dhariwal, Radford, and Klimov.
Paper
arXiv: 1707.06347