sDPO
Stepwise DPO: partitions the preference data into T ordered chunks and runs DPO T times sequentially, using each step's output model as the reference model for the next step. An easy-to-hard ordering of the chunks (sorted by the reference model's reward accuracy) progressively tightens the optimization's lower bound, creating a natural curriculum.
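A minimal sketch of the stepwise loop, assuming PyTorch-style models and two hypothetical helpers not in the paper: `dpo_step` (one ordinary DPO run against a frozen reference) and `reward_accuracy` (fraction of preference pairs whose chosen response a model's implicit reward ranks above the rejected one). Ordering and training details follow the paper only loosely.

```python
import copy

def sdpo(policy, chunks, reward_accuracy, dpo_step):
    """Stepwise DPO: T sequential DPO runs, one per data chunk.

    policy          -- initial SFT model (assumed torch.nn.Module)
    chunks          -- preference data already partitioned into T chunks
    reward_accuracy -- hypothetical: fn(model, chunk) -> accuracy in [0, 1]
    dpo_step        -- hypothetical: fn(policy, reference, chunk) -> policy,
                       a standard DPO run with `reference` held frozen
    """
    reference = copy.deepcopy(policy)  # step 1 uses the SFT model as reference
    # Easy-to-hard curriculum: chunks the initial reference already gets
    # mostly right come first (high reward accuracy = easy).
    ordered = sorted(chunks,
                     key=lambda c: reward_accuracy(reference, c),
                     reverse=True)
    for chunk in ordered:
        policy = dpo_step(policy, reference, chunk)  # standard DPO on this chunk
        reference = copy.deepcopy(policy)            # this step's output becomes
        reference.eval()                             # the next (frozen) reference
    return policy
```

Refreshing the reference each step means later, harder chunks are optimized against an already partially aligned model rather than the original SFT model, which is what tightens the bound step by step.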
On SOLAR 10.7B, sDPO achieved an H4 score of 74.31 (vs. 72.67 for standard DPO), outperforming Mixtral 8x7B-Instruct (73.40) and SOLAR-0-70B (72.93), and improved EQ-Bench by +7.81 points. The method requires no additional data and is a drop-in replacement for standard DPO.
Paper
arXiv: 2403.19270