The original MiniCPM series proved that small models can rival much larger ones. This work introduced the **Warmup-Stable-Decay (WSD)** learning rate scheduler, which popularized the concept of **midtraining** (or annealing). By holding a high learning rate through a long "stable" phase and decaying only in the final 10% of training, while introducing high-quality data during the decay, the 2.4B model reached performance parity with 7B-13B models. The scheduler also enables continuous training and efficient scaling-law research without a pre-defined token budget.
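As a rough illustration, a WSD schedule can be sketched as below. The function name, phase fractions, and the linear decay shape are illustrative assumptions, not the paper's exact formulation (MiniCPM explored several decay forms):

```python
def wsd_lr(step, total_steps, peak_lr,
           warmup_frac=0.01, decay_frac=0.10, final_lr_frac=0.1):
    """Warmup-Stable-Decay sketch: linear warmup, a long constant plateau,
    then a decay over the final `decay_frac` of training.
    All fractions here are assumed values for illustration."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        # linear warmup from 0 to peak_lr
        return peak_lr * step / max(warmup_steps, 1)
    if step < decay_start:
        # stable phase: hold the high learning rate
        return peak_lr
    # decay phase: linear anneal toward a small final LR
    # (high-quality data is mixed in during this phase)
    progress = (step - decay_start) / (total_steps - decay_start)
    return peak_lr * (1 - progress * (1 - final_lr_frac))
```

Because the stable phase has no built-in horizon, a checkpoint taken there can be resumed and decayed at any later token count, which is what makes continuous training and scaling-law sweeps cheap.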

Outputs

MiniCPM-1B / 2B

model
Architecture DENSE

Variants

| Name | Parameters | Notes |
|---|---|---|
| MiniCPM-1B | 1B | |
| MiniCPM-2B | 2B | |

MiniCPM: Unveiling the Potential of End-Side Large Language Models

paper

arXiv: 2404.06395

MiniCPM-MoE-8x2B

model

MoE version delivering 7B-class performance with significantly lower active parameter costs.

Architecture MOE
on-device · efficiency · open-weight · moe

Related