The original MiniCPM series proved that small models can rival much larger ones. This work introduced the **Warmup-Stable-Decay (WSD)** learning rate scheduler, which popularized the concept of **midtraining** (or annealing). By holding a high learning rate through a long "stable" phase and decaying only in the final 10% of training, while introducing high-quality data during the decay, the 2.4B model reached performance parity with 7B-13B models. The scheduler also enables continuous training and efficient scaling-law research without a pre-defined token budget.
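As a rough illustration, a WSD schedule can be sketched as below. The function name, phase fractions, and the linear decay shape are illustrative assumptions, not the paper's exact formulation (MiniCPM explored several decay forms):

```python
def wsd_lr(step, total_steps, peak_lr,
           warmup_frac=0.01, decay_frac=0.10, final_lr_frac=0.1):
    """Warmup-Stable-Decay sketch: linear warmup, a long constant plateau,
    then a decay over the final `decay_frac` of training.
    All fractions here are assumed values for illustration."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        # linear warmup from 0 to peak_lr
        return peak_lr * step / max(warmup_steps, 1)
    if step < decay_start:
        # stable phase: hold the high learning rate
        return peak_lr
    # decay phase: linear anneal toward a small final LR
    # (high-quality data is mixed in during this phase)
    progress = (step - decay_start) / (total_steps - decay_start)
    return peak_lr * (1 - progress * (1 - final_lr_frac))
```

Because the stable phase has no built-in horizon, a checkpoint taken there can be resumed and decayed at any later token count, which is what makes continuous training and scaling-law sweeps cheap.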

Outputs

MiniCPM-1B / 2B

model
Architecture DENSE

Variants

| Name | Parameters | Notes |
|---|---|---|
| MiniCPM-1B | 1B | |
| MiniCPM-2B | 2B | |

MiniCPM: Unveiling the Potential of End-Side Large Language Models

paper

arXiv: 2404.06395

MiniCPM-MoE-8x2B

model

MoE version delivering 7B-class performance with significantly lower active parameter costs.

Architecture MOE
on-device · efficiency · open-weight · moe

Related