ZR1-1.5B
Compact reasoning model post-trained from DeepSeek-R1-Distill-Qwen-1.5B using PRIME (Process Reinforcement through IMplicit rEwards) with token-level RLOO on ~400k math and ~25k code samples (NuminaMath-CoT, APPS, CodeContests, TACO, Codeforces). The maximum generation length was ramped from 12k to 24k tokens over training.
Self-reported results: 88.34% on MATH-500, 37.91% on GPQA Diamond, and ~40% on LeetCode, a relative improvement of more than 50% over the R1-Distill base at the same parameter count. An applied study in token-efficient reasoning at small scale. Not currently scored on Artificial Analysis.
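The RLOO component mentioned above uses a leave-one-out baseline: for each prompt, several rollouts are sampled, and each rollout's advantage is its reward minus the mean reward of the other rollouts. A minimal sketch of that baseline computation (illustrative only; names are hypothetical and this is not PRIME's actual implementation, which applies rewards at the token level via an implicit process reward model):

```python
from typing import List

def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages for K rollouts of one prompt.

    Each rollout's baseline is the mean reward of the other K-1
    rollouts, so the advantages always sum to zero.
    """
    k = len(rewards)
    assert k > 1, "RLOO needs at least two rollouts per prompt"
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

# Example: binary correctness rewards for 4 sampled solutions
advantages = rloo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct rollouts receive positive advantages and incorrect ones negative, without needing a learned value function as a baseline.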
Model Details
Architecture: Dense
Parameters: 1.5B
Base model: DeepSeek-R1-Distill-Qwen-1.5B
Benchmark Scores
| Benchmark | Score |
|---|---|
| MATH-500 | 88.34% |
| GPQA Diamond | 37.91% |
| LeetCode | ~40% |