Model compression technique that combines depth pruning, width pruning, and knowledge distillation. Compresses Llama 3.1 8B to 4B and Mistral NeMo 12B to 8B, with a 1.2-2.7x speedup and minimal quality loss. Later evolved into MiniPuzzle (used in Nemotron-H 47B). Accepted at ICLR 2025.

Paper

arXiv: 2408.11796

Venue: ICLR 2025

Library

GitHub Repository

efficiency, research

Related