Nemotron-CLIMB: Clustering-based Iterative Data Mixture Bootstrapping
paper, dataset
Automated framework for discovering optimal pretraining data mixtures through semantic clustering and iterative optimization. It clusters the data into 20 semantic groups, uses a proxy model to evaluate candidate mixtures, and searches for the best combination through iterative refinement, replacing manual data curation with a principled optimization loop.
A 1B model trained on the optimized mixture (400B tokens) surpasses Llama-3.2-1B by 2.0%; domain-specific optimization yields 5% gains over random sampling. Releases two datasets: Nemotron-ClimbLab (1.2T tokens, 20 semantic clusters for research) and Nemotron-ClimbMix (400B tokens, optimized mixture for efficient pretraining). NeurIPS 2025.
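The search loop described above (score candidate mixtures over the 20 clusters, then refine) can be illustrated with a toy sketch. This is not the authors' implementation. `proxy_score` is a hypothetical stand-in for training and evaluating a proxy model, and the keep-the-best-and-perturb search is a simple substitute chosen for illustration; every function name and default here is an assumption.

```python
import random

NUM_CLUSTERS = 20  # the description above mentions 20 semantic clusters


def sample_mixture(rng, center=None, spread=1.0):
    """Return non-negative cluster weights that sum to 1.

    With no center, weights are drawn at random; otherwise the center
    mixture is perturbed with Gaussian noise and renormalized.
    """
    if center is None:
        raw = [rng.random() for _ in range(NUM_CLUSTERS)]
    else:
        raw = [max(1e-9, c + rng.gauss(0.0, spread / NUM_CLUSTERS)) for c in center]
    total = sum(raw)
    return [w / total for w in raw]


def proxy_score(mixture):
    """Hypothetical stand-in for 'train a proxy model on this mixture and evaluate it'.

    Toy objective: negative squared distance to an arbitrary target mixture.
    """
    target = [float(i + 1) for i in range(NUM_CLUSTERS)]
    norm = sum(target)
    return -sum((m - t / norm) ** 2 for m, t in zip(mixture, target))


def iterative_search(rounds=5, candidates_per_round=32, keep=4, seed=0):
    """Assumed refinement loop: score candidates, keep the best, resample near them."""
    rng = random.Random(seed)
    pool = [sample_mixture(rng) for _ in range(candidates_per_round)]
    spread = 1.0
    for _ in range(rounds):
        elites = sorted(pool, key=proxy_score, reverse=True)[:keep]
        spread *= 0.5  # narrow the search around promising mixtures each round
        children = [
            sample_mixture(rng, center=rng.choice(elites), spread=spread)
            for _ in range(candidates_per_round)
        ]
        pool = elites + children  # elites are retained, so the best score never drops
    return max(pool, key=proxy_score)


if __name__ == "__main__":
    best = iterative_search()
    print([round(w, 3) for w in best])
```

In a real pipeline the scoring step would dominate the cost, since each candidate mixture requires a proxy training run; the toy objective only makes the loop cheap enough to execute here.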
Outputs 3
Nemotron-ClimbLab (1.2T tokens)
dataset
1.2-trillion-token filtered corpus organized into 20 semantic clusters for data mixture research.
Nemotron-ClimbMix (400B tokens)
dataset
400-billion-token curated mixture optimized for efficient pretraining. Outperforms Llama-3.2-1B training data at equivalent token budgets.