Topic Over Source: The Key to Effective Data Mixing for LLM Pre-training

The paper proposes topic-based data partitioning as a replacement for the standard source-based approach to pre-training data mixing. Detailed topic labels are generated through unsupervised clustering, LLM-based summarization, and supervised classifier training. It presents what the authors describe as the first comprehensive comparison of the two partitioning schemes across multiple mixing strategies (RegMix, DoReMi, temperature-based sampling).

Theoretical and empirical analysis shows that topic-based mixing achieves significantly lower validation loss than source-based mixing, with consistent downstream improvements. The authors release code, annotated datasets, and topic classification models. Authors: Peng, Zhuang, Qiu, Ma, Yu, Zhu, and Conghui He (Shanghai AI Laboratory).
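
Of the mixing strategies named in the summary, temperature-based sampling is the simplest to illustrate. The sketch below uses a common formulation in which each partition's sampling weight is proportional to its token share raised to the power 1/T; it is not taken from the paper, and the function name, topic names, and token counts are hypothetical. The same function applies whether the partitions are topics or sources, since only the keys change.

```python
def temperature_mixing_weights(token_counts, temperature=2.0):
    """Return sampling weights proportional to (share of tokens) ** (1 / T).

    token_counts: dict mapping a partition name (e.g. a topic) to its token count.
    temperature:  T = 1 reproduces the natural proportions; larger T flattens
                  the distribution toward uniform.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(token_counts.values())
    if total <= 0:
        raise ValueError("token_counts must contain at least one token")

    # Rescale each partition's share, then renormalize so the weights sum to 1.
    scaled = {
        name: (count / total) ** (1.0 / temperature)
        for name, count in token_counts.items()
    }
    norm = sum(scaled.values())
    return {name: value / norm for name, value in scaled.items()}


# Hypothetical topic partition; the counts are made up for illustration.
topic_tokens = {"science": 800, "law": 200}
proportional = temperature_mixing_weights(topic_tokens, temperature=1.0)
flattened = temperature_mixing_weights(topic_tokens, temperature=2.0)
```

With T = 1 the weights match the raw token shares, while T = 2 moves weight from the large partition to the small one. Learned strategies such as RegMix and DoReMi replace this fixed rule with weights fitted from proxy-model training runs.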

Paper (arXiv)

Citations 1

data, scaling, foundational
