Topic Over Source: The Key to Effective Data Mixing for LLM Pre-training
Proposes topic-based data partitioning as a replacement for the standard source-based approach to pretraining data mixing. Detailed topic labels are generated through a multi-stage pipeline: unsupervised clustering, LLM-based summarization, and supervised classifier training. Presents what the authors describe as the first comprehensive comparison of topic-based and source-based partitioning across multiple mixing strategies (RegMix, DoReMi, temperature-based sampling).
Theoretical and empirical analysis shows that topic-based mixing achieves significantly lower validation loss than source-based approaches, with consistent downstream improvements. The authors release code, annotated datasets, and topic classification models. By Peng, Zhuang, Qiu, Ma, Yu, Zhu, and Conghui He (Shanghai AI Laboratory).
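To make one of the compared mixing strategies concrete, here is a minimal sketch of temperature-based sampling applied to topic-level partitions. It assumes the common formulation in which each partition's sampling weight is proportional to its token count raised to 1/T; the paper's exact variant may differ. The function name, topic names, and token counts are hypothetical and chosen for illustration only.

```python
def temperature_mixture(token_counts, temperature=2.0):
    """Return sampling weights proportional to count ** (1 / temperature).

    Assumed formulation (not taken from the paper): T = 1 reproduces the
    natural proportions, and larger T flattens the mixture toward uniform.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Rescale each partition's size, then normalize so the weights sum to 1.
    scaled = {
        name: count ** (1.0 / temperature)
        for name, count in token_counts.items()
    }
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


# Hypothetical topic partitions with made-up token counts.
topic_counts = {"science": 900, "sports": 100}

natural = temperature_mixture(topic_counts, temperature=1.0)
flattened = temperature_mixture(topic_counts, temperature=2.0)
```

With these made-up counts, T = 1 keeps the natural 0.9 / 0.1 split, while T = 2 moves it to 0.75 / 0.25, upweighting the smaller topic. The same function works unchanged for source-based partitions, since only the keys of the count dictionary change.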