A framework for constructing meaningful domain taxonomies over web corpora. It introduces two taxonomies, topic and format, with 24 categories each, and distills Llama-3.1-405B annotations into efficient 140M-parameter classifiers. The resulting domain labels enable principled data curation by flexibly up- or down-sampling domains for pre-training.
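
To make the up/down-sampling idea concrete, here is a minimal sketch of reweighting a corpus toward a target domain mixture. It is an illustration only, not the paper's implementation: the function names `domain_weights` and `resample`, the domain labels, and the target mixture are all hypothetical, and documents are assumed to already carry a domain label from a classifier.

```python
import random
from collections import Counter


def domain_weights(labels, target_mix):
    """Per-domain multiplier: target share divided by observed share.

    A value above 1 up-samples the domain, below 1 down-samples it.
    Domains missing from target_mix get weight 0 (dropped).
    """
    counts = Counter(labels)
    total = len(labels)
    return {d: target_mix.get(d, 0.0) / (counts[d] / total) for d in counts}


def resample(docs, labels, target_mix, k, seed=0):
    """Draw k documents with replacement so domains approximate target_mix."""
    weights = domain_weights(labels, target_mix)
    per_doc = [weights[label] for label in labels]
    rng = random.Random(seed)  # fixed seed for reproducibility
    return rng.choices(docs, weights=per_doc, k=k)


# Hypothetical toy corpus: 10 "science" documents and 30 "news" documents.
labels = ["science"] * 10 + ["news"] * 30
docs = [f"doc{i}" for i in range(len(labels))]
target = {"science": 0.5, "news": 0.5}  # assumed target mixture

weights = domain_weights(labels, target)
sample = resample(docs, labels, target, k=1000)
```

In this toy setup "science" makes up a quarter of the corpus, so reaching a 50/50 mixture means drawing each science document at twice its natural rate and each news document at two thirds of it.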

Paper

arXiv: 2502.10341

data, research

Related