A scalable pipeline for curating LLM pretraining data from Common Crawl. It produces ~3.17 trillion tokens across four data types: general web text, code, math, and QA. The pipeline has three modules: Collection (extraction and deduplication), Filtering (quality classifiers and heuristic rules), and Extraction (domain-specific parsers that pull code, math, and QA content out of web pages).
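The three-stage flow can be sketched as a simple function composition. This is a hypothetical illustration only: the stage names, heuristics, and bucket labels below are stand-ins, not the paper's actual implementation or API.

```python
def collect(pages):
    """Collection: extract text and deduplicate by exact content hash (illustrative)."""
    seen, out = set(), []
    for page in pages:
        text = page.strip()
        h = hash(text)
        if text and h not in seen:
            seen.add(h)
            out.append(text)
    return out

def filter_quality(docs, min_len=20):
    """Filtering: a toy length rule standing in for quality classifiers."""
    return [d for d in docs if len(d) >= min_len]

def extract(docs):
    """Extraction: route each document to a domain bucket with toy heuristics."""
    buckets = {"code": [], "math": [], "qa": [], "web": []}
    for d in docs:
        if "def " in d or "import " in d:
            buckets["code"].append(d)
        elif "\\frac" in d or "theorem" in d or "=" in d:
            buckets["math"].append(d)
        elif d.endswith("?"):
            buckets["qa"].append(d)
        else:
            buckets["web"].append(d)
    return buckets

def pipeline(pages):
    # Collection -> Filtering -> Extraction, as in the three-module design.
    return extract(filter_quality(collect(pages)))
```

The real system replaces each toy heuristic with trained classifiers and domain-specific parsers, but the staged structure is the same.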

Open-sourced with full reproduction scripts. All datasets are verified to be comparable in scale and quality to Microsoft's internal pretraining data. By Chang, Cui, Dong, Huang, Wei et al. (Microsoft Research).

Paper

Dataset

data, open-source, infrastructure