TxT360
dataset
First pretraining corpus to globally deduplicate 99 Common Crawl snapshots together with 14 high-quality data sources (FreeLaw, PG-19, etc.), producing ~5T tokens of deduplicated text. An upsampling recipe expands this into a 15T+ token training corpus that outperforms FineWeb 15T on several key metrics. The filtered variant, TxT360-BestOfWeb, applies the ProX document-filter model plus format scoring to retain roughly the top 22% of pages by quality.
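The core idea of *global* deduplication is to hash documents from all sources into one shared index, so a page duplicated between, say, Common Crawl and FreeLaw is dropped, rather than deduplicating each source in isolation. A minimal sketch of exact-match global dedup with content hashes (the function names and normalization are illustrative, not TxT360's actual pipeline, which also includes fuzzy dedup):

```python
import hashlib

def doc_hash(text: str) -> str:
    # Lightly normalize, then hash the full document text.
    return hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()

def global_dedup(sources: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Deduplicate documents across *all* sources jointly.

    A single `seen` set is shared across every source, so a document
    appearing in two different sources is kept only once.
    """
    seen: set[str] = set()
    kept: list[tuple[str, str]] = []
    for source_name, docs in sources.items():
        for text in docs:
            h = doc_hash(text)
            if h not in seen:
                seen.add(h)
                kept.append((source_name, text))
    return kept

corpus = {
    "common_crawl": ["the cat sat.", "unique web page."],
    "freelaw": ["the cat sat.", "a court opinion."],
}
deduped = global_dedup(corpus)  # cross-source duplicate kept once
```

Per-source dedup would keep both copies of the shared document here; the shared hash set is what makes the dedup global.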
Part of the LLM360 open-source initiative (MBZUAI + Petuum) focused on fully transparent LLM development. Used to train the K2 model family. By Tang, Ranjan, Pangarkar, Liang, Wang, Ma, Liu, and Xing (MBZUAI / LLM360).