Ultra-FineWeb
OpenBMB's L2-selected open pretraining dataset and the filtering pipeline behind it. Roughly 1T English and 120B Chinese tokens (about 1.1T in total), distilled from FineWeb and Chinese FineWeb. Apache 2.0.
The companion paper "Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data" introduces an efficient verification strategy that estimates a candidate filter's downstream training impact at minimal compute, plus a lightweight fastText classifier that reproduces the resulting selection at scale. LLMs trained on Ultra-FineWeb show significant gains over their FineWeb baselines across multiple benchmarks.
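A minimal sketch of classifier-based filtering as described above, not the authors' pipeline. It assumes a binary fastText model whose positive class is named `__label__pos`; that label name, the 0.5 threshold, and the helper names `fasttext_quality_scorer` and `filter_documents` are all hypothetical. The filter takes any `text -> score` callable, so it can be tried without a trained model.

```python
from typing import Callable, Iterable, Iterator


def fasttext_quality_scorer(
    model_path: str, positive_label: str = "__label__pos"
) -> Callable[[str], float]:
    """Wrap a fastText model as a text -> P(positive_label) function.

    The label name is an assumption; check the labels of the model you load.
    """
    import fasttext  # imported lazily so the filter below works without it

    model = fasttext.load_model(model_path)

    def score(text: str) -> float:
        # fastText's predict() rejects newlines, so flatten the text first.
        labels, probs = model.predict(text.replace("\n", " "), k=-1)
        # Missing label means the model never assigns it; treat as score 0.
        return float(dict(zip(labels, probs)).get(positive_label, 0.0))

    return score


def filter_documents(
    docs: Iterable[str], score: Callable[[str], float], threshold: float = 0.5
) -> Iterator[str]:
    """Yield only the documents whose quality score reaches the threshold."""
    for doc in docs:
        if score(doc) >= threshold:
            yield doc
```

With a trained model, `filter_documents(docs, fasttext_quality_scorer("model.bin"))` would stream the retained documents, where `model.bin` is a placeholder path. Keeping the scorer separate from the filter makes it easy to swap in a different classifier or threshold when comparing candidate filters.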
Sits at the L2 layer of OpenBMB's broader UltraData L0–L4 tiered framework, with Ultra-FineWeb-L3 (May 2026) refining this corpus into synthetic Q&A pairs and multi-style rewrites. By Yudong Wang, Zixuan Fu, Jie Cai, Peijun Tang, Hongya Lyu, Yewei Fang, Zhi Zheng, Jie Zhou, Guoyang Zeng, Chaojun Xiao, Xu Han, and Zhiyuan Liu.