OpenBMB's L2-selected open pretraining dataset and the filtering pipeline behind it. About 1.1T tokens (roughly 1T English and 120B Chinese), distilled from FineWeb and Chinese FineWeb. Apache 2.0.

The companion paper "Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data" introduces an efficient verification strategy that estimates a candidate filter's downstream training impact at minimal compute, plus a lightweight fastText classifier that reproduces the resulting selection at scale. LLMs trained on Ultra-FineWeb show significant gains over their FineWeb baselines across multiple benchmarks.
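As a rough illustration of classifier-based selection, the sketch below keeps only the documents whose quality score clears a threshold. It is not the paper's pipeline: `filter_by_quality`, `toy_score`, and the 0.8 threshold are hypothetical names and values chosen for the example, and `toy_score` is a crude stand-in for a trained fastText model's probability of the "high quality" label.

```python
from typing import Callable, Iterable, Iterator


def filter_by_quality(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield only the documents whose quality score reaches the threshold."""
    for doc in docs:
        if score_fn(doc) >= threshold:
            yield doc


def toy_score(text: str) -> float:
    """Hypothetical stand-in for a trained classifier's quality probability.

    Returns the fraction of whitespace-separated tokens that are purely
    alphabetic. A real pipeline would call a trained model here instead.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(t.isalpha() for t in tokens) / len(tokens)


docs = [
    "Gradient descent updates parameters along the negative gradient",
    "$$$ click >>> here !!! 1234",
    "",
]

# Assumed threshold; in practice it would be tuned against held-out data.
kept = list(filter_by_quality(docs, toy_score, threshold=0.8))
print(kept)
```

Because the scorer is passed in as a plain callable, the same filtering loop works unchanged if the toy heuristic is swapped for a real model's prediction function.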

Ultra-FineWeb sits at the L2 layer of OpenBMB's broader UltraData L0–L4 tiered framework; Ultra-FineWeb-L3 (May 2026) refines this corpus into synthetic Q&A pairs and multi-style rewrites. By Yudong Wang, Zixuan Fu, Jie Cai, Peijun Tang, Hongya Lyu, Yewei Fang, Zhi Zheng, Jie Zhou, Guoyang Zeng, Chaojun Xiao, Xu Han, and Zhiyuan Liu.

Paper

Authors: Yudong Wang · Zixuan Fu · Jie Cai · Peijun Tang · Hongya Lyu · Yewei Fang · Zhi Zheng · Jie Zhou · Guoyang Zeng · Chaojun Xiao · Xu Han · Zhiyuan Liu

Dataset

Size: about 1.1T tokens (1T English + 120B Chinese)
Format: text
License: Apache 2.0
Downloads: 25
Languages: English, Chinese
Tags: training-data, training, multilingual, open-weight

Related