Ultra-FineWeb

OpenBMB's L2-selected open pretraining dataset and the filtering pipeline behind it. ~1.1T tokens (about 1T English and 120B Chinese), distilled from FineWeb and Chinese FineWeb. Apache 2.0.

The companion paper "Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data" introduces an efficient verification strategy that estimates a candidate filter's downstream training impact at minimal compute, plus a lightweight fastText classifier that reproduces the resulting selection at scale. LLMs trained on Ultra-FineWeb show significant gains over their FineWeb baselines across multiple benchmarks.
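A minimal sketch of how a fastText-style quality classifier could be applied as a document filter. It is an illustration under assumptions, not the authors' pipeline: the label string `__label__pos`, the 0.5 threshold, and the helper names `quality_score`, `filter_docs`, and `toy_predict` are all hypothetical. The `predict` argument is any callable that returns `(labels, probabilities)` for one line of text, which is the shape fastText's `model.predict` returns.

```python
from typing import Callable, Iterable, Iterator, Sequence, Tuple

# Assumed values for illustration; neither is taken from the paper.
POSITIVE_LABEL = "__label__pos"
THRESHOLD = 0.5

# Shape of a fastText-style predictor: text -> (labels, probabilities).
Predict = Callable[[str], Tuple[Sequence[str], Sequence[float]]]


def quality_score(predict: Predict, text: str) -> float:
    """Return the probability assigned to POSITIVE_LABEL for one document."""
    # fastText predicts one line at a time, so collapse all whitespace
    # (including newlines) into single spaces first.
    flat = " ".join(text.split())
    labels, probs = predict(flat)
    top_label, top_prob = labels[0], float(probs[0])
    # Assuming a binary classifier, the other class holds the remaining mass.
    return top_prob if top_label == POSITIVE_LABEL else 1.0 - top_prob


def filter_docs(
    predict: Predict, docs: Iterable[str], threshold: float = THRESHOLD
) -> Iterator[str]:
    """Yield only the documents whose quality score reaches the threshold."""
    for doc in docs:
        if quality_score(predict, doc) >= threshold:
            yield doc


def toy_predict(text: str) -> Tuple[Sequence[str], Sequence[float]]:
    """Stand-in predictor that favors longer documents. Illustration only."""
    p = min(len(text.split()) / 10.0, 1.0)
    if p >= 0.5:
        return (POSITIVE_LABEL,), (p,)
    return ("__label__neg",), (1.0 - p,)


if __name__ == "__main__":
    sample = [
        "short text",
        "a much longer document with many more words in it than the other\none",
    ]
    for kept in filter_docs(toy_predict, sample):
        print(kept[:40])
```

With the real `fasttext` package, `toy_predict` could be swapped for something like `lambda t: model.predict(t, k=1)` after `model = fasttext.load_model(path)`, where `path` points at a downloaded classifier file. The actual label names and a sensible threshold should be checked against the released classifier rather than taken from this sketch.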

Ultra-FineWeb sits at the L2 layer of OpenBMB's broader UltraData L0–L4 tiered framework; Ultra-FineWeb-L3 (May 2026) refines this corpus into synthetic Q&A pairs and multi-style rewrites. By Yudong Wang, Zixuan Fu, Jie Cai, Peijun Tang, Hongya Lyu, Yewei Fang, Zhi Zheng, Jie Zhou, Guoyang Zeng, Chaojun Xiao, Xu Han, and Zhiyuan Liu.

Paper (arXiv) · Dataset · Classifier

Paper

arXiv HTML

Authors: Yudong Wang · Zixuan Fu · Jie Cai · Peijun Tang · Hongya Lyu · Yewei Fang · Zhi Zheng · Jie Zhou · Guoyang Zeng · Chaojun Xiao · Xu Han · Zhiyuan Liu

Dataset

Size ~1.1T tokens (1T English + 120B Chinese)

Format text

License Apache 2.0

Downloads 30

Languages: English, Chinese

HuggingFace

Tags: training-data · training · multilingual · open-weight
