Ultra-FineWeb-L3
L3 refinement of Ultra-FineWeb: the same L2-curated text rewritten into more learnable training samples with explicit reasoning signals. It totals 400B+ English and 200B+ Chinese tokens, organized into two synthesis tracks.
- Q&A Synthetic (245B English / 118B Chinese): each web document is rewritten as "original + multiple Q&A pairs" with self-contained questions over core concepts, factual details, and logical relations. The original text is prepended during training so the synthetic Q&A acts as an explicit knowledge-organization signal rather than a replacement.
- Multi-Style Synthetic (164B English / 82B Chinese): single-source content rewritten into encyclopedia (modular, concise, objective), textbook ("definition → theorem → proof → example"), blog (conversational with analogies), and abstract (core-argument extraction) styles, broadening expression coverage over the same factual base.
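As a rough illustration of the Q&A track, the sketch below assembles one training sample by placing the original document ahead of its synthetic Q&A pairs, and sums the per-track token counts listed above. The names `QAPair`, `build_qa_sample`, and `total_tokens_b`, and the `Question:` / `Answer:` layout, are hypothetical; the dataset's actual field names and serialization are not specified here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QAPair:
    """One synthetic question/answer pair (hypothetical structure)."""
    question: str
    answer: str


def build_qa_sample(original: str, pairs: List[QAPair]) -> str:
    """Prepend the original document to its synthetic Q&A pairs.

    The "Question:/Answer:" layout is an assumption for illustration only.
    """
    qa_blocks = [f"Question: {p.question}\nAnswer: {p.answer}" for p in pairs]
    return "\n\n".join([original.strip(), *qa_blocks])


# Per-track token counts from the list above, in billions of tokens.
TRACK_TOKENS_B = {
    "qa_synthetic": {"en": 245, "zh": 118},
    "multi_style": {"en": 164, "zh": 82},
}


def total_tokens_b(lang: str) -> int:
    """Sum one language's token count (in billions) across both tracks."""
    return sum(track[lang] for track in TRACK_TOKENS_B.values())
```

Keeping the source text first means the Q&A pairs read as an organized recap of material the model has just seen, which matches the stated intent that they supplement the original rather than replace it.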
The Chinese portion is reportedly the largest open-source Chinese pre-training synthetic dataset to date. The dataset is released under Apache 2.0 and sits at the L3 layer of OpenBMB's UltraData L0–L4 tiered framework, refining the L2 selection from Ultra-FineWeb (June 2025).
Dataset
Size: 400B+ English + 200B+ Chinese tokens (synthetic refinement)
Format: text
License: Apache 2.0
Languages: English, Chinese