Ultra-FineWeb-L3

L3 refinement of Ultra-FineWeb: the same L2-curated text, rewritten into more learnable training samples with explicit reasoning signals. Totals 400B+ English and 200B+ Chinese tokens, organized into two synthesis tracks.

Q&A Synthetic (245B English / 118B Chinese): each web document is rewritten as "original + multiple Q&A pairs" with self-contained questions over core concepts, factual details, and logical relations. The original text is prepended during training so the synthetic Q&A acts as an explicit knowledge-organization signal rather than a replacement.
Multi-Style Synthetic (164B English / 82B Chinese): each source document is rewritten into encyclopedia (modular, concise, objective), textbook ("definition → theorem → proof → example"), blog (conversational with analogies), and abstract (core-argument extraction) styles, broadening expression coverage over the same factual base.
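
As an illustration of the Q&A track's "original + multiple Q&A pairs" layout, here is a minimal Python sketch of how one training sample might be assembled, with the original text first and the synthetic pairs after it. The names `QAPair` and `build_qa_sample`, the "Question:" / "Answer:" labels, and the blank-line separator are assumptions made for this example; the source does not specify the dataset's actual serialization.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    """One synthetic question/answer pair (hypothetical structure)."""
    question: str
    answer: str


def build_qa_sample(original: str, qa_pairs: list[QAPair]) -> str:
    """Return the original document followed by its synthetic Q&A pairs.

    The original text comes first, so the Q&A pairs supplement it
    rather than replace it. The labels and separators are assumptions.
    """
    parts = [original.strip()]
    for pair in qa_pairs:
        parts.append(
            f"Question: {pair.question.strip()}\nAnswer: {pair.answer.strip()}"
        )
    return "\n\n".join(parts)


# Toy example with made-up content, not taken from the dataset.
sample = build_qa_sample(
    "Water boils at 100 degrees Celsius at sea-level pressure.",
    [
        QAPair(
            "At what temperature does water boil at sea-level pressure?",
            "100 degrees Celsius.",
        )
    ],
)
print(sample)
```

Keeping the original text at the head of every sample means a model trained on this layout sees the source facts before the questions that organize them, which matches the description above of Q&A as a signal rather than a substitute.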

The Chinese portion is reportedly the largest open-source synthetic Chinese pre-training dataset to date. Released under Apache 2.0. It sits at the L3 layer of OpenBMB's UltraData L0–L4 tiered framework, refining the L2 selection from Ultra-FineWeb (June 2025).

Dataset (HuggingFace)

Ultra-FineWeb paper (arXiv, L2 methodology)

Dataset

Size: 400B+ English + 200B+ Chinese tokens (synthetic refinement)

Format: text

License: Apache 2.0

Languages: English, Chinese

Tags: training-data, training, multilingual, open-weight
