Sci-Base | Lab Index

Open scientific pretraining corpus from OpenDataLab (Shanghai AI Lab ecosystem). 25M+ scientific documents / 600B+ tokens / 3.87 TB across 10 disciplines, parsed end-to-end with MinerU. Knowledge cutoff March 2026. CC-BY 4.0.

Comparable in scale to FineWeb-class general corpora but restricted to science, it is positioned as a foundation-scale dataset for training science foundation models and research-assistant models. Companion releases include SA-RxnDiagram-15k (reaction diagrams) and SA-Prot-annot (protein annotations).

HuggingFace

Dataset

Size: 600B+ tokens / 3.87 TB / 25M+ documents

Format: text

License: CC-BY 4.0
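
The headline figures above imply some rough per-document and per-token averages, which can help when budgeting storage or training steps. The sketch below is only back-of-the-envelope arithmetic: it treats the stated lower bounds (25M+, 600B+) as exact values and assumes decimal terabytes, neither of which the card confirms. The function and variable names are illustrative.

```python
# Rough averages derived from the card's headline figures.
# Assumptions (not stated by the card): the "+" lower bounds are treated
# as exact, and 1 TB is taken to be 10**12 bytes.
DOCUMENTS = 25_000_000
TOKENS = 600_000_000_000
SIZE_BYTES = 3.87e12


def corpus_averages(documents, tokens, size_bytes):
    """Return simple per-document and per-token averages."""
    return {
        "tokens_per_document": tokens / documents,
        "bytes_per_token": size_bytes / tokens,
        "bytes_per_document": size_bytes / documents,
    }


averages = corpus_averages(DOCUMENTS, TOKENS, SIZE_BYTES)
for name, value in averages.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions the corpus averages about 24,000 tokens and roughly 155 KB per document, or about 6.45 bytes per token. Because the published counts are lower bounds, the true averages may differ.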

Tags: training-data, training, science, open-weight
