Sci-Base
Open scientific pretraining corpus from OpenDataLab (Shanghai AI Lab ecosystem). 25M+ scientific documents / 600B+ tokens / 3.87 TB across 10 disciplines, parsed end-to-end with MinerU. Knowledge cutoff March 2026. CC-BY 4.0.
Comparable in scale to FineWeb-class general corpora but restricted to science; positioned as a foundation-scale dataset for training science foundation models (science-FM) and research-assistant models. Companion releases: SA-RxnDiagram-15k (reaction diagrams) and SA-Prot-annot (protein annotations).
Dataset
Size: 600B+ tokens / 3.87 TB / 25M+ documents
Format: text
License: CC-BY 4.0
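
For a rough sense of what the headline figures imply per document, the sketch below derives simple averages from them. It assumes the "+" figures are taken at their stated lower bounds and that "TB" means decimal terabytes (10^12 bytes); the card specifies neither, so the results are order-of-magnitude estimates only. The function name per_document_averages is purely illustrative.

```python
# Headline figures from the card. The "+" suffixes mean these are lower bounds.
DOCUMENTS = 25_000_000          # 25M+ documents
TOKENS = 600_000_000_000        # 600B+ tokens
SIZE_BYTES = 3.87e12            # 3.87 TB; assumption: 1 TB = 10**12 bytes


def per_document_averages(documents, tokens, size_bytes):
    """Return simple averages derived from corpus-level totals."""
    return {
        "tokens_per_document": tokens / documents,
        "bytes_per_document": size_bytes / documents,
        "bytes_per_token": size_bytes / tokens,
    }


averages = per_document_averages(DOCUMENTS, TOKENS, SIZE_BYTES)
for name, value in averages.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions the averages come out to about 24,000 tokens and roughly 155 KB per document.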