Systematic data curation framework with L0-L9 quality taxonomy. Darwin-Science: 900B-token scientific corpus (+5.60/+8.40 on domain tasks). Darwin-CC: 504B tokens from 672B across 8 categories, 30 iterations per category. Surpasses DCLM, Ultra-FineWeb, and FineWeb-Edu.

Darwin-CC: 1.02B Hugging Face downloads, 3K+ likes.

Paper

arXiv: 2602.07824

Dataset

GitHub Repository

data, open-source, research

Related