Largest fully open pretraining corpus at the time of its release: roughly 3T tokens drawn from Common Crawl (2,180B tokens), GitHub code (342B), Reddit (80B), Semantic Scholar/peS2o (57B), and Project Gutenberg plus Wikipedia (~9B). Ships with the Dolma Toolkit for reproducible data curation (tagging, quality filtering, deduplication, and mixing). The corpus later evolved through Dolma 1.7 (used for OLMo 2) and Dolma 3 (used for OLMo 3; a ~9.3T-token pool that adds PDF text extracted with olmOCR). Released under the ODC-BY license.
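
To make the curation step concrete, below is a minimal, self-contained Python sketch of the tag-then-filter pattern the Dolma Toolkit implements over gzipped JSONL shards. It is an illustration only, not the toolkit's API: the document schema (`id`, `text`, `attributes`), the quality thresholds, and the file paths are all assumptions for this example.

```python
import gzip
import json
from pathlib import Path

def quality_ok(text: str) -> bool:
    """Toy quality heuristic: keep documents of reasonable length
    whose words are mostly alphabetic. Thresholds are illustrative,
    not Dolma's actual filtering rules."""
    words = text.split()
    if not (50 <= len(words) <= 100_000):
        return False
    alpha = sum(1 for w in words if any(c.isalpha() for c in w))
    return alpha / len(words) >= 0.8

def curate_shard(src: Path, dst: Path) -> None:
    """Read one gzipped JSONL shard of {"id": ..., "text": ...} docs,
    drop exact-ID duplicates, tag passing docs with a quality
    attribute, and write only the survivors."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    seen_ids = set()  # trivial within-shard dedup for illustration
    with gzip.open(src, "rt", encoding="utf-8") as fin, \
         gzip.open(dst, "wt", encoding="utf-8") as fout:
        for line in fin:
            doc = json.loads(line)
            if doc["id"] in seen_ids:
                continue
            seen_ids.add(doc["id"])
            if quality_ok(doc["text"]):
                doc["attributes"] = {"quality_ok": True}
                fout.write(json.dumps(doc) + "\n")

if __name__ == "__main__":
    # Hypothetical shard paths; real runs fan this out over many shards.
    curate_shard(Path("raw/shard-0000.jsonl.gz"),
                 Path("clean/shard-0000.jsonl.gz"))
```

The real toolkit runs taggers and Bloom-filter deduplication at corpus scale from declarative configs; this sketch only shows the shape of a single shard-level pass.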

Paper: arXiv:2402.00159

Dataset: huggingface.co/datasets/allenai/dolma

GitHub Repository: github.com/allenai/dolma
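
For a quick look at the data without downloading the full corpus, a streaming load via the Hugging Face `datasets` library is sketched below. The config name (`v1_7`) and the `id`/`text` fields are assumptions; the dataset repo uses a loading script, so `trust_remote_code=True` may be required. Check the dataset card for current loading instructions.

```python
from datasets import load_dataset

# Stream documents instead of materializing ~3T tokens on disk.
# Config name and record schema are assumptions for this sketch.
ds = load_dataset(
    "allenai/dolma",
    name="v1_7",
    split="train",
    streaming=True,
    trust_remote_code=True,  # the repo ships a loading script
)

# Peek at the first few documents.
for i, doc in enumerate(ds):
    print(doc["id"], doc["text"][:80].replace("\n", " "))
    if i == 4:
        break
```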

Tags: data, open-source
