An open web corpus of roughly 5 trillion tokens built from CommonCrawl, demonstrating that properly filtered and deduplicated web data alone can yield models that outperform those trained on curated corpora. A 600B-token public extract is released under ODC-By 1.0. Training data for all Falcon models.

Paper

arXiv: 2306.01116

Dataset

data, open-source

Related