RefinedWeb
dataset
~5 trillion token open web corpus built from CommonCrawl, demonstrating that properly filtered and deduplicated web data alone can outperform curated corpora. A 600B token public extract is released under ODC-By 1.0. Training data for all Falcon models.
Paper
arXiv: 2306.01116