AI Lab Tracker
FineWeb-Mask
dataset
2025-12-31
ByteDance
A 1.5-trillion-token "distilled" subset of Common Crawl data, optimized for pre-training.
HuggingFace
Paper (arXiv)
training-data
training