WanJuan 2.0 (WanJuan-CC)
A 1.0T-token, high-quality English webtext dataset derived from Common Crawl, also known as WanJuan 2.0. It is designed specifically for pre-training large language models, with a focus on safety and high information density.
Paper
arXiv: 2402.19282