The largest Japanese web training corpus: 312.1B characters across 173M pages drawn from Common Crawl, Wikipedia, and archived research reports (KAKEN). Includes filtering specialized for Japanese text quality. Evolved through versions v1–v4 across LLM-jp model generations.

Paper

arXiv: 2404.17733

Dataset

GitHub Repository

data · open-source · multilingual

Related