WanJuan 3.0 (WanJuan-SiLu)
dataset
A comprehensive multilingual dataset focusing on "Silk Road" languages, containing over 300 billion tokens. It features major language subsets for Thai, Russian, Arabic, Korean, and Vietnamese to support globally aligned large models.
Paper
arXiv: 2501.14506