A comprehensive multilingual dataset focused on "Silk Road" languages, containing over 300 billion tokens. It includes major language subsets for Thai, Russian, Arabic, Korean, and Vietnamese to support globally aligned large models.

Paper

arXiv: 2501.14506

training-data, training, nlp

Related