Zyda-2
Open 5-trillion-token pretraining dataset built with NVIDIA NeMo Curator. Combines and re-curates several leading open corpora (FineWeb-Edu, DCLM, Dolma) with cross-source deduplication and quality filtering. Used by Zyphra to train Zamba2 and ZAYA1, and openly released for the broader community.
One of the few open pretraining corpora at the multi-trillion-token scale, alongside FineWeb, DCLM, and RedPajama. ~25k Hugging Face downloads as of May 2026.
Type: Dataset
Size: 5T tokens
Format: text
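
A minimal sketch of streaming the corpus from the Hugging Face Hub, for readers who want to inspect it without downloading all 5T tokens. The repo id `Zyphra/Zyda-2`, the `train` split, and the `text` field (inferred from "Format: text" above) are assumptions; check the dataset card for the exact config and subset names.

```python
# Sketch: stream a few documents from the Hub (assumed repo id and fields).
from datasets import load_dataset

ds = load_dataset("Zyphra/Zyda-2", split="train", streaming=True)

# Peek at the first few documents without materializing the full corpus.
for i, example in enumerate(ds):
    print(example["text"][:200])  # "text" field assumed from the card above
    if i == 2:
        break
```

Streaming mode is the practical way to sample a corpus of this size; a full download would be on the order of many terabytes.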
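To make the "cross-source deduplication" mentioned above concrete, here is a toy MinHash-LSH pass over documents tagged by source corpus. This is a generic illustration using the `datasketch` library, not Zyphra's actual NeMo Curator pipeline; the 3-gram shingles, 0.8 similarity threshold, and placeholder documents are arbitrary assumptions.

```python
# Toy cross-source fuzzy dedup: near-duplicates are dropped no matter
# which corpus they came from, first occurrence wins. NOT the Zyda-2
# pipeline -- just the general MinHash-LSH technique it is based on.
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128

def signature(text: str) -> MinHash:
    """MinHash signature over word 3-gram shingles of a document."""
    words = text.split()
    m = MinHash(num_perm=NUM_PERM)
    for i in range(max(len(words) - 2, 1)):
        m.update(" ".join(words[i:i + 3]).encode("utf-8"))
    return m

# Placeholder documents tagged by (hypothetical) source corpus.
docs = [
    ("fineweb-edu/0", "the quick brown fox jumps over the lazy dog"),
    ("dclm/0", "the quick brown fox jumps over the lazy dog today"),
    ("dolma/0", "an entirely different document about something else"),
]

lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)
kept = []
for key, text in docs:
    sig = signature(text)
    if lsh.query(sig):   # a near-duplicate from ANY source was already kept
        continue         # -> drop this copy (cross-source deduplication)
    lsh.insert(key, sig)
    kept.append(key)

print(kept)  # likely ['fineweb-edu/0', 'dolma/0']: the dclm copy matched
```

At Zyda-2 scale this runs distributed and GPU-accelerated inside NeMo Curator rather than in a single-process loop, but the core idea of hashing into a shared index across all source corpora is the same.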