Zyda-2
Open 5-trillion-token pretraining dataset built with NVIDIA NeMo Curator. Combines and re-curates several leading open corpora (FineWeb-Edu, DCLM, Dolma) with cross-source deduplication and quality filtering. Used by Zyphra to train Zamba2 and ZAYA1, and openly released for the broader community.
One of the few open pretraining corpora at the multi-trillion-token scale, alongside FineWeb, DCLM, and RedPajama. ~25k Hugging Face downloads as of May 2026.
Type: Dataset
Size: 5T tokens
Format: text
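
A minimal sketch of streaming the corpus from the Hugging Face Hub, for readers who want to inspect it without downloading all 5T tokens. The repo id `Zyphra/Zyda-2`, the `train` split, and the `text` field (inferred from "Format: text" above) are assumptions; check the dataset card for the exact config and subset names.

```python
# Sketch: stream a few documents from the Hub (assumed repo id and fields).
from datasets import load_dataset

ds = load_dataset("Zyphra/Zyda-2", split="train", streaming=True)

# Peek at the first few documents without materializing the full corpus.
for i, example in enumerate(ds):
    print(example["text"][:200])  # "text" field assumed from the card above
    if i == 2:
        break
```

Streaming mode is the practical way to sample a corpus of this size; a full download would be on the order of many terabytes.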
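To make the "cross-source deduplication" mentioned above concrete, here is a toy MinHash-LSH pass over documents tagged by source corpus. This is a generic illustration using the `datasketch` library, not Zyphra's actual NeMo Curator pipeline; the 3-gram shingles, 0.8 similarity threshold, and placeholder documents are arbitrary assumptions.

```python
# Toy cross-source fuzzy dedup: near-duplicates are dropped no matter
# which corpus they came from, first occurrence wins. NOT the Zyda-2
# pipeline -- just the general MinHash-LSH technique it is based on.
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128

def signature(text: str) -> MinHash:
    """MinHash signature over word 3-gram shingles of a document."""
    words = text.split()
    m = MinHash(num_perm=NUM_PERM)
    for i in range(max(len(words) - 2, 1)):
        m.update(" ".join(words[i:i + 3]).encode("utf-8"))
    return m

# Placeholder documents tagged by (hypothetical) source corpus.
docs = [
    ("fineweb-edu/0", "the quick brown fox jumps over the lazy dog"),
    ("dclm/0", "the quick brown fox jumps over the lazy dog today"),
    ("dolma/0", "an entirely different document about something else"),
]

lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)
kept = []
for key, text in docs:
    sig = signature(text)
    if lsh.query(sig):   # a near-duplicate from ANY source was already kept
        continue         # -> drop this copy (cross-source deduplication)
    lsh.insert(key, sig)
    kept.append(key)

print(kept)  # likely ['fineweb-edu/0', 'dolma/0']: the dclm copy matched
```

At Zyda-2 scale this runs distributed and GPU-accelerated inside NeMo Curator rather than in a single-process loop, but the core idea of hashing into a shared index across all source corpora is the same.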