Zonos2 (ZONOS2)
Zyphra's open-weight multilingual text-to-speech model and successor to Zonos. It uses an MoE backbone (16 experts, top-1 routing; 28 layers, GQA) that generates DAC audio tokens from UTF-8 byte input, with ECAPA-TDNN speaker embeddings for zero-shot voice cloning. Trained on 6M+ hours of multilingual speech and released under Apache-2.0, with a Mini-SGLang-based inference server for low-latency synthesis.
Language coverage comes in three tiers: Tier 1 covers English, Mandarin, and Japanese; Tier 2 adds Korean, Russian, Italian, Portuguese, French, Spanish, Vietnamese, German, Hebrew, and Dutch; Tier 3 adds a further ~23 languages. Zonos2 extends Zyphra's foundation-model line beyond language and EEG into speech generation.
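Since the model takes UTF-8 bytes as input, text in any of these languages maps to integers in the range 0..255 without a language-specific vocabulary. The sketch below shows only that raw byte mapping; the helper name `text_to_byte_ids` is hypothetical, and whether Zonos2 adds special tokens, offsets, or other preprocessing on top of the bytes is not stated here.

```python
def text_to_byte_ids(text: str) -> list[int]:
    """Encode text as its UTF-8 byte values (each in 0..255).

    Hypothetical helper for illustration; not Zonos2's actual front end.
    """
    return list(text.encode("utf-8"))


# One short sample per Tier 1 language.
samples = {
    "English": "hello",
    "Mandarin": "你好",
    "Japanese": "こんにちは",
}

for language, text in samples.items():
    ids = text_to_byte_ids(text)
    # ASCII characters take 1 byte each; these CJK/kana characters take 3.
    print(f"{language}: {len(text)} chars -> {len(ids)} bytes")
```

The input sequence length therefore depends on the script as well as on the character count: the two-character Mandarin sample takes six bytes, while the five-character English sample takes five.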
Model Details
Architecture MoE
Experts 16 (top-1)
License Apache 2.0
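As a rough illustration of what "16 (top-1)" means, the sketch below routes a single token to exactly one of 16 experts by taking the highest router score. It is a toy in plain Python, not Zonos2's implementation: the toy experts, the function names `top1_route` and `moe_layer`, and the choice not to scale the expert output by a gate weight are all assumptions made for illustration.

```python
import random

NUM_EXPERTS = 16  # expert count from the model details above


def top1_route(router_logits: list[float]) -> int:
    """Return the index of the expert with the highest router score."""
    return max(range(len(router_logits)), key=lambda i: router_logits[i])


def moe_layer(token: list[float], router_logits: list[float], experts):
    """Send the token to its single selected expert (top-1 routing).

    Assumption: the expert output is returned unscaled; the source does
    not say whether Zonos2 weights it by the router probability.
    """
    expert_id = top1_route(router_logits)
    return expert_id, experts[expert_id](token)


# Hypothetical toy experts: expert k multiplies every value by (k + 1).
experts = [lambda x, k=k: [v * (k + 1) for v in x] for k in range(NUM_EXPERTS)]

rng = random.Random(0)
token = [1.0, 2.0]
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

expert_id, output = moe_layer(token, router_logits, experts)
print(f"token routed to expert {expert_id}: {output}")
```

With top-1 routing only one expert runs per token, so per-token compute stays close to that of a single expert even though the layer holds 16 experts' worth of parameters.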