Speech model family. Voxtral Mini (3B) and Voxtral Small (24B) handle long-form audio understanding (audio up to 30-40 minutes long). Voxtral TTS (4B) adds zero-shot voice cloning. Natively multilingual. Apache 2.0.

Paper

arXiv: 2507.13264

audio, multimodal, open-weight