MAI Multimodal Stack (Transcribe / Voice / Image)
Microsoft's first major foundation model push: three MAI models released in Azure Foundry in April 2026, a direct move against its dependencies on OpenAI and Google, followed by a refresh of the whole stack at Build 2026 (June 2) alongside the new MAI-Thinking-1 reasoning model and MAI-Code-1-Flash coding model.
Original April release: MAI-Transcribe-1 (STT, 25 languages, #1 by FLEURS in 11 core languages, beating Whisper-large-v3 and Gemini 3.1 Flash-Lite); MAI-Voice-1 (TTS, 60 s of audio in <1 s on a single GPU); MAI-Image-2 (T2I, debuted #3 on Arena.ai).
Build 2026 refresh: MAI-Transcribe-1.5 expands to 43 languages and retains its #1 FLEURS spot (adds content biasing, $0.36/hour); MAI-Voice-2 brings voice cloning and voice prompting to 15+ languages ($22/M characters), with a faster MAI-Voice-2-Flash coming soon; MAI-Image-2.5 adds image-to-image editing and re-debuts at #3 on Arena.ai for image families ($1.75/M input tokens, $33/M image output), with a more efficient MAI-Image-2.5-Flash shipped the same day.
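The list prices quoted for the Build 2026 refresh can be turned into a rough cost estimate. The sketch below is illustrative only: the function names are hypothetical, the $0.36/hour figure is assumed to mean per hour of input audio, and "$33/M image output" is assumed to mean per million image output tokens. Actual billing units should be checked against the provider's pricing page.

```python
# Prices as quoted in the Build 2026 refresh summary (USD).
TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36        # MAI-Transcribe-1.5 (assumed: per hour of audio)
VOICE_USD_PER_MILLION_CHARS = 22.0          # MAI-Voice-2
IMAGE_USD_PER_MILLION_INPUT_TOKENS = 1.75   # MAI-Image-2.5
IMAGE_USD_PER_MILLION_OUTPUT_TOKENS = 33.0  # MAI-Image-2.5 (assumed: per million output tokens)


def transcribe_cost(audio_hours: float) -> float:
    """Estimated cost of transcribing `audio_hours` hours of audio."""
    return audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    """Estimated cost of synthesizing `characters` characters of text."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of an image workload, given input and output token counts."""
    return (
        input_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_INPUT_TOKENS
        + output_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_OUTPUT_TOKENS
    )


if __name__ == "__main__":
    # Hypothetical workloads, chosen only to show the arithmetic.
    print(f"100 h of audio:      ${transcribe_cost(100):.2f}")
    print(f"500k characters:     ${voice_cost(500_000):.2f}")
    print(f"1M in / 1M out toks: ${image_cost(1_000_000, 1_000_000):.2f}")
```

Under these assumptions, 100 hours of transcription comes to about $36 and 500,000 characters of speech to about $11.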
Marks Microsoft's transition from pure OpenAI dependency to building its own frontier multimodal stack under VP GenAI Mustafa Suleyman's MAI division. Proprietary, available via Microsoft Foundry plus the June 2026 distribution expansion to OpenRouter, Fireworks, and Baseten.
Model Details
Variants
| Name | Parameters | Notes |
|---|---|---|
| MAI-Transcribe-1 | — | Apr 2026; STT, 25 languages, #1 by FLEURS in 11 core languages |
| MAI-Voice-1 | — | Apr 2026; TTS, 60 s of audio in <1 s on single GPU |
| MAI-Image-2 | — | Apr 2026; T2I, debuted #3 on Arena.ai |
| MAI-Transcribe-1.5 | — | Jun 2026; STT, 43 languages, content biasing, retains #1 FLEURS spot |
| MAI-Voice-2 | — | Jun 2026; TTS, 15+ languages, voice cloning + voice prompting, $22/M chars |
| MAI-Voice-2-Flash | — | Jun 2026 announce, coming soon; efficient TTS variant |
| MAI-Image-2.5 | — | Jun 2026; T2I + image editing, #3 on Arena.ai |
| MAI-Image-2.5-Flash | — | Jun 2026; efficient T2I variant |