MAI Multimodal Stack (Transcribe / Voice / Image)

Microsoft's first major foundation-model push: three MAI models released in Azure Foundry in April 2026, a direct move against dependence on OpenAI and Google, followed by a refresh of the whole stack at Build 2026 (June 2) alongside the new MAI-Thinking-1 reasoning model and MAI-Code-1-Flash coding model.

Original April release: MAI-Transcribe-1 (STT, 25 languages, #1 by FLEURS in 11 core languages, beating Whisper-large-v3 and Gemini 3.1 Flash-Lite); MAI-Voice-1 (TTS, 60 s of audio in <1 s on a single GPU); MAI-Image-2 (T2I, debuted #3 on Arena.ai).

Build 2026 refresh: MAI-Transcribe-1.5 expands to 43 languages, adds content biasing, and retains its #1 FLEURS spot ($0.36/hour); MAI-Voice-2 brings voice cloning and voice prompting to 15+ languages ($22/M characters), with a faster MAI-Voice-2-Flash coming soon; MAI-Image-2.5 adds image-to-image editing and re-debuts at #3 on Arena.ai for image families ($1.75/M input tokens, $33/M image output), with a more efficient MAI-Image-2.5-Flash shipped the same day.
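
The list prices quoted above can be turned into a rough cost estimate. The following is a minimal sketch, not an official calculator: the function names are hypothetical, the rates are copied from the figures above, and it assumes linear billing with no minimums, tiers, or rounding. It also assumes that "$33/M image output" is metered per million output tokens, which the source does not spell out.

```python
from decimal import Decimal

# List prices in USD as quoted in the Build 2026 refresh summary.
TRANSCRIBE_USD_PER_HOUR = Decimal("0.36")       # MAI-Transcribe-1.5
VOICE_USD_PER_M_CHARS = Decimal("22")           # MAI-Voice-2
IMAGE_USD_PER_M_INPUT_TOKENS = Decimal("1.75")  # MAI-Image-2.5
# Assumption: "image output" is billed per million output tokens.
IMAGE_USD_PER_M_OUTPUT_TOKENS = Decimal("33")

MILLION = Decimal(1_000_000)
SECONDS_PER_HOUR = Decimal(3600)


def transcribe_cost(audio_seconds: int) -> Decimal:
    """Estimated transcription cost for a given audio duration."""
    return TRANSCRIBE_USD_PER_HOUR * Decimal(audio_seconds) / SECONDS_PER_HOUR


def voice_cost(characters: int) -> Decimal:
    """Estimated text-to-speech cost for a given number of input characters."""
    return VOICE_USD_PER_M_CHARS * Decimal(characters) / MILLION


def image_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimated image cost from input and (assumed) output token counts."""
    total = (
        IMAGE_USD_PER_M_INPUT_TOKENS * Decimal(input_tokens)
        + IMAGE_USD_PER_M_OUTPUT_TOKENS * Decimal(output_tokens)
    )
    return total / MILLION


if __name__ == "__main__":
    # Hypothetical workload: 10 hours of audio and 50,000 characters of speech.
    print(f"Transcribe: ${transcribe_cost(10 * 3600):.2f}")
    print(f"Voice:      ${voice_cost(50_000):.2f}")
```

Under these assumptions, ten hours of audio comes to $3.60 and 50,000 characters of synthesized speech to $1.10. `Decimal` is used instead of floats so that small per-unit rates do not accumulate binary rounding error.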

Marks Microsoft's transition from pure OpenAI dependency to building its own frontier multimodal stack under VP GenAI Mustafa Suleyman's MAI division. Proprietary; available via Microsoft Foundry and, since the June 2026 distribution expansion, via OpenRouter, Fireworks, and Baseten.

April 2026 announcement (v1 / v1 / v2)
June 2026 announcement (Build 2026, v1.5 / v2 / v2.5)
MAI-Transcribe-1 deep dive

Model Details

Variants

Name	Parameters	Notes
MAI-Transcribe-1	—	Apr 2026; STT, 25 languages, #1 by FLEURS in 11 core languages
MAI-Voice-1	—	Apr 2026; TTS, 60 s of audio in <1 s on single GPU
MAI-Image-2	—	Apr 2026; T2I, debuted #3 on Arena.ai
MAI-Transcribe-1.5	—	Jun 2026; STT, 43 languages, content biasing, retains #1 FLEURS spot, $0.36/hour
MAI-Voice-2	—	Jun 2026; TTS, 15+ languages, voice cloning + voice prompting, $22/M chars
MAI-Voice-2-Flash	—	Jun 2026 announce, coming soon; efficient TTS variant
MAI-Image-2.5	—	Jun 2026; T2I + image editing, #3 on Arena.ai for image families
MAI-Image-2.5-Flash	—	Jun 2026; efficient T2I variant

frontier, multimodal, speech, vision
