MAI Foundation Models
Microsoft's first major foundation model push: three MAI models released in Azure Foundry, a direct move against its dependence on OpenAI and Google. MAI-Transcribe-1 (speech-to-text, 25 languages) ranks #1 on the FLEURS benchmark in 11 core languages, beating Whisper-large-v3 and Gemini 3.1 Flash-Lite. MAI-Voice-1 (text-to-speech) generates 60 seconds of audio in under 1 second on a single GPU. MAI-Image-2 (text-to-image) debuted at #3 on Arena.ai's image leaderboard.
The release marks Microsoft's transition from relying solely on OpenAI to building its own frontier multimodal models in-house, within the MAI division led by Microsoft AI CEO Mustafa Suleyman. The models are proprietary and available via Azure Foundry.
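The MAI-Voice-1 speed claim (60 seconds of audio in under 1 second) can be restated as a real-time factor, a common way to compare speech systems. The sketch below only does that arithmetic. The function names are illustrative and not part of any MAI or Azure API, and the 1-second figure is used as the upper bound stated above, not as a measured value.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Compute time spent per second of audio produced (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup_over_real_time(processing_seconds: float, audio_seconds: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Bound implied by the claim: 60 s of audio in at most 1 s of compute.
rtf_bound = real_time_factor(1.0, 60.0)
speedup_bound = speedup_over_real_time(1.0, 60.0)

print(f"real-time factor below {rtf_bound:.4f}")       # real-time factor below 0.0167
print(f"more than {speedup_bound:.0f}x real time")     # more than 60x real time
```

Because the source gives "under 1 second", both numbers are bounds: the actual real-time factor is lower and the actual speedup higher.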
Model Details
Variants
| Name | Parameters | Notes |
|---|---|---|
| MAI-Transcribe-1 | — | Speech-to-text, 25 languages, beats Whisper-large-v3 |
| MAI-Voice-1 | — | TTS, 60s audio in <1s on single GPU |
| MAI-Image-2 | — | Text-to-image, debuted #3 on Arena.ai image leaderboard |