Microsoft has released MAI-Transcribe-1, a speech-to-text model that achieves the lowest Word Error Rate on the FLEURS benchmark across the top 25 languages by Microsoft product usage — beating OpenAI’s Whisper-large-v3 on all 25 languages and Google’s Gemini 3.1 Flash on 22 of 25. The model averages a 3.8 percent WER across those 25 languages.
The release, announced Monday, is the latest step in Microsoft’s in-house AI build-out. Over the last twelve months the company has accelerated its own model programme to reduce dependence on OpenAI’s API for first-party products, despite having committed more than US$13 billion to its OpenAI partnership.
Benchmark Detail
FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) is a speech benchmark covering 102 languages. The metric reported here is Word Error Rate (WER), the share of words a transcript gets wrong relative to a reference transcript. Microsoft’s publication covers the top 25 languages by volume in Teams, Outlook, and Azure dictation use: English, Mandarin, Spanish, Arabic, French, Portuguese, German, Hindi, Japanese, Russian, Korean, Italian, Turkish, Polish, Dutch, Vietnamese, Indonesian, Thai, Swedish, Greek, Czech, Finnish, Hebrew, Hungarian, and Romanian.
Average WERs (lower is better):
- MAI-Transcribe-1: 3.8%
- Gemini 3.1 Flash: 4.6%
- Whisper-large-v3: 6.1%
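For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between a model’s transcript and a reference transcript, divided by the number of reference words. The sketch below illustrates that textbook definition only; it is not Microsoft’s evaluation code. The function names and the whitespace tokenisation are assumptions made for illustration, and whitespace splitting in particular does not suit languages written without spaces between words. The second helper applies simple arithmetic to the averages listed above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: words are separated by whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction by which `candidate` lowers the error rate versus `baseline`."""
    return (baseline - candidate) / baseline


if __name__ == "__main__":
    # One wrong word out of five reference words.
    print(word_error_rate("the quick brown fox jumps",
                          "the quick brown box jumps"))
    # Average WERs as reported in the article, in percent.
    print(round(relative_reduction(4.6, 3.8) * 100, 1))  # vs Gemini 3.1 Flash
    print(round(relative_reduction(6.1, 3.8) * 100, 1))  # vs Whisper-large-v3
```

On the reported averages, a 3.8% WER works out to roughly 17% fewer word errors than 4.6% and roughly 38% fewer than 6.1%, which is a more intuitive way to read the gap than the raw percentage-point differences.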
Where It Beats Gemini
Microsoft’s paper credits MAI-Transcribe-1’s lead in 22 of 25 languages to:
- An expanded low-resource language corpus (especially Swahili, Vietnamese, Arabic dialects)
- Acoustic robustness training on noisy real-world Teams call audio
- A native punctuation + capitalisation decoder that reduces the post-processing cleanup that hurts WER in other models
The three languages where Gemini 3.1 Flash still leads are Mandarin, Japanese, and Korean, where the strength of Google’s own training corpus is hard to beat.
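The point about punctuation and capitalisation is easier to see with a toy example. If a transcript is scored against a formatted reference without any normalisation, every casing or punctuation mismatch counts as a word error. The sketch below is purely illustrative: the article does not say how Microsoft normalises text before scoring, and the `normalize` helper, the sample sentences, and the whitespace tokenisation are all assumptions.

```python
import string


def normalize(text: str) -> str:
    """Lowercase and strip ASCII punctuation (a deliberately simple normaliser)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sample: the words are right, but the output is unformatted.
reference = "Hello, team. Let's start."
hypothesis = "hello team lets start"

raw_score = wer(reference, hypothesis)                              # formatting counted
normalized_score = wer(normalize(reference), normalize(hypothesis))  # formatting ignored

if __name__ == "__main__":
    print(raw_score, normalized_score)
```

In this contrived case all four tokens mismatch before normalisation and none after, which is why formatting behaviour and scoring conventions need to be stated alongside any WER figure.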
Product Wiring
MAI-Transcribe-1 is being rolled into:
- Microsoft Teams live captions and meeting transcripts
- Outlook dictation
- Azure Speech-to-Text API
- Copilot voice features across Windows 11
Azure pricing is set well below that of Whisper-large-v3, and Microsoft explicitly frames the model as an enterprise-grade upgrade path.
The Bigger Picture
Microsoft’s MAI family now includes:
- MAI-1 (general frontier reasoning)
- MAI-Voice-1 (speech generation)
- MAI-Transcribe-1 (speech recognition, announced Monday)
The company has not disclosed parameter counts for the new model, but has said on earnings calls that “frontier-class capability at one-tenth the compute” is the design target across the family. Taken together, the MAI models show Microsoft reducing its reliance on any single AI supplier across its product line.
Source: Microsoft Research Blog / VentureBeat