Microsoft Drops MAI-Transcribe-1 — Beats OpenAI Whisper on All 25 Languages, Gemini Flash on 22 of 25

Microsoft has released MAI-Transcribe-1, a speech-to-text model that posts the lowest average Word Error Rate (WER) on the FLEURS benchmark across the top 25 languages by Microsoft product usage. It beats OpenAI’s Whisper-large-v3 on all 25 languages and Google’s Gemini 3.1 Flash on 22 of 25, averaging a 3.8 percent WER across the set.

The release, announced Monday, is the latest step in Microsoft’s in-house AI build-out. Over the last twelve months the company has steadily accelerated its own model programme to reduce its first-party products’ dependence on OpenAI’s API, despite Microsoft’s investment of more than US$13 billion in OpenAI.

Benchmark Detail

FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) is a speech benchmark covering 102 languages, on which transcription systems are scored by WER. Microsoft’s publication covers the top 25 by volume in Teams, Outlook, and Azure dictation use: English, Mandarin, Spanish, Arabic, French, Portuguese, German, Hindi, Japanese, Russian, Korean, Italian, Turkish, Polish, Dutch, Vietnamese, Indonesian, Thai, Swedish, Greek, Czech, Finnish, Hebrew, Hungarian, and Romanian.
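
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between a reference transcript and the model’s output, divided by the number of words in the reference. The Python sketch below illustrates that textbook definition only. The function name and the whitespace tokenisation are assumptions made for illustration, not details of Microsoft’s evaluation setup, and whitespace splitting does not suit languages written without spaces between words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
example = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"{example:.1%}")  # prints 16.7%
```

On this definition, a 3.8 percent WER corresponds to roughly four word-level errors per hundred reference words.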

Average WERs (lower is better):

  • MAI-Transcribe-1: 3.8%
  • Gemini 3.1 Flash: 4.6%
  • Whisper-large-v3: 6.1%
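
To put those averages in perspective, the gaps can be restated as relative error reductions. The short Python sketch below uses only the three figures listed above; the helper name `relative_reduction` is ours, not a term from Microsoft’s publication.

```python
# Average WER figures (in percent) as reported above.
average_wer = {
    "MAI-Transcribe-1": 3.8,
    "Gemini 3.1 Flash": 4.6,
    "Whisper-large-v3": 6.1,
}


def relative_reduction(baseline: float, candidate: float) -> float:
    """Percentage by which `candidate` lowers the error rate of `baseline`."""
    return (baseline - candidate) / baseline * 100


mai = average_wer["MAI-Transcribe-1"]
vs_gemini = relative_reduction(average_wer["Gemini 3.1 Flash"], mai)
vs_whisper = relative_reduction(average_wer["Whisper-large-v3"], mai)

print(round(vs_gemini, 1))   # prints 17.4
print(round(vs_whisper, 1))  # prints 37.7
```

In other words, on these reported averages MAI-Transcribe-1 makes roughly 17 percent fewer word errors than Gemini 3.1 Flash and roughly 38 percent fewer than Whisper-large-v3.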

Where It Beats Gemini

Microsoft’s paper credits MAI-Transcribe-1’s lead in 22 of 25 languages to:

  1. An expanded low-resource language corpus (especially Swahili, Vietnamese, Arabic dialects)
  2. Acoustic robustness training on noisy real-world Teams call audio
  3. A native punctuation and capitalisation decoder, which reduces the post-processing cleanup that hurts WER in other models
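
The third point depends on how transcripts are scored, which the article does not describe. As a generic illustration (not Microsoft’s evaluation pipeline), the Python sketch below shows how casing and punctuation mismatches alone can inflate WER when raw text is compared, and how they disappear after a simple normalisation step. The `wer` and `normalize` helpers and the sample sentences are hypothetical.

```python
import string


def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def normalize(text: str) -> str:
    """Lower-case the text and strip ASCII punctuation (a deliberately crude normaliser)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


reference = "Hello, Dr. Smith. How are you?"
hypothesis = "hello dr smith how are you"  # every word is right, formatting is missing

raw_wer = wer(reference, hypothesis)
normalized_wer = wer(normalize(reference), normalize(hypothesis))

print(f"raw: {raw_wer:.1%}")                # prints raw: 83.3%
print(f"normalized: {normalized_wer:.1%}")  # prints normalized: 0.0%
```

A model that emits punctuation and casing itself is not exposed to this kind of mismatch when formatted text is scored directly, which is one plausible reading of the claim.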

The three languages where Gemini 3.1 Flash still leads are Mandarin, Japanese, and Korean; the strength of Google’s own training corpus in these languages is hard to beat.

Product Wiring

MAI-Transcribe-1 is being rolled into:

  • Microsoft Teams live captions and meeting transcripts
  • Outlook dictation
  • Azure Speech-to-Text API
  • Copilot voice features across Windows 11

Azure pricing is set aggressively below that of Whisper-large-v3, and the model is explicitly framed as an enterprise-grade upgrade path.

The Bigger Picture

Microsoft’s MAI family now includes:

  • MAI-1 (general frontier reasoning)
  • MAI-Voice-1 (speech generation)
  • MAI-Transcribe-1 (speech recognition, the new release)

The company has not disclosed parameter counts for the new model, but has said on earnings calls that “frontier-class capability at one-tenth the compute” is the design target across the family. The release shows Microsoft reducing its reliance on a single AI supplier at industrial scale.

Source: Microsoft Research Blog / VentureBeat
