Microsoft used its Build 2026 keynote on June 2 to unveil a refreshed MAI media stack for creators, pairing the already-trending MAI-Image-2.5 image model with two new audio releases: MAI-Voice-2 for multilingual text-to-speech and MAI-Transcribe-1.5 for speech recognition in 25 languages. All three will roll into MAI Playground and Microsoft Foundry over the next two weeks.

What This Enables

Microsoft is consolidating its media generation portfolio into a single suite that Foundry developers can call from one API surface. MAI-Image-2.5 now accepts image uploads, opening the model to editing flows rather than text-only generation, so a creator can iterate on an existing asset instead of rerolling from scratch. MAI-Voice-2 adds emotional tone variations across 14-plus languages, which closes the expressiveness gap that pushed many studios to ElevenLabs for narration work. MAI-Transcribe-1.5 posts a 3.9 percent word error rate (WER) across 25 languages, putting it in the same accuracy band as Whisper Large for localization pipelines.

Why It Matters

Microsoft's image model already ranks third on the Artificial Analysis text-to-image Arena with a score of 1,254, a 72-point jump over MAI-Image-2. By bundling image editing, voice, and transcription into one Foundry release, Microsoft is going after the same omnimodal creator surface that Google now ships through Vertex AI. Pricing has not been disclosed, but the Foundry rollout signals enterprise availability without a separate Azure OpenAI dependency.

Key Details

The full lineup, drawn from Microsoft's own announcement and third-party Build coverage:

  • MAI-Image-2.5: standard plus efficient variants, image-to-image editing, text rendering improvements, ranked third on the image Arena leaderboard.
  • MAI-Voice-2: 14-plus languages, emotional tone control, designed for narration and dialog.
  • MAI-Transcribe-1.5: 25 languages, 3.9 percent word error rate (WER), incremental upgrade over the previous Transcribe release.
  • Distribution: MAI-Image-2.5 on Arena today; all three models in MAI Playground and Microsoft Foundry within two weeks.

What to Do Next

Try MAI-Image-2.5 directly on Arena to benchmark it against your current image generation stack on the prompts that matter for your work. If your pipeline already runs on Azure or Foundry, prepare Voice-2 and Transcribe-1.5 evaluations against your existing TTS and ASR vendors now, so you have a comparison ready when the models reach Foundry within two weeks.
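
For the ASR half of that comparison, the usual yardstick is the same word error rate Microsoft quotes: word-level substitutions, deletions, and insertions divided by the number of reference words. The sketch below is a minimal, vendor-neutral way to score transcripts you already have on hand. It does not call any MAI or Foundry API, and the function names (word_edit_distance, corpus_wer) and the lowercase-and-split normalization are illustrative assumptions, not Microsoft's published scoring method.

```python
from typing import Iterable, Sequence, Tuple


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER over (reference, hypothesis) transcript pairs.

    Normalization here is only lowercasing and whitespace splitting, which is
    an assumption; published WER figures often apply heavier normalization.
    """
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors += word_edit_distance(ref, hyp)
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("reference transcripts contain no words")
    return errors / ref_words


if __name__ == "__main__":
    # Hypothetical sample: one deleted word out of four reference words.
    sample = [("the quick brown fox", "the quick fox")]
    print(f"WER: {corpus_wer(sample):.2%}")
```

Run each vendor's output through the same function with the same reference transcripts and the same normalization; differences in text normalization alone can shift WER by more than the gap between competing models.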