Microsoft used the Build 2026 keynote on June 2 to ship four MAI models into Azure AI Foundry at once: MAI-Image-2.5 and a Flash variant for text-to-image and image-to-image, MAI-Voice-2 for voice cloning in 15-plus languages, MAI-Transcribe-1.5 for 43-language speech-to-text, and the private-preview MAI-Thinking-1 reasoning model. For creators already paying Adobe, OpenAI, or ElevenLabs as separate line items, the new question is whether a single Foundry contract beats the multi-vendor stack on price, output quality, and integration friction. We compared the publicly disclosed numbers and Microsoft's head-to-head benchmark claims against the current rivals across image, voice, and transcription, plus what the bundle costs once Copilot and PowerPoint are already paid for.

Quick Picks

Pick MAI-Image-2.5 if your team lives in PowerPoint, OneDrive, or Copilot and you need brand-character identity preservation on edits without a separate image-API contract. The Flash variant prices below the existing Foundry rate card and lands the model in the same tenant your governance team already approved.

Pick Nano Banana 2 if you are deep in Google Workspace, Vertex AI, or Gemini-powered Slides and your priority is image quality at the top of the Arena leaderboard. The launch deep dive covers Vertex AI GA terms in detail.

Figure: Arena leaderboard comparison chart showing MAI-Image-2.5 at No. 3 for text-to-image and No. 2 for image-to-image versus Nano Banana 2, GPT-Image-2, FLUX.2, and Ideogram 4.

Pick FLUX.2 or Ideogram 4 if open weights, ComfyUI control, or on-device generation matters more than a hosted API. FLUX.2 Klein runs locally on ProArt hardware; Ideogram 4 ships with open weights for ComfyUI pipelines.

Image Generation: MAI-Image-2.5 vs the Field

Arena leaderboard positions reported by Microsoft at Build 2026. Independent ranks change weekly; verify on the live leaderboard.

Microsoft says MAI-Image-2.5 debuted at No. 3 for text-to-image and No. 2 for image-to-image on the Arena leaderboard, with a specific claim that it beats Google's Nano Banana 2 on the edit task. The keynote highlighted identity preservation as the lead feature: brand characters, recognizable faces, and full-body shots survive style, pose, and layout changes through prompts alone. That positioning is aimed directly at the workflow where Nano Banana 2 has been the rival to beat in the six weeks since its Vertex AI GA.

The Flash variant is the line that matters for production batches. At $1.75 per million input tokens and $33 per million image-output tokens per the Microsoft AI keynote transcript, it sits well below the full MAI-Image-2.5 rate of $5 per million input tokens, $8 per million image-input tokens, and $47 per million image-output tokens. A studio that runs a thousand-image batch through Flash on Foundry pays a small fraction of the per-image price most hosted APIs quote today, and the per-token billing means the cost scales with prompt length and resolution rather than a flat per-image fee.
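To make the per-token billing concrete, here is a minimal sketch that converts the quoted rates into a per-image price. The rates come from the figures above; the prompt length and the image-output token counts are illustrative assumptions, since Microsoft's tokens-per-image figures are not given here, and the full model's $8 image-input rate is left out because it applies only to image-to-image calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageRates:
    """USD per one million tokens, taken from the keynote figures quoted above."""
    text_in: float
    image_out: float


FLASH = ImageRates(text_in=1.75, image_out=33.0)
FULL = ImageRates(text_in=5.00, image_out=47.0)


def cost_per_image(rates: ImageRates, prompt_tokens: int, image_out_tokens: int) -> float:
    """Return the USD cost of one text-to-image generation under per-token billing."""
    total = prompt_tokens * rates.text_in + image_out_tokens * rates.image_out
    return total / 1_000_000


if __name__ == "__main__":
    PROMPT_TOKENS = 200  # assumption: a short text prompt
    # Assumption: output tokens grow with resolution; these three sizes are hypothetical.
    for out_tokens in (2_000, 6_000, 12_000):
        flash = cost_per_image(FLASH, PROMPT_TOKENS, out_tokens)
        full = cost_per_image(FULL, PROMPT_TOKENS, out_tokens)
        print(
            f"{out_tokens:>6} output tokens: "
            f"Flash ${flash:.4f}/image (${flash * 1000:,.2f} per 1,000 images), "
            f"full ${full:.4f}/image"
        )
```

Output tokens dominate the result at any realistic prompt length, so the tokens-per-image figure for a given resolution is the number to pin down before comparing against a flat per-image quote.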

| Model | Arena Rank (T2I / I2I) | Edit Strength | Pricing (USD per 1M tokens unless noted) | Open Weights |
| --- | --- | --- | --- | --- |
| MAI-Image-2.5 | 3 / 2 | Beats Nano Banana 2 on edit (per Microsoft) | $5 in / $8 img-in / $47 img-out | No |
| MAI-Image-2.5 Flash | Same family, batch tier | Identity preservation across style and pose | $1.75 in / $33 img-out | No |
| Nano Banana 2 | Previously No. 1 T2I | Strong on photoreal edits and character lock | Per-image on Vertex AI | No |
| FLUX.2 Pro | Arena top 5 | Inpainting and Klein on-device variant | Hosted plus open Klein tier | Klein only |
| Ideogram 4 | Arena top 5 | Best-in-class text rendering | Hosted plus ComfyUI open weights | Yes |

The honest read on the leaderboard claim: Arena rankings rotate weekly, and Microsoft's No. 3 / No. 2 positions were the snapshot at the keynote. What is more durable is the edit-task framing, because that is where Nano Banana 2 made its name and where MAI-Image-2.5 has direct receipts in the form of a head-to-head benchmark. For creators whose deliverables are decks and campaign assets rather than gallery pieces, the edit category is the only category that matters.

Figure: Voice model price comparison chart showing MAI-Voice-2 at $22 per million characters versus ElevenLabs and OpenAI TTS, with 15-language coverage breakdown.

Voice Cloning: MAI-Voice-2 vs ElevenLabs and OpenAI

Voice cloning price comparison. ElevenLabs and OpenAI TTS pricing depends on tier and character-versus-credit accounting; verify against current rate cards.

MAI-Voice-2 is priced at $22 per million characters with 15-plus language support, voice cloning from a reference sample, and a voice-prompting feature that lets creators steer delivery through prompts rather than waveform edits. The Microsoft positioning is that the model already powers Copilot and Bing voice output, so any studio that has tuned a Copilot agent has heard MAI-Voice-2 without realizing it.
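As a rough illustration of what per-character pricing means for a single script, the sketch below applies the quoted $22 per million characters. It assumes every character, including spaces and punctuation, is billable, which the announcement figures above do not confirm; the function names are hypothetical.

```python
VOICE_RATE_USD_PER_MILLION_CHARS = 22.0  # MAI-Voice-2 rate quoted above


def voice_cost_usd(characters: int) -> float:
    """Cost of synthesizing `characters` characters at the quoted rate."""
    if characters < 0:
        raise ValueError("characters must be non-negative")
    return characters * VOICE_RATE_USD_PER_MILLION_CHARS / 1_000_000


def script_cost_usd(script: str) -> float:
    """Cost of a script, assuming spaces and punctuation are billed like letters."""
    return voice_cost_usd(len(script))


if __name__ == "__main__":
    # Assumption: a long-form narration script of about 60,000 characters.
    print(f"60,000-character script: ${voice_cost_usd(60_000):.2f}")
    print(f"One-line prompt: ${script_cost_usd('Welcome back to the show.'):.6f}")
```

Comparing this against a credit-based plan requires converting that plan's credits to characters first, which depends on the tier.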

The rival that matters here is ElevenLabs, whose Dubbing v2 launched with 90-language coverage and emotion-preserving prosody. ElevenLabs's headline edge is language breadth and the multilingual emotion model, while MAI-Voice-2's edge is the price-and-integration combination: $22 per million characters is roughly half what a creator pays on the standard ElevenLabs creator tier once character-versus-credit accounting is factored in, and Foundry billing rolls into an existing Microsoft 365 invoice.

OpenAI's text-to-speech API sits in a different category: fewer voices, no cloning by default, and a focus on natural-sounding stock voices for product UX rather than character work. For podcasters and dub teams the choice is MAI-Voice-2 versus ElevenLabs; for product voice and notification UI, it is MAI-Voice-2 versus OpenAI TTS.

Transcription: MAI-Transcribe-1.5 vs Whisper and Deepgram

FLEURS word error rate as reported by Microsoft at Build 2026, with current rivals for comparison. Verify on the live FLEURS leaderboard for the day of evaluation.

MAI-Transcribe-1.5 holds the No. 1 spot on the FLEURS benchmark with a 3.7 percent Word Error Rate across 43 languages per the Foundry announcement, and Microsoft prices it at $0.36 per audio hour. That is the headline number creators should benchmark against: most podcast editors today pay between $0.50 and $1.20 per audio hour through Whisper-based services or Deepgram, and the FLEURS-leading WER means the cleanup edit pass shrinks. A weekly podcast that runs 90 minutes of raw audio pays roughly $0.54 per episode for transcription on MAI-Transcribe-1.5, billed inside Foundry.
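The per-episode figure above can be reproduced with a few lines, along with a rough estimate of how many word errors a 3.7 percent WER implies. This is a sketch under stated assumptions: the 150 words-per-minute speaking rate is a guess, and a benchmark WER measured on FLEURS may not carry over to real podcast audio.

```python
TRANSCRIBE_RATE_USD_PER_AUDIO_HOUR = 0.36  # rate quoted above
REPORTED_FLEURS_WER = 0.037                # 3.7 percent, as reported by Microsoft


def transcription_cost_usd(audio_minutes: float) -> float:
    """Cost of transcribing `audio_minutes` of audio at the quoted hourly rate."""
    return audio_minutes / 60 * TRANSCRIBE_RATE_USD_PER_AUDIO_HOUR


def expected_word_errors(
    audio_minutes: float,
    words_per_minute: float = 150.0,  # assumption: typical conversational pace
    wer: float = REPORTED_FLEURS_WER,
) -> float:
    """Rough count of words needing correction if the benchmark WER held on this audio."""
    return audio_minutes * words_per_minute * wer


if __name__ == "__main__":
    episode_minutes = 90
    per_episode = transcription_cost_usd(episode_minutes)
    print(f"Per episode: ${per_episode:.2f}")
    print(f"Per year (52 episodes): ${per_episode * 52:.2f}")
    print(f"Estimated word errors per episode: {expected_word_errors(episode_minutes):.0f}")
```

At these prices the transcription fee is small next to editor time, so the error count is the figure that drives the real cost of the cleanup pass.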

The model adds content biasing, which is the feature that closes the last accuracy gap for creators with technical vocabulary, brand names, or non-English place names. Whisper Large v3 supports custom biasing only through wrapper services; MAI-Transcribe-1.5 takes a bias list as part of the API call. For studios whose workflow includes a transcript-then-script step, that is the integration that removes an entire correction pass.
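To show the kind of correction pass an in-call bias list is meant to replace, here is a minimal sketch of a post-transcription vocabulary snap built on Python's standard `difflib`. It is not how MAI-Transcribe-1.5 implements biasing, and it does not use any Foundry API; the function name, the cutoff, and the sample transcript are all hypothetical.

```python
import difflib


def snap_to_vocabulary(transcript: str, bias_terms: list[str], cutoff: float = 0.85) -> str:
    """Replace near-miss words with the closest term from a known vocabulary.

    Matching is case-insensitive and word-by-word, so multi-word terms and
    words with attached punctuation are not handled in this sketch.
    """
    lookup = {term.lower(): term for term in bias_terms}
    corrected = []
    for word in transcript.split():
        match = difflib.get_close_matches(word.lower(), list(lookup), n=1, cutoff=cutoff)
        corrected.append(lookup[match[0]] if match else word)
    return " ".join(corrected)


if __name__ == "__main__":
    raw = "we deployed on kubernetis using foundry"
    print(snap_to_vocabulary(raw, ["Kubernetes", "Foundry"]))
```

A pass like this only repairs words that are already close to the target spelling; biasing at recognition time can, in principle, recover terms the recognizer would otherwise mishear entirely, which is why it removes more of the manual work.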

Figure: Transcription benchmark comparison showing MAI-Transcribe-1.5 at FLEURS No. 1 with 3.7 percent Word Error Rate against Whisper Large v3 and Deepgram across 43 languages.

Reasoning Model: MAI-Thinking-1 vs Claude Opus 4.6

MAI-Thinking-1 is in private preview. Pricing has not been disclosed; the benchmark claim is from Microsoft's keynote.

MAI-Thinking-1 is the surprise of the launch: a mixture-of-experts reasoning model in private preview that Microsoft claims matches Claude Opus 4.6 on SWE-Bench Pro at substantially lower cost. The private-preview status means pricing is not yet disclosed, so the value claim cannot be verified until the model lands in general availability. What can be reasoned about is the architectural choice: a mixture-of-experts model routes each token to a small subset of its experts, trading a larger total parameter count for lower compute per token, and the SWE-Bench Pro framing tells engineering teams Microsoft is positioning this as a Copilot-grade coding model rather than a generalist assistant.

For creators specifically, the more relevant Thinking-1 question is whether it improves prompt rewriting and instruction expansion inside MAI-Image-2.5 and MAI-Voice-2 the way LLM-based prompt rewriting improved DALL-E 3 output. Microsoft did not detail that interaction in the keynote, but the bundling pattern suggests it is the direction.

When Each One Wins

The MAI lineup on Foundry wins when the studio is already paying for Microsoft 365, the deliverables are PowerPoint decks or Bing-distributed content, and identity preservation on edits matters more than gallery aesthetics. The bundled-tenant story is the strongest argument: one invoice, one governance review, one identity provider, and a Flash tier that prices below the existing Foundry rate card. Switching cost from a multi-vendor stack to Foundry is the deck-template refactor, not a workflow rebuild.

Google Vertex AI wins when the team lives in Workspace, Slides, and Looker, and Nano Banana 2's text-to-image lead on Arena matters for top-of-funnel marketing creative. Nano Banana 2 still wins on the absolute leaderboard; MAI-Image-2.5 wins on the edit subtask. The choice depends on whether the workflow is generation-first or edit-first.

BFL and Ideogram win when open weights, local control, and ComfyUI-native pipelines are the requirements. FLUX.2 Klein and Ideogram 4's open weights mean the model lives on the studio's GPUs, costs little beyond power and upkeep per image once the hardware amortizes, and stays inside the firewall. For a studio that has hit the API-budget ceiling, the open-weights path can beat every hosted option on per-image cost.

Pricing and ROI

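The Flash variant is the line that changes the bundled-stack math, but the result depends heavily on how many image-output tokens each generation consumes. At the quoted Flash rates, a studio running 50,000 image generations per month at an average of 200 tokens in and 12,000 tokens out per image lands at roughly $19,800 per month, almost all of it output tokens; the bill drops to roughly $4,000 only if the average image comes in near 2,400 output tokens. On the per-image API tier that most studios currently pay, the comparable spend often clears $6,000 to $8,000, so Flash wins that comparison at lower token counts per image and loses it at higher ones. Either way the spend is billed inside the same Foundry contract that handles GPT-class models. Transcription at $0.36 per audio hour and voice cloning at $22 per million characters are small by comparison, and a mid-sized podcast-plus-deck studio that keeps output tokens per image low plausibly saves a four-figure monthly line item by consolidating to Foundry.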
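Because image-output tokens dominate the bill, the monthly total is very sensitive to the tokens-per-image assumption. The sketch below uses the Flash, transcription, and voice rates quoted in this article to total a month and to back out how many output tokens per image fit a given budget. The audio-hour and voice-character volumes are illustrative assumptions, and the function names are hypothetical.

```python
FLASH_TEXT_IN_USD_PER_M = 1.75
FLASH_IMAGE_OUT_USD_PER_M = 33.0
TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36
VOICE_USD_PER_M_CHARS = 22.0


def monthly_image_cost(images: int, prompt_tokens: int, image_out_tokens: int) -> float:
    """USD per month for `images` Flash generations under per-token billing."""
    per_image = (
        prompt_tokens * FLASH_TEXT_IN_USD_PER_M
        + image_out_tokens * FLASH_IMAGE_OUT_USD_PER_M
    ) / 1_000_000
    return images * per_image


def image_out_tokens_for_budget(budget_usd: float, images: int, prompt_tokens: int) -> float:
    """Average output tokens per image that exactly spends `budget_usd` per month."""
    per_image_budget = budget_usd / images
    input_cost = prompt_tokens * FLASH_TEXT_IN_USD_PER_M / 1_000_000
    return (per_image_budget - input_cost) * 1_000_000 / FLASH_IMAGE_OUT_USD_PER_M


def monthly_bundle_cost(images, prompt_tokens, image_out_tokens, audio_hours, voice_chars):
    """Break the monthly Foundry bill into its three line items (USD)."""
    return {
        "images": monthly_image_cost(images, prompt_tokens, image_out_tokens),
        "transcription": audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR,
        "voice": voice_chars * VOICE_USD_PER_M_CHARS / 1_000_000,
    }


if __name__ == "__main__":
    # Image workload from the paragraph above; audio and voice volumes are assumptions.
    bill = monthly_bundle_cost(50_000, 200, 12_000, audio_hours=6, voice_chars=500_000)
    for item, usd in bill.items():
        print(f"{item:>13}: ${usd:,.2f}")
    print(f"{'total':>13}: ${sum(bill.values()):,.2f}")
    tokens = image_out_tokens_for_budget(4_000, 50_000, 200)
    print(f"Output tokens per image that fit a $4,000 budget: {tokens:,.0f}")
```

Running the same function over a studio's real resolution mix is the quickest way to see which side of a per-image quote the Flash tier lands on.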

Figure: MAI-Thinking-1 versus Claude Opus 4.6 comparison on SWE-Bench Pro showing matched accuracy at substantially lower cost in the private preview tier.

The catch is lock-in: once governance, identity, and billing all run through Foundry, the switching cost back to a multi-vendor stack rises sharply. The honest tradeoff is Foundry's per-tenant convenience against the optionality of buying best-of-breed from each rival, which is what teams using ElevenLabs plus Nano Banana 2 plus Whisper-via-wrapper are doing today.

Verdict

For creators inside the Microsoft 365 stack, the MAI lineup on Foundry is the new default and the Flash variant is the line that justifies the consolidation. For Workspace-first teams, Nano Banana 2 still wins the absolute image quality benchmark and Google's bundling story matches Microsoft's beat for beat. For studios that need open weights or local control, BFL FLUX.2 and Ideogram 4 stay the right answer. The category to watch is reasoning: if MAI-Thinking-1 lands at GA with a price that backs up the SWE-Bench Pro claim, the Foundry bundle moves from convenience to competitive. The launch is paired with the same-keynote Windows Agent Framework rollout, which means the agentic-stack story is the medium-term play Microsoft is positioning these models to support.

Frequently Asked Questions

Is MAI-Image-2.5 available outside Azure AI Foundry?

The model powers Copilot, Bing, PowerPoint, and OneDrive image features for end users, but developer API access is through the Foundry Model Catalog. There is no standalone hosted endpoint outside Foundry as of the June 2 launch.

How does MAI-Image-2.5 Flash differ from the full model?

Flash is the production-batch tier at $1.75 per million input tokens and $33 per million image-output tokens, versus $5 input and $47 image output on the full model. The full model targets identity preservation on hero assets; Flash targets the long tail of batch generations where the cost-per-image matters more than maximum fidelity.

Can MAI-Voice-2 clone any voice with a reference sample?

Microsoft has not published the consent and provenance terms for arbitrary reference samples. The model supports cloning with consent in 15-plus languages at $22 per million characters; production use should be paired with an enterprise consent workflow.

Does MAI-Transcribe-1.5 work for non-English podcasts?

The model covers 43 languages and ranks first on the FLEURS multilingual benchmark with a 3.7 percent average Word Error Rate. Content biasing is supported as part of the API call, so brand names and technical terms can be added to the recognition vocabulary.

When will MAI-Thinking-1 leave private preview?

Microsoft has not announced a general-availability date. Private preview access is by request through the Foundry team. Until pricing is disclosed, the cost claim against Claude Opus 4.6 cannot be independently verified.