Xiaomi released MiMo-V2.5 on April 22, a complete open-source voice pipeline pairing an 8-billion-parameter speech recognizer with three text-to-speech models. The ASR model is available on GitHub and HuggingFace, while the TTS models are accessible through Xiaomi's MiMo Open Platform.
For the broader landscape, see our complete producer guide to AI music and audio in 2026.
What Happened
MiMo-V2.5 ships as two complementary systems. The ASR side is a single 8B-parameter end-to-end model that handles speech recognition across Mandarin, English, and multiple Chinese dialects, including Wu, Cantonese, Hokkien, and Sichuanese, with no preset language tags required. It also recognizes song lyrics and ancient poetry, and holds up in noisy multi-speaker environments.
The TTS side includes three variants: a base model with curated premium voices and fine-grained control over speech rate and emotion; a VoiceDesign model that generates entirely new voices from text descriptions alone, with no reference audio; and a VoiceClone model that replicates a voice from a 30-second audio sample without training or fine-tuning.
Why It Matters
Most open-source voice tools handle either recognition or synthesis, not both. MiMo-V2.5 delivers a full pipeline from speech input to voice output, which is what agent-based workflows need. The dialect support is particularly notable. Voice tools like OmniVoice cover many languages, but few handle regional Chinese dialects with dedicated optimization.
The VoiceDesign model stands out. Instead of cloning an existing voice, creators describe the voice they want in plain text and the model generates it. No reference recordings, no parameter tuning. For podcasters, game developers, and content creators who need distinct character voices, this removes a significant production bottleneck.
Key Details
- ASR model: 8B parameters, end-to-end, bilingual Mandarin-English with code-switching
- TTS models: Base (premium voices), VoiceDesign (text-described voices), VoiceClone (30-sec samples)
- Dialect support: Wu, Cantonese, Hokkien, Sichuanese, Henan, Northeastern, Taiwanese Mandarin
- ASR benchmark: 5.73% WER on Open ASR Leaderboard (vs. Whisper large-v3 at 7.44%)
- Song lyrics: 3.95% WER on m4singer dataset
- License: Apache 2.0 (ASR model, open weights)
- Requirements: Python 3.12, CUDA 12.0+
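The WER figures above are easy to reproduce against your own test sets: word error rate is the word-level edit distance (substitutions + deletions + insertions) between hypothesis and reference, divided by the reference word count. A minimal stdlib sketch of the generic metric (not MiMo-specific scoring code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length.

    Assumes a non-empty, whitespace-tokenized reference. Real ASR scoring
    pipelines usually normalize case and punctuation first.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # Rolling dynamic-programming row: prev[j] is the edit distance
    # between the first i-1 reference words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            )
        prev = cur
    return prev[-1] / len(ref)

# One dropped word out of six reference words -> WER of 1/6 (~16.7%)
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Leaderboard numbers like the 5.73% above are typically averaged across several normalized test sets, so expect small differences from a naive local computation.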
What to Do Next
Download the ASR model from HuggingFace or clone the GitHub repo for local inference. The TTS models are available through Xiaomi's MiMo Open Platform. For a broader comparison of voice tools, see our AI voice cloning comparison.
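For the HuggingFace route, a sketch using `huggingface_hub.snapshot_download` to pull the weights locally. The repo id below is a placeholder assumption, not confirmed by the release notes; check Xiaomi's HuggingFace page for the actual model name before running:

```python
from huggingface_hub import snapshot_download

# Hypothetical repo id -- verify the real name on HuggingFace before use.
REPO_ID = "XiaomiMiMo/MiMo-V2.5-ASR"


def fetch_model(local_dir: str = "./mimo-asr") -> str:
    """Download all files from the model repo and return the local path."""
    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


if __name__ == "__main__":
    print("Model files in:", fetch_model())
```

From there, follow the GitHub repo's inference instructions; remember the stated requirements of Python 3.12 and CUDA 12.0+.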