Xiaomi released MiMo-V2.5 on April 22, a complete open-source voice pipeline pairing an 8-billion-parameter speech recognizer with three text-to-speech models. The ASR model is available on GitHub and HuggingFace, while the TTS models are accessible through Xiaomi's MiMo Open Platform.
For the broader landscape, see our complete producer guide to AI music and audio in 2026.
What Happened
MiMo-V2.5 ships as two complementary systems. The ASR side is a single 8B-parameter end-to-end model that handles speech recognition across Mandarin, English, and multiple Chinese dialects, including Wu, Cantonese, Hokkien, and Sichuanese, with no preset language tags required. It also recognizes song lyrics and ancient poetry, and holds up in noisy multi-speaker environments.
The TTS side includes three variants: a base model with curated premium voices and fine-grained control over speech rate and emotion; a VoiceDesign model that generates entirely new voices from text descriptions alone, with no reference audio; and a VoiceClone model that replicates a voice from a 30-second audio sample without training or fine-tuning.
Why It Matters
Most open-source voice tools handle either recognition or synthesis, not both. MiMo-V2.5 delivers a full pipeline from speech input to voice output, which is what agent-based workflows need. The dialect support is particularly notable. Voice tools like OmniVoice cover many languages, but few handle regional Chinese dialects with dedicated optimization.
The VoiceDesign model stands out. Instead of cloning an existing voice, creators describe the voice they want in plain text and the model generates it. No reference recordings, no parameter tuning. For podcasters, game developers, and content creators who need distinct character voices, this removes a significant production bottleneck.
Key Details
- ASR model: 8B parameters, end-to-end, bilingual Mandarin-English with code-switching
- TTS models: Base (premium voices), VoiceDesign (text-described voices), VoiceClone (30-sec samples)
- Dialect support: Wu, Cantonese, Hokkien, Sichuanese, Henan, Northeastern, Taiwanese Mandarin
- ASR benchmark: 5.73% WER on Open ASR Leaderboard (vs. Whisper large-v3 at 7.44%)
- Song lyrics: 3.95% WER on m4singer dataset
- License: Apache 2.0 (ASR model, open weights)
- Requirements: Python 3.12, CUDA 12.0+
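The WER figures above are easy to reproduce against your own test sets: word error rate is the word-level edit distance (substitutions + deletions + insertions) between hypothesis and reference, divided by the reference word count. A minimal stdlib sketch of the generic metric (not MiMo-specific scoring code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length.

    Assumes a non-empty, whitespace-tokenized reference. Real ASR scoring
    pipelines usually normalize case and punctuation first.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # Rolling dynamic-programming row: prev[j] is the edit distance
    # between the first i-1 reference words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            )
        prev = cur
    return prev[-1] / len(ref)

# One dropped word out of six reference words -> WER of 1/6 (~16.7%)
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Leaderboard numbers like the 5.73% above are typically averaged across several normalized test sets, so expect small differences from a naive local computation.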
What to Do Next
Download the ASR model from HuggingFace or clone the GitHub repo for local inference. The TTS models are available through Xiaomi's MiMo Open Platform. For a broader comparison of voice tools, see our AI voice cloning comparison.
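For the HuggingFace route, a sketch using `huggingface_hub.snapshot_download` to pull the weights locally. The repo id below is a placeholder assumption, not confirmed by the release notes; check Xiaomi's HuggingFace page for the actual model name before running:

```python
from huggingface_hub import snapshot_download

# Hypothetical repo id -- verify the real name on HuggingFace before use.
REPO_ID = "XiaomiMiMo/MiMo-V2.5-ASR"


def fetch_model(local_dir: str = "./mimo-asr") -> str:
    """Download all files from the model repo and return the local path."""
    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


if __name__ == "__main__":
    print("Model files in:", fetch_model())
```

From there, follow the GitHub repo's inference instructions; remember the stated requirements of Python 3.12 and CUDA 12.0+.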