Xiaomi has launched the MiMo-V2 family of AI models, headlined by two releases that matter for creators: MiMo-V2-Omni, a multimodal model that handles audio, images, video, and text in a single system, and MiMo-V2-TTS, a speech synthesis engine with contextual emotion awareness and the ability to sing.
What Happened
The MiMo-V2 launch includes three models. MiMo-V2-Pro is the flagship language model with over 1 trillion parameters and a 1M-token context window, but it is the Omni and TTS variants that carry the most significance for creative workflows.
MiMo-V2-Omni processes audio, images, video, and text through a unified architecture. On the MMAU-Pro benchmark for multimodal understanding, it scores 69.4, beating Gemini 3 Pro's score of 65.0. It can handle over 10 hours of continuous audio input, making it capable of processing full-length podcasts, lectures, or video content in a single pass.
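For developers, a single-pass call over a long recording might look something like the sketch below, assuming the platform exposes an OpenAI-compatible chat API; the base URL, model name, and audio payload format here are illustrative guesses, not confirmed details.

```python
# Hypothetical sketch: analyzing a full-length episode in one pass.
# Base URL, model name, and audio format are assumptions, not
# documented MiMo-V2 API details; check platform.xiaomimimo.com.
import base64

from openai import OpenAI

client = OpenAI(
    base_url="https://platform.xiaomimimo.com/v1",  # assumed base URL
    api_key="YOUR_API_KEY",
)

with open("full_episode.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("ascii")

response = client.chat.completions.create(
    model="mimo-v2-omni",  # assumed model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Summarize this episode and list chapter timestamps."},
            # Mirrors OpenAI's audio-input convention; Xiaomi's actual
            # schema may differ.
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "mp3"}},
        ],
    }],
)
print(response.choices[0].message.content)
```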
MiMo-V2-TTS goes beyond standard text-to-speech by adding contextual emotion awareness. The model reads the surrounding text to determine appropriate emotional tone, produces paralinguistic events like coughs, sighs, and laughter, and can generate singing in a unified speech-plus-singing synthesis pipeline. It was pretrained on over 100 million hours of audio data.
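To make the idea concrete, here is a hypothetical request sketch; the endpoint path, model name, and JSON fields are assumptions for illustration, not documented MiMo-V2-TTS API parameters.

```python
# Hypothetical sketch: emotion-aware TTS request. Endpoint, model
# name, and fields are illustrative assumptions; consult the official
# docs at platform.xiaomimimo.com for the real schema.
import requests

payload = {
    "model": "mimo-v2-tts",  # assumed model identifier
    "text": "We finally made it. (sighs) What a year it has been.",
    # Surrounding context the model could read to choose a tone:
    "context": "A narrator reflects, relieved, on a difficult project.",
    "voice": "default",
    "format": "mp3",
}

resp = requests.post(
    "https://platform.xiaomimimo.com/v1/audio/speech",  # assumed endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()

with open("narration.mp3", "wb") as f:
    f.write(resp.content)  # save the synthesized audio
```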
Why It Matters
For creators working with audio and video, MiMo-V2 represents a significant new option. The Omni model's ability to ingest hours of multimedia content opens up analysis, transcription, and comprehension workflows that previously required chaining together multiple specialized tools.
The TTS model is particularly interesting for content creators. Contextual emotion means the voice output responds to what is being said, not just how it is marked up. A sad passage sounds sad. A joke lands with the right timing. The singing capability adds another dimension, potentially useful for jingles, musical content, or creative projects that blend speech and music.
Pricing is also notable. MiMo-V2-Pro's API starts at $1 per million input tokens, significantly undercutting comparable Western models while approaching their performance levels. The models currently support Mandarin (with regional dialects) and English.
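As a quick back-of-envelope illustration of that rate (the transcript size below is an assumed figure, not a published token count):

```python
# Rough cost estimate at MiMo-V2-Pro's quoted $1 per million input tokens.
price_per_million = 1.00      # USD per 1M input tokens
transcript_tokens = 100_000   # assumed size of a multi-hour transcript
cost = transcript_tokens / 1_000_000 * price_per_million
print(f"${cost:.2f}")         # -> $0.10
```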
Key Details
- MiMo-V2-Pro: 1T+ parameters, 1M-token context, $1/M input tokens
- MiMo-V2-Omni: Multimodal (audio, image, video, text), MMAU-Pro 69.4, 10+ hour audio input
- MiMo-V2-TTS: Emotion-aware speech, paralinguistic events, singing synthesis, 100M+ hours pretraining
- Languages: Mandarin (with dialects) and English
- Access: Demo at aistudio.xiaomimimo.com, API at platform.xiaomimimo.com
- Integration: Already available in Kingsoft WPS Office
What to Do Next
If you create audio content, podcasts, or video, the MiMo-V2-TTS model is worth experimenting with through the demo site. The emotion-aware synthesis could replace or supplement existing TTS tools, especially for content that needs natural-sounding narration rather than flat, robotic delivery.
The Omni model's long-context audio processing is relevant for anyone who works with extended recordings. Processing a full podcast episode or multi-hour video in a single API call, without chunking and reassembling, simplifies workflows and reduces the risk of losing context at segment boundaries. As competition in the multimodal space intensifies between established players and new entrants like Xiaomi, creators benefit from more choices at lower price points.