OpenBMB released VoxCPM2, a 2 billion parameter text-to-speech model that runs on 8GB VRAM, supports 30 languages at 48kHz studio quality, and can design entirely new voices from natural language descriptions. The model is open-sourced under Apache 2.0.
For the broader landscape, see our complete producer guide to AI music and audio in 2026.
What Happened
VoxCPM2 takes a different approach to speech synthesis by skipping discrete tokenization entirely. Instead, the model uses a diffusion autoregressive architecture built on the MiniCPM-4 backbone to generate continuous speech representations directly. The result is more natural-sounding output at 48kHz, trained on over 2 million hours of multilingual speech data.
The model ships with three distinct capabilities. Voice Design lets creators describe a voice in plain text and get a matching synthetic voice. Controllable Cloning reproduces a voice from a short audio sample with optional style guidance. Ultimate Cloning captures finer vocal nuances using reference audio paired with transcripts. All three modes work across 30 languages without language tags, meaning the model detects and switches languages automatically.
Real-time performance is practical: VoxCPM2 hits a real-time factor of 0.3 on an RTX 4090, dropping to 0.13 with Nano-VLLM acceleration. The 8GB VRAM requirement means it runs on consumer GPUs including the RTX 4060 and 4070.
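To make the real-time factor concrete: RTF is synthesis time divided by audio duration, so an RTF below 1.0 means the model generates speech faster than it plays back. A minimal sketch using only the figures reported above (not part of VoxCPM2's codebase):

```python
def generation_time(audio_seconds: float, rtf: float) -> float:
    """Real-time factor (RTF) = synthesis time / audio duration,
    so synthesis time = audio duration * RTF."""
    return audio_seconds * rtf

# A 60-second narration on an RTX 4090 at RTF 0.3:
print(generation_time(60, 0.3))   # 18.0 seconds to render
# The same clip with Nano-VLLM acceleration at RTF 0.13:
print(generation_time(60, 0.13))  # ~7.8 seconds to render
```

In practice this means a minute of narration renders in well under half a minute on supported consumer GPUs.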
Why It Matters for Creators
Voice design from text descriptions opens a workflow that previously required hiring voice actors or manually mixing voice characteristics. A creator can type "warm female narrator, slight British accent, mid-30s" and get a usable synthetic voice for video narration, podcasts, or game dialogue. Combined with the open Apache 2.0 license, this makes VoxCPM2 usable in commercial projects without licensing fees.
The 30-language support without manual language switching is particularly useful for creators producing multilingual content. Our earlier coverage of ElevenLabs going on-premise showed demand for local voice AI. VoxCPM2 fills a similar gap for creators who prefer open-source tools they fully control.
Key Details
Parameters: 2 billion
Training data: 2M+ hours of multilingual speech
Output quality: 48kHz studio quality
VRAM required: 8GB minimum
Languages: 30 (automatic detection, no tags needed)
License: Apache 2.0 (free commercial use)
Token rate: 6.25Hz
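The 6.25Hz token rate pairs with the 48kHz output rate in a simple way: each autoregressive step corresponds to a fixed span of output audio. A quick derivation from the two figures in the list above (a sketch for intuition, not code from the VoxCPM2 repository):

```python
sample_rate = 48_000  # output audio sample rate, in samples per second
token_rate = 6.25     # autoregressive generation steps per second of audio

# Each continuous token accounts for this many output samples:
samples_per_token = sample_rate / token_rate
print(samples_per_token)  # 7680.0

# Tokens the model must generate for one minute of speech:
tokens_per_minute = token_rate * 60
print(tokens_per_minute)  # 375.0
```

The low token rate is what keeps long-form generation tractable: a minute of 48kHz audio takes only 375 generation steps rather than tens of thousands.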
What to Do Next
Clone the GitHub repository and follow the setup instructions. Weights are available on HuggingFace and ModelScope. Try the live demo on HuggingFace Spaces first to test voice design and cloning before running locally. Documentation is at voxcpm.readthedocs.io.
This story was covered by Creative AI News.