OpenBMB released VoxCPM2, a 2 billion parameter text-to-speech model that runs on 8GB VRAM, supports 30 languages at 48kHz studio quality, and can design entirely new voices from natural language descriptions. The model is open-sourced under Apache 2.0.
For the broader landscape, see our complete producer guide to AI music and audio in 2026.
What Happened
VoxCPM2 takes a different approach to speech synthesis by skipping discrete tokenization entirely. Instead, the model uses a diffusion autoregressive architecture built on the MiniCPM-4 backbone to generate continuous speech representations directly. The result is more natural-sounding output at 48kHz, trained on over 2 million hours of multilingual speech data.
The model ships with three distinct capabilities. Voice Design lets creators describe a voice in plain text and get a matching synthetic voice. Controllable Cloning reproduces a voice from a short audio sample with optional style guidance. Ultimate Cloning captures finer vocal nuances using reference audio paired with transcripts. All three modes work across 30 languages without language tags, meaning the model detects and switches languages automatically.
Real-time performance is practical: VoxCPM2 hits a real-time factor of 0.3 on an RTX 4090, dropping to 0.13 with Nano-VLLM acceleration. The 8GB VRAM requirement means it runs on consumer GPUs including the RTX 4060 and 4070.
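To make the real-time factor concrete: RTF is synthesis time divided by audio duration, so an RTF below 1.0 means the model generates speech faster than it plays back. A minimal sketch using only the figures reported above (not part of VoxCPM2's codebase):

```python
def generation_time(audio_seconds: float, rtf: float) -> float:
    """Real-time factor (RTF) = synthesis time / audio duration,
    so synthesis time = audio duration * RTF."""
    return audio_seconds * rtf

# A 60-second narration on an RTX 4090 at RTF 0.3:
print(generation_time(60, 0.3))   # 18.0 seconds to render
# The same clip with Nano-VLLM acceleration at RTF 0.13:
print(generation_time(60, 0.13))  # ~7.8 seconds to render
```

In practice this means a minute of narration renders in well under half a minute on supported consumer GPUs.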
Why It Matters for Creators
Voice design from text descriptions opens a workflow that previously required hiring voice actors or manually mixing voice characteristics. A creator can type "warm female narrator, slight British accent, mid-30s" and get a usable synthetic voice for video narration, podcasts, or game dialogue. Combined with the open Apache 2.0 license, this makes VoxCPM2 usable in commercial projects without licensing fees.
The 30-language support without manual language switching is particularly useful for creators producing multilingual content. Our earlier coverage of ElevenLabs going on-premise showed demand for local voice AI. VoxCPM2 fills a similar gap for creators who prefer open-source tools they fully control.
Key Details
Parameters: 2 billion
Training data: 2M+ hours of multilingual speech
Output quality: 48kHz studio quality
VRAM required: 8GB minimum
Languages: 30 (automatic detection, no tags needed)
License: Apache 2.0 (free commercial use)
Token rate: 6.25Hz
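The 6.25Hz token rate pairs with the 48kHz output rate in a simple way: each autoregressive step corresponds to a fixed span of output audio. A quick derivation from the two figures in the list above (a sketch for intuition, not code from the VoxCPM2 repository):

```python
sample_rate = 48_000  # output audio sample rate, in samples per second
token_rate = 6.25     # autoregressive generation steps per second of audio

# Each continuous token accounts for this many output samples:
samples_per_token = sample_rate / token_rate
print(samples_per_token)  # 7680.0

# Tokens the model must generate for one minute of speech:
tokens_per_minute = token_rate * 60
print(tokens_per_minute)  # 375.0
```

The low token rate is what keeps long-form generation tractable: a minute of 48kHz audio takes only 375 generation steps rather than tens of thousands.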
What to Do Next
Clone the GitHub repository and follow the setup instructions. Weights are available on HuggingFace and ModelScope. Try the live demo on HuggingFace Spaces first to test voice design and cloning before running locally. Documentation is at voxcpm.readthedocs.io.
This story was covered by Creative AI News.