OmniVoice: Open Source TTS Covers 600 Languages

The Next-gen Kaldi team has released OmniVoice, a zero-shot text-to-speech model that supports over 600 languages under an Apache 2.0 license. The model runs 40x faster than real-time and includes voice cloning, voice design, and automatic voice selection out of the box.

For the broader landscape, see our complete producer guide to AI music and audio in 2026.

What Happened

OmniVoice launched on GitHub and Hugging Face on April 1, with an accompanying research paper detailing the architecture. The model was trained on 581,000 hours of multilingual open-source audio data and, with 646 supported languages, is presented as the broadest-coverage zero-shot TTS system available.

Built on a diffusion language model architecture initialized from Qwen3-0.6B, OmniVoice maps text directly to multi-codebook acoustic tokens. This avoids the typical two-stage pipeline most TTS systems use, resulting in a cleaner and more scalable design.

Why It Matters

Most open-source TTS models cover a handful of languages well. OmniVoice covers 646, which opens doors for creators working with multilingual content, localization pipelines, or audiences outside the English-speaking world. Voice cloning works from a single reference audio clip, and voice design lets you specify attributes such as gender, age, pitch, accent, and even whisper style through text prompts.

The 40x real-time inference speed (RTF of 0.025) means you can generate a minute of speech in roughly 1.5 seconds. For creators building voiceovers, podcast intros, or narration at scale, that speed removes a significant bottleneck.
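As a quick check on that arithmetic, here is a minimal sketch of how a real-time factor (RTF) converts into a speedup and a wall-clock estimate. The function names are illustrative and are not part of OmniVoice; the only figure taken from the release is the RTF of 0.025.

```python
def speedup_over_real_time(rtf: float) -> float:
    """Speedup implied by an RTF (processing time divided by audio duration)."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return 1.0 / rtf


def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated wall-clock time needed to generate `audio_seconds` of speech."""
    return audio_seconds * rtf


REPORTED_RTF = 0.025  # best-case figure quoted for OmniVoice

print(f"speedup: {speedup_over_real_time(REPORTED_RTF):.0f}x")
print(f"one minute of speech: {synthesis_seconds(60, REPORTED_RTF):.1f} s")
print(f"one hour of speech: {synthesis_seconds(3600, REPORTED_RTF):.0f} s")
```

The quoted RTF is described as a best case ("as low as"), so real throughput will depend on hardware, batch size, and text length.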

Key Details

Languages: 646 supported (broadest zero-shot TTS coverage to date)
License: Apache 2.0 (permits commercial use)
Speed: RTF as low as 0.025 (40x faster than real-time)
Base model: Qwen3-0.6B with diffusion language model architecture
Training data: 581,000 hours of open-source multilingual audio
Modes: Voice cloning, voice design (text-prompted attributes), auto voice
Non-verbal sounds: Supports laughter, sighs, and other expressions via tags
Pronunciation control: Pinyin with tone (Chinese), CMU dictionary (English)

The authors report competitive or better performance compared to existing open alternatives across Chinese, English, and multilingual benchmarks. A live demo is available on Hugging Face Spaces.

What to Do Next

Try OmniVoice through the Hugging Face demo or install it locally via the GitHub repository. The Apache 2.0 license permits use in commercial projects, provided you follow its attribution and notice requirements. If you work with Fish Audio S2 or Voxtral TTS, OmniVoice is worth benchmarking against your current setup for multilingual coverage.
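For a like-for-like speed comparison, one option is to measure RTF the same way for every engine. The sketch below is engine-agnostic and makes no assumptions about OmniVoice's actual API: `synthesize` stands for any callable you wrap around a TTS system that returns a sequence of audio samples, and `fake_synthesize` is a placeholder so the example runs on its own.

```python
import time
from typing import Callable, List, Sequence


def measure_rtf(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate: int,
) -> float:
    """Return processing time divided by the duration of the generated audio."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start

    audio_seconds = len(samples) / sample_rate
    if audio_seconds == 0:
        raise ValueError("synthesizer returned no audio")
    return elapsed / audio_seconds


def fake_synthesize(text: str) -> List[float]:
    # Placeholder engine (hypothetical): 0.1 s of silence per character at 16 kHz.
    return [0.0] * (len(text) * 1600)


if __name__ == "__main__":
    rtf = measure_rtf(fake_synthesize, "Hello from the benchmark.", sample_rate=16000)
    print(f"RTF: {rtf:.6f}")
```

To compare real engines, replace `fake_synthesize` with a wrapper around each system, run the same sentences through every wrapper on the same hardware, and average over several runs after a warm-up call.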
