The Next-gen Kaldi team has released OmniVoice, a zero-shot text-to-speech model that supports more than 600 languages and is available under the Apache 2.0 license. The team reports inference up to 40x faster than real time, and the model includes voice cloning, voice design, and automatic voice selection out of the box.

What Happened

OmniVoice launched on GitHub and Hugging Face on April 1, with an accompanying research paper detailing the architecture. The model was trained on 581,000 hours of multilingual open-source audio data, making it the broadest-coverage zero-shot TTS system available.

Built on a diffusion language model architecture initialized from Qwen3-0.6B, OmniVoice maps text directly to multi-codebook acoustic tokens. This avoids the typical two-stage pipeline most TTS systems use, resulting in a cleaner and more scalable design.

Why It Matters

Most open-source TTS models cover a handful of languages well. OmniVoice covers 646, which makes it relevant for creators working with multilingual content, localization pipelines, or audiences outside the English-speaking world. Voice cloning works from a single reference audio clip, and voice design lets you specify attributes such as gender, age, pitch, accent, and a whispered speaking style through text prompts.

At its best reported real-time factor (RTF) of 0.025, or 40x faster than real time, the model can generate a minute of speech in roughly 1.5 seconds (60 s × 0.025). For creators building voiceovers, podcast intros, or narration at scale, that speed removes a significant bottleneck.
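The arithmetic behind those figures is easy to reproduce. The sketch below assumes only the standard definition of RTF (synthesis time divided by audio duration); the function names are illustrative and are not part of OmniVoice.

```python
def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Wall-clock time needed to generate `audio_seconds` of speech at a given RTF."""
    return audio_seconds * rtf


def speedup_over_real_time(rtf: float) -> float:
    """How many times faster than real time a given RTF is."""
    return 1.0 / rtf


if __name__ == "__main__":
    rtf = 0.025  # best-case figure quoted for OmniVoice
    print(f"{speedup_over_real_time(rtf):.0f}x real time")
    print(f"{synthesis_seconds(60, rtf):.1f} s per minute of speech")
    print(f"{synthesis_seconds(3600, rtf):.0f} s per hour of speech")
```

Because 0.025 is quoted as a best case, actual throughput will depend on hardware, batch size, and text length.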

Key Details

  • Languages: 646 supported (broadest zero-shot TTS coverage to date)
  • License: Apache 2.0 (permits commercial use)
  • Speed: RTF as low as 0.025 (40x faster than real-time)
  • Base model: Qwen3-0.6B with diffusion language model architecture
  • Training data: 581,000 hours of open-source multilingual audio
  • Modes: Voice cloning, voice design (text-prompted attributes), auto voice
  • Non-verbal sounds: Supports laughter, sighs, and other expressions via tags
  • Pronunciation control: Pinyin with tone (Chinese), CMU dictionary (English)

The team reports performance competitive with or better than existing open alternatives on Chinese, English, and multilingual benchmarks. A live demo is available on Hugging Face Spaces.

What to Do Next

Try OmniVoice through the Hugging Face demo or install it locally via the GitHub repository. The Apache 2.0 license allows use in commercial projects, provided you keep the license and attribution notices intact. If you work with Fish Audio S2 or Voxtral TTS, OmniVoice is worth benchmarking against your current setup for multilingual coverage.
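For a like-for-like speed comparison, one option is a small harness that measures RTF for any model behind a common interface. The sketch below is model-agnostic: the `Synthesizer` signature and the `fake_synthesizer` stand-in are hypothetical, and you would wrap each real TTS call (OmniVoice or otherwise) to match that signature yourself.

```python
import time
from typing import Callable, Sequence, Tuple

# Hypothetical interface: any function that takes text and returns
# (audio samples, sample rate in Hz). Wrap your own TTS call to match it.
Synthesizer = Callable[[str], Tuple[Sequence[float], int]]


def measure_rtf(synthesize: Synthesizer, texts: Sequence[str]) -> float:
    """Return total synthesis time divided by total audio duration."""
    total_compute = 0.0
    total_audio = 0.0
    for text in texts:
        start = time.perf_counter()
        samples, sample_rate = synthesize(text)
        total_compute += time.perf_counter() - start
        total_audio += len(samples) / sample_rate
    if total_audio == 0:
        raise ValueError("synthesizer produced no audio")
    return total_compute / total_audio


def fake_synthesizer(text: str) -> Tuple[Sequence[float], int]:
    """Stand-in model: emits 0.1 s of silence per character at 16 kHz."""
    sample_rate = 16_000
    return [0.0] * (len(text) * sample_rate // 10), sample_rate


if __name__ == "__main__":
    rtf = measure_rtf(fake_synthesizer, ["Hello world.", "Bonjour le monde."])
    print(f"RTF: {rtf:.4f}")
```

Run every model on the same hardware with the same texts, and discard a warm-up pass first so that model loading does not skew the result.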