Mistral AI released Voxtral TTS, a 4-billion-parameter open-weights text-to-speech model that the company says matches or beats ElevenLabs on naturalness benchmarks. The model supports nine languages and can clone a voice from just three seconds of reference audio.

What Happened

Voxtral TTS is Mistral's first entry into speech generation, built on the Ministral 3B backbone. The model combines a 3.4B transformer decoder, a 390M flow-matching acoustic transformer, and a 300M neural audio codec into a single pipeline that generates lifelike speech at 70ms latency for typical inputs.

The weights are available on Hugging Face under a CC BY-NC 4.0 license. API access costs $0.016 per 1,000 characters on Mistral's platform, and the model is also available in Le Chat and Mistral Studio.
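To put the per-character price in concrete terms, here is a minimal sketch of a cost estimate. It assumes billing is a straight multiple of the character count as measured by Python's len(); how Mistral actually counts characters (whitespace, markup, rounding) is not stated here, and estimate_cost_usd is a hypothetical helper, not part of any Mistral SDK.

```python
# Price quoted in the article: $0.016 per 1,000 characters.
PRICE_PER_1K_CHARS_USD = 0.016


def estimate_cost_usd(text: str) -> float:
    """Rough API cost for synthesizing `text`.

    Assumption: billing scales linearly with len(text); the real
    billing rules may count characters differently.
    """
    return len(text) / 1000 * PRICE_PER_1K_CHARS_USD


if __name__ == "__main__":
    # A 100,000-character script, roughly a long-form narration.
    script = "word " * 20_000
    print(f"{len(script):,} characters -> ${estimate_cost_usd(script):.2f}")
```

Under that assumption, a 100,000-character script comes to about $1.60.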

Why It Matters

Open-weights TTS models have lagged behind commercial APIs in quality, and Mistral says Voxtral closes that gap. In the human evaluations the company cites, the model scored higher on naturalness than ElevenLabs Flash v2.5 and matched the quality of ElevenLabs v3, which has been the industry benchmark for production voice work.

The voice cloning capability needs only three seconds of reference audio to capture a speaker's rhythm, pauses, and emotional expression. It also handles cross-lingual adaptation out of the box: a voice cloned from a French prompt keeps its natural French intonation when it speaks English. For creators building multilingual content, that removes a significant production step.

Key Details

  • Languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic
  • Latency: 70ms time-to-first-audio for a 10-second voice sample with 500 characters of text
  • Real-time factor: 9.7x, meaning it generates audio nearly 10 times faster than playback speed
  • Duration: Natively generates up to two minutes per call, with the API handling longer content via smart interleaving
  • Watermarking: All generated audio includes SynthID watermarks
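The real-time factor and the two-minute limit translate into simple arithmetic. The sketch below assumes the 9.7x figure holds on your hardware (it will vary with GPU and batch size) and treats the two-minute limit as a hard cap per call. Both function names are hypothetical, and calls_needed only matters when self-hosting, since the API is described as handling longer content itself.

```python
import math

# Figures quoted in the list above.
REAL_TIME_FACTOR = 9.7        # seconds of audio produced per second of compute
MAX_SECONDS_PER_CALL = 120    # native limit of two minutes per call


def generation_seconds(audio_seconds: float, rtf: float = REAL_TIME_FACTOR) -> float:
    """Approximate compute time needed to synthesize `audio_seconds` of speech."""
    return audio_seconds / rtf


def calls_needed(audio_seconds: float, max_per_call: float = MAX_SECONDS_PER_CALL) -> int:
    """Number of native calls required if long audio is split at the per-call limit."""
    return math.ceil(audio_seconds / max_per_call)


if __name__ == "__main__":
    for seconds in (10, 120, 300):
        print(
            f"{seconds:>4}s of audio: ~{generation_seconds(seconds):.1f}s to generate, "
            f"{calls_needed(seconds)} call(s)"
        )
```

At 9.7x, a full two-minute clip takes a little over 12 seconds to generate, and a five-minute narration would need three native calls.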

The model puts Mistral in direct competition with ElevenLabs, Deepgram, and OpenAI in the voice agent market. Unlike those closed alternatives, Voxtral ships with open weights, so developers can self-host and customize the model for their own pipelines, although the CC BY-NC 4.0 license restricts that to non-commercial use.

What to Do Next

Creators who need voice-over, narration, or voice agent capabilities should test Voxtral against their current provider. Because cloning needs only three seconds of reference audio, it is quick to evaluate on real content. Mistral's earlier releases, including Mistral Small 4 and Mistral Forge, are also worth pairing with it for text-plus-voice workflows. For a broader look at where open-source speech models stand, see our Fish Audio S2 and Hume TADA coverage.