Choosing a text-to-speech voice for narration, podcasts, or character work just got easier to reason about. A widely shared open-source project, tts-bench, has been overhauled and now compares 46 text-to-speech models side by side using both objective scores and blind human voting, giving creators a single place to see which voices actually hold up.

What Happened

The tts-bench project expanded its leaderboard to cover 46 TTS models, split into 12 with fixed preset voices and 34 that can clone a voice from a short reference clip. Alongside the raw benchmark, a companion TTS Voting Arena runs blind A/B listening tests where model names are hidden, so rankings reflect what people actually prefer rather than marketing claims. The cloning arena has logged 397 human-preference votes across 28 models so far.

Why It Matters

Most TTS comparisons lean on a single demo clip or a vendor's own numbers. This benchmark separates three things creators care about: speed, naturalness, and cloning fidelity. Speed is measured as time-to-first-audio (how long you wait before playback can start) and real-time factor (how synthesis time compares with the length of the audio produced) across CPU, NVIDIA CUDA, and Apple Silicon, so you can tell whether a model is fast enough for live or batch work. Lightweight models like Kokoro post the lowest GPU latency, which matters if you are generating dozens of voiceover takes a day.
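To make the two speed metrics concrete, here is a minimal Python sketch of how they could be measured for any streaming model. It is not tts-bench's own harness: the `measure` function, the `fake_synthesize` stand-in, and the 16-bit mono PCM format are all assumptions made for illustration. It also assumes the common convention that real-time factor is synthesis time divided by audio duration, so values below 1 mean faster than real time; some projects report the inverse.

```python
import time
from typing import Callable, Dict, Iterable


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Synthesis time divided by audio duration (below 1.0 is faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def measure(
    synthesize: Callable[[str], Iterable[bytes]],
    text: str,
    sample_rate: int,
    bytes_per_sample: int = 2,  # assumption: 16-bit mono PCM chunks
) -> Dict[str, float]:
    """Time one streaming synthesis call and return TTFA, RTF, and audio length."""
    start = time.perf_counter()
    first_chunk_at = None
    total_bytes = 0
    for chunk in synthesize(text):
        if first_chunk_at is None:
            # The first chunk marks the moment playback could begin.
            first_chunk_at = time.perf_counter()
        total_bytes += len(chunk)
    end = time.perf_counter()

    if first_chunk_at is None:
        raise ValueError("synthesizer produced no audio")

    audio_seconds = total_bytes / (sample_rate * bytes_per_sample)
    return {
        "ttfa_s": first_chunk_at - start,
        "rtf": real_time_factor(end - start, audio_seconds),
        "audio_s": audio_seconds,
    }


def fake_synthesize(text: str) -> Iterable[bytes]:
    """Hypothetical stand-in for a real model: five chunks of 0.1 s silence at 24 kHz."""
    for _ in range(5):
        time.sleep(0.01)  # pretend the model is working
        yield b"\x00" * 4800  # 24000 samples/s * 2 bytes * 0.1 s


if __name__ == "__main__":
    print(measure(fake_synthesize, "Hello, listeners.", sample_rate=24000))
```

Swapping `fake_synthesize` for a wrapper around a real model gives numbers you can compare across your own hardware. Run each model once before timing if you want warm figures rather than cold-start ones.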

Key Details

On the objective side, the benchmark reports UTMOS for naturalness, word error rate (WER) for intelligibility, and SIM for how closely a clone matches the reference. The maintainers note that a combined quality score was pulled and is being redesigned to avoid a single misleading number. On CPU-only setups, Piper remains the speed leader at roughly 107 ms warm time-to-first-audio. In the blind cloning votes, OmniVoice, Echo-TTS, and IndexTTS-2 currently top the table, though the notes flag that the leading model occasionally garbles words. For a second opinion on rankings, the independent Artificial Analysis speech leaderboard tracks a similar blind-preference Elo rating across commercial and open models.
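Two of the objective metrics above are simple enough to sketch in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's code: WER is computed here as word-level edit distance divided by reference length after lowercasing and whitespace splitting, and SIM is treated as cosine similarity between two speaker-embedding vectors, which is the common approach. The article does not say which speech recognizer or embedding model tts-bench uses, so the function names and toy inputs are hypothetical. UTMOS is a learned predictor and is not reproduced here.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions, insertions, deletions) over reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # The script you asked for versus a transcript of what the model actually said.
    print(word_error_rate("the quick brown fox", "the quick fox"))
    # Toy embeddings; real speaker embeddings have hundreds of dimensions.
    print(cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.85, 0.05]))
```

In practice the hypothesis text comes from running a speech recognizer over the generated audio, so a model that garbles words shows up as a higher WER even when it sounds natural.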

What to Do Next

Shortlist two or three models that fit your hardware, then trust your own ears over any leaderboard position. Run the same script through each, listen on the headphones or speakers your audience uses, and check pronunciation on the names and jargon specific to your niche. If you need open weights you can self-host, our breakdown of Higgs Audio v3 is a good starting point for a production voice pipeline. Bookmark the benchmark and re-check it monthly, since the cloning models in particular are moving fast.