Boson AI has released Higgs Audio v3 TTS 4B, a 4-billion-parameter text-to-speech model built specifically for voice chat. Released on June 4, 2026, the model supports 100 languages, delivers zero-shot voice cloning from a short reference clip, and gives developers inline control over emotion, speed, pitch, and sound effects mid-sentence. It runs via an OpenAI-compatible API, making it a drop-in upgrade path for voice AI applications.

What Happened

Boson AI published the Higgs Audio v3 TTS 4B weights on HuggingFace alongside a launch blog post and a full serving cookbook using SGLang-Omni. The model ships under a non-commercial research license; commercial use requires a separate agreement with Boson AI.

The jump from v2 to v3 is substantial. On the MiniMax multilingual benchmark, word error rate (WER) dropped from 49.86 to 2.74, a reduction of roughly 94%. On the CV3 benchmark, WER fell from 21.19 to 4.41, a reduction of roughly 79%. Across all tested benchmarks, v3 achieves a 53.65% win rate against Fish Audio S2 and Qwen3.
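
The percentage reductions follow directly from the before and after scores. A minimal sketch of that arithmetic, using only the numbers quoted above (the function name is illustrative, not part of any Boson AI tooling):

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the relative drop from `before` to `after`, in percent."""
    return (before - after) / before * 100


# WER figures as quoted in the article (v2 score, v3 score).
benchmarks = {
    "MiniMax multilingual": (49.86, 2.74),
    "CV3": (21.19, 4.41),
}

for name, (v2_wer, v3_wer) in benchmarks.items():
    print(f"{name}: {relative_reduction(v2_wer, v3_wer):.1f}% lower WER")
```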

The model architecture is a 36-layer autoregressive decoder producing 24 kHz audio at 25 frames per second, with 8 codebooks and a context window of 8,192 tokens. At 25 frames per second, each frame covers 40 milliseconds of audio, which is short enough for real-time voice chat applications.
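
The frame figures can be checked with a few lines of arithmetic. This sketch uses the published sample rate, frame rate, and codebook count; the codes-per-second figure assumes each codebook contributes one code per frame, which the source does not state explicitly.

```python
SAMPLE_RATE_HZ = 24_000    # 24 kHz output audio
FRAMES_PER_SECOND = 25     # decoder frame rate
CODEBOOKS = 8              # codebooks per frame

# Duration of audio covered by one frame, in milliseconds.
frame_ms = 1000 / FRAMES_PER_SECOND

# Number of audio samples covered by one frame.
samples_per_frame = SAMPLE_RATE_HZ // FRAMES_PER_SECOND

# Assumption: one code per codebook per frame.
codes_per_second = FRAMES_PER_SECOND * CODEBOOKS

print(frame_ms, samples_per_frame, codes_per_second)  # 40.0 960 200
```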

Why It Matters

Most open-weight TTS models handle static narration reasonably well but fall apart on conversational speech. Pauses land in the wrong places, emotional cues get flattened, and foreign words get mispronounced. Higgs Audio v3 is the first open model in this weight class to address all three issues in a single release.

For creators building voice-driven applications, the inline control system is the headline feature. Tags like <|emotion:amusement|>, <|style:whispering|>, <|prosody:speed_very_slow|>, and <|sfx:laughter|> can be inserted at any position in the input string, so delivery can change in the middle of a sentence rather than only between sentences. This is the kind of expressiveness previously available only through paid commercial voice APIs.
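
To illustrate how such tags might be composed programmatically, here is a minimal sketch. The tag syntax and the four tag kinds come from the examples above; the helper itself is hypothetical, and how far a tag's effect extends (to the next tag, or only the adjacent words) is not specified in the source, so check the official documentation before relying on it.

```python
# Tag kinds seen in the article's examples.
ALLOWED_KINDS = {"emotion", "style", "prosody", "sfx"}


def tag(kind: str, value: str) -> str:
    """Format an inline control tag such as <|emotion:amusement|>."""
    if kind not in ALLOWED_KINDS:
        raise ValueError(f"unknown tag kind: {kind!r}")
    return f"<|{kind}:{value}|>"


# Build one input string with controls placed mid-sentence.
line = (
    tag("emotion", "amusement")
    + "You actually did it. "
    + tag("sfx", "laughter")
    + tag("style", "whispering")
    + "Don't tell anyone."
)
print(line)
```

The resulting string would be passed as the text input of a synthesis request.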

The 100-language support is also a meaningful unlock for multilingual content creators. Of those languages, 83 achieve a word or character error rate (WER/CER) under 5%, which puts them in a production-quality range, and the remaining 17 fall between 5% and 10%. Developers who previously needed to stitch together multiple regional models can now use a single self-hosted endpoint. If you have been following AI audio releases, the recent Miso TTS 8B was part of the same wave; Higgs v3 now sets a new multilingual baseline in this category.

Key Details

  • Languages: 100 total (83 with WER/CER under 5%, 17 at 5 to 10%)
  • Voice cloning: Zero-shot from a reference audio clip and transcript
  • Inline controls: 21 emotion tags, 3 style modes, 4 prosody controls, 4 sound effects
  • Audio quality: 24 kHz, 40 ms per frame
  • API format: OpenAI-compatible (POST /v1/audio/speech)
  • Throughput: 14.74 requests per second at 16x concurrency on a single H100
  • License: Non-commercial research only; commercial use requires Boson AI agreement
  • Weights: bosonai/higgs-audio-v3-tts-4b on HuggingFace
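
As a rough reading of the throughput figure, Little's law (concurrency = throughput x average latency) gives an implied average time per request. This is a derived estimate, not a published number, and it assumes the benchmark kept 16 requests in flight at steady state, which the source does not describe.

```python
CONCURRENCY = 16            # concurrent requests, per the article
THROUGHPUT_RPS = 14.74      # requests per second on a single H100

# Little's law: concurrency = throughput * latency, so
# latency = concurrency / throughput (steady-state assumption).
implied_latency_s = CONCURRENCY / THROUGHPUT_RPS

print(f"Implied average time per request: {implied_latency_s:.2f} s")  # 1.09 s
```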

Full API documentation is available at docs.boson.ai.
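
Because the endpoint follows the OpenAI speech format, a request can be sketched with the Python standard library alone. The field names (model, input, voice, response_format) follow OpenAI's /v1/audio/speech convention; the base URL, the voice name, and whether the Higgs server accepts these exact values are assumptions to verify against the official documentation.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: address of your own server


def build_speech_request(
    text: str,
    voice: str = "default",  # hypothetical voice name
    model: str = "bosonai/higgs-audio-v3-tts-4b",
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build (but do not send) a POST request to /v1/audio/speech."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": "wav",  # assumption: WAV output is supported
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str = "speech.wav") -> str:
    """Send the request and write the returned audio bytes to disk."""
    request = build_speech_request(text)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())
    return out_path
```

With a server running, calling synthesize("Hello there.") would write the returned audio to speech.wav.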

What to Do Next

The fastest way to get started is the SGLang-Omni cookbook, which walks through pulling the model weights, launching a local server, and making your first voice cloning request. The minimum hardware is a single H100 GPU, and the model runs via Docker with the lmsysorg/sglang-omni:dev image.

If you want to explore the model before setting up local infrastructure, Boson AI offers a hosted playground in its Workspace. The model weights are on HuggingFace under bosonai/higgs-audio-v3-tts-4b and can be downloaded with the HuggingFace CLI using your HF token.
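
For scripted setups, the same download can be done from Python with the huggingface_hub library instead of the CLI. This is a minimal sketch, assuming huggingface_hub is installed and that your token is stored in the HF_TOKEN environment variable; the helper names are illustrative.

```python
import os

REPO_ID = "bosonai/higgs-audio-v3-tts-4b"  # repository named in the article


def download_kwargs(env=os.environ) -> dict:
    """Build keyword arguments for huggingface_hub.snapshot_download."""
    kwargs = {"repo_id": REPO_ID}
    token = env.get("HF_TOKEN")  # assumption: token kept in HF_TOKEN
    if token:
        kwargs["token"] = token
    return kwargs


def download_weights() -> str:
    """Download the full repository and return the local snapshot path."""
    # Imported lazily so the helpers above work without the library installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs())
```

Calling download_weights() fetches the repository into the local HuggingFace cache and returns the directory path.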

For production deployments, contact Boson AI directly about commercial licensing. The non-commercial research license explicitly prohibits revenue-generating use.