Fish Audio released S2, an open-source text-to-speech model trained on over 10 million hours of audio data spanning 50 languages. The model beats GPT-4o-mini-tts on the EmergentTTS-Eval benchmark with an 81.88% win rate, and ships with full weights, fine-tuning code, and a streaming inference engine. It launched on Product Hunt on March 10, 2026, pulling 274 upvotes on day one.

What Happened

S2 uses a dual-autoregressive architecture with reinforcement learning alignment to generate speech that sounds natural across dozens of languages. The standout feature is inline emotion tagging: creators can embed natural-language cues like [laugh], [whispers], or [professional broadcast tone] directly in the input text to control how the model delivers each line.

This goes beyond simple pitch-and-speed control: the tags shape the emotional texture of the output, letting creators script nuanced performances without recording multiple takes or hiring voice actors for each tonal variation.
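To make the format concrete, here is a minimal sketch of a tagged script driving synthesis. The `fish_s2` module and the `synthesize` call are assumed names for illustration, not Fish Audio's documented API; only the [tag] syntax follows the examples above. Check the official docs for the real interface.

```python
# Hypothetical sketch: `fish_s2` and `synthesize(...)` are assumed names,
# not Fish Audio's documented API. Only the [tag] format mirrors the
# examples in this story.
from fish_s2 import synthesize  # hypothetical import

script = (
    "[professional broadcast tone] Welcome back to the show. "
    "Today we are looking at open-source speech models. "
    "[whispers] Here is the part nobody tells you. "
    "[laugh] I could not believe it either."
)

audio_bytes = synthesize(text=script, voice="default")  # assumed signature
with open("demo.wav", "wb") as f:
    f.write(audio_bytes)
```

The point is that the emotional cues live in the text itself: no separate control track, settings file, or per-line configuration is required.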

The release is complete: model weights, fine-tuning code, and a streaming inference engine built for real-time applications. Streaming means developers can integrate S2 into live products like chatbots, accessibility tools, and interactive content without waiting for full audio generation to finish.
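For real-time use, the pattern is to consume audio chunks as they are generated rather than waiting for a finished file. A minimal sketch, assuming a hypothetical `stream_synthesize` generator; the actual streaming engine may expose a different interface, such as websockets or callbacks:

```python
# Hypothetical sketch: `stream_synthesize` is an assumed name for a
# chunk-by-chunk generator; the real streaming engine's interface
# (websockets, callbacks, etc.) may differ.
from fish_s2 import stream_synthesize  # hypothetical import

with open("reply.pcm", "wb") as out:
    for chunk in stream_synthesize(text="Thanks for calling. How can I help?"):
        out.write(chunk)  # each chunk is usable before generation completes
```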

Why It Matters for Creative Professionals

Open-source TTS at this quality level changes the economics of audio content. Podcasters, video creators, and game developers who previously relied on expensive voice synthesis APIs or voice actor sessions now have a free alternative that outperforms a leading commercial model on a standardized benchmark. It joins a growing wave of high-quality open-source audio AI, including IBM's Granite 4.0 Speech model, which now leads the OpenASR Leaderboard for transcription.

The emotion tagging system is particularly useful for narrative content. Audiobook producers, animation studios, and explainer video creators can script emotional beats directly into their text, reducing post-production editing. A single prompt can shift from a whispered aside to a confident broadcast delivery without switching models or recording separate clips.

Support for 50 languages opens up multilingual content creation. Creators building courses, marketing materials, or apps for global audiences can generate natural speech in each language without sourcing voice talent for every one.

Fine-tuning access means creators can train S2 on specific voice profiles or speaking styles. A brand could develop a consistent AI voice for all its content, or a developer could build a character voice library for a game.

Key Details

Model: Fish Audio S2 (open-source, full weights available)

Training data: 10M+ hours of audio across 50 languages

Architecture: Dual-autoregressive with RL alignment

Benchmark: 81.88% win rate vs GPT-4o-mini-tts on EmergentTTS-Eval

Emotion control: Inline natural-language tags ([laugh], [whispers], [professional broadcast tone])

Includes: Weights, fine-tuning code, streaming inference engine

Launch: March 10, 2026 (Product Hunt, 274 upvotes)

What to Do Next

Visit Fish Audio to access the S2 model weights and documentation. The model is also available on Hugging Face and GitHub. If you have a GPU setup for inference, you can run the model locally today. The streaming engine is ready for integration into real-time applications.

Test the emotion tagging on a script you are currently producing. Write a paragraph with [whispers] and [professional broadcast tone] tags to hear the difference firsthand. Compare the output quality against whatever TTS solution you currently use.

If you build multilingual content, test S2 across your target languages. A model trained on 10M+ hours of data across 50 languages is worth benchmarking against your current pipeline, especially if you are paying per character for a commercial API.
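A simple way to run that comparison is to synthesize the same line in each target language and listen side by side. A sketch, reusing the assumed `synthesize` interface from the earlier example:

```python
# Hypothetical sketch: reuses the assumed `synthesize` interface from
# above; swap in the real S2 API from Fish Audio's documentation.
from fish_s2 import synthesize  # hypothetical import

samples = {
    "en": "Welcome to the course.",
    "es": "Bienvenido al curso.",
    "ja": "コースへようこそ。",
}

for lang, line in samples.items():
    audio_bytes = synthesize(text=line)
    with open(f"sample_{lang}.wav", "wb") as f:
        f.write(audio_bytes)
```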


This story was featured in Creative AI News, Week of March 10, 2026. Subscribe for free to get the weekly digest.