On May 15, 2026, Supertone released Supertonic 3, a 99M-parameter open-weights text-to-speech engine that runs on CPU and now speaks 31 languages. The model ships with ONNX assets, expression tags for laughter and breath sounds, and zero-shot voice cloning from a short reference clip. Synthesis happens entirely on the user's device, with no API calls and no per-second pricing.
How to Try Supertonic 3 in 10 Minutes
Install the Python package and synthesize your first clip without a GPU. The official quickstart on GitHub takes four lines once the PyPI release is installed:
- Run pip install supertonic on any Python 3.10 or newer environment.
- Load the engine with tts = TTS(auto_download=True). The 99M-parameter model downloads once and caches locally.
- Pick a preset voice with tts.get_voice_style(voice_name="M1"), or pass a 5 to 10 second reference WAV for zero-shot cloning.
- Call tts.synthesize(text, voice_style=style, lang="en") and save the WAV. Expression tags like <laugh>, <breath>, and <sigh> work inline inside the text string.
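The steps above can be sketched as a small helper. The method names get_voice_style and synthesize, and their keyword arguments, come from the steps themselves. Everything else is an assumption made for illustration: the import path (from supertonic import TTS), the idea that synthesize returns a flat sequence of mono float samples in the range -1.0 to 1.0 (the real return type may differ), and the helper names write_wav, narrate, and main, which are hypothetical. The file is written with Python's standard wave module in the 44.1 kHz, 16-bit format the article describes.

```python
import struct
import wave

SAMPLE_RATE = 44_100  # 44.1 kHz, the output rate stated in the article


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    # Clamp each sample, then scale it to the signed 16-bit range.
    pcm = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono (assumption)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


def narrate(tts, text, out_path, voice_name="M1", lang="en"):
    """Run the quickstart steps on an already-loaded engine object."""
    style = tts.get_voice_style(voice_name=voice_name)
    # Assumption: synthesize returns a flat sequence of float samples.
    samples = tts.synthesize(text, voice_style=style, lang=lang)
    write_wav(out_path, samples)
    return out_path


def main():
    # Assumption: the package exposes TTS at the top level.
    from supertonic import TTS

    tts = TTS(auto_download=True)  # downloads once, then uses the local cache
    narrate(tts, "Welcome back to the show. <laugh> Let's get started.", "intro.wav")
```

Call main() after installing the package. If the library returns a different structure from synthesize (for example a tuple or an array with a batch dimension), adapt the line marked as an assumption before writing the file.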
Output is 44.1 kHz, 16-bit, ready for video voiceovers, podcast intros, or embedded narration in apps. The same ONNX weights also load through Flutter, .NET 9, Go, and the browser via onnxruntime-web, so a single voice profile carries across desktop, mobile, and web pipelines.
Why It Matters
Most high-quality TTS systems still route through paid cloud APIs at roughly $15 to $25 per million characters. MarkTechPost reports that Supertonic 3 reduces repeat and skip failures versus v2 while expanding from 5 to 31 supported languages, which puts it in the same intelligibility band as larger cloud TTS models in the 0.7B to 2B parameter range. For a creator producing 50 long-form narrations a month, running locally is the difference between a recurring $200 cloud bill and a one-time download. Because the model runs on CPU alone, it also works offline on a laptop or Raspberry Pi, which is useful for field recording or air-gapped client work.
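The $200 figure can be sanity-checked with quick arithmetic. The sketch below uses the $15 to $25 per million characters range quoted above. The function name and the 200,000 characters per narration are hypothetical values chosen for illustration, not figures from Supertone or MarkTechPost.

```python
def monthly_cloud_cost(narrations, chars_per_narration, usd_per_million_chars):
    """Estimate a monthly cloud TTS bill from character volume."""
    total_chars = narrations * chars_per_narration
    return total_chars * usd_per_million_chars / 1_000_000


# Hypothetical workload: 50 narrations of 200,000 characters each.
for rate in (15, 20, 25):
    cost = monthly_cloud_cost(50, 200_000, rate)
    print(f"${rate}/M chars -> ${cost:,.2f} per month")
```

Under these assumptions the workload is 10 million characters a month, which costs $150 to $250 across the quoted price range and $200 at the $20 midpoint.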
What Is Different in Version 3
Three changes matter for working creators. First, the language list grew from 5 to 31 ISO codes, adding Japanese, Arabic, Hindi, Ukrainian, Vietnamese, and 21 others. Speaker similarity holds across the shared-language set, so a single cloned voice can narrate the same script in multiple languages without sounding like a different person. Second, expression tags are new in v3. Inline <laugh>, <breath>, and <sigh> let scripted narration carry the small mouth sounds that podcasts and audiobooks rely on. Third, the Hugging Face model card and the live Hugging Face Space publish the full ONNX assets, with the code under the permissive MIT license and the model under OpenRAIL-M, which clears most commercial creator workflows.
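Because the tags are typed inline, a typo such as <lagh> is easy to miss in a long script, and how the engine treats an unknown tag is not stated here. A minimal pre-flight check, sketched below, flags anything outside a known set. Only the three tags named in the article are treated as known, so the real supported set may be larger. The names KNOWN_TAGS and find_unknown_tags are hypothetical.

```python
import re

# Only the tags named in the article; the real supported set may be larger.
KNOWN_TAGS = {"laugh", "breath", "sigh"}

# Matches simple inline tags such as <laugh>.
TAG_PATTERN = re.compile(r"<([A-Za-z_]+)>")


def find_unknown_tags(text):
    """Return tag names in the script that are not in KNOWN_TAGS."""
    return [name for name in TAG_PATTERN.findall(text) if name not in KNOWN_TAGS]


script = "Thanks for listening. <laugh> See you next week. <lagh>"
print(find_unknown_tags(script))
```

Running the check before synthesis keeps a misspelled tag from ending up in a finished narration.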
What to Do Next
If you already run a self-hosted voice stack, swap Supertonic 3 in as your default narrator and keep your cloud TTS for the languages and expressive ranges it still wins on. For PDF, EPUB, and DOCX narration, pair Supertonic 3 with OpenReader v3 to keep the whole document-to-audio pipeline local. For voice cloning specifically, compare zero-shot quality against Scenema Audio, the LTX 2.3-derived open voice model from earlier this week, before you commit to one stack.
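The "local by default, cloud where it still wins" advice can be sketched as a small routing function. This is only an illustration: the language codes in CLOUD_PREFERRED_LANGS are placeholders to replace with whatever your own listening tests show, and the names pick_backend and needs_expressive_range are hypothetical.

```python
# Placeholder codes: fill in the languages where your cloud TTS still sounds better.
CLOUD_PREFERRED_LANGS = {"xx", "yy"}


def pick_backend(lang, needs_expressive_range=False):
    """Return "local" for on-device synthesis, or "cloud" for the exceptions."""
    if lang in CLOUD_PREFERRED_LANGS or needs_expressive_range:
        return "cloud"
    return "local"


for job_lang in ("en", "xx"):
    print(job_lang, "->", pick_backend(job_lang))
```

Keeping the exception list explicit makes it easy to shrink as you re-test languages against newer local releases.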