Miso Labs released MisoTTS, an 8-billion-parameter text-to-speech model, on June 3 with open weights, one-shot voice cloning, and SilentCipher watermarking enabled by default. The model targets the gap between cloud TTS services that gate emotional control behind subscription tiers and existing open-weights TTS models that sound flat in narration-length output.

How to Integrate MisoTTS Into Your Voice Pipeline

If you already record narration for shorts, podcast intros, or course modules, the workflow is short: clone the MisoTTS GitHub repo, install with pip on Python 3.10+, and feed the inference script a 10-second reference clip of your own voice plus the text you want spoken. The model handles voice cloning through audio context conditioning, so you do not need a separate fine-tuning step. The Llama-architecture backbone runs in bfloat16 on a single high-VRAM CUDA GPU, so the model can stay on-prem for clients who do not allow voice data to leave the network. SilentCipher watermarking is applied automatically to every generation.
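Before running inference, it can help to confirm that the reference clip is roughly the length the workflow expects. The sketch below is a minimal pre-flight check that uses only Python's standard `wave` module, so it applies to uncompressed PCM WAV files. The function names and the two-second tolerance are illustrative assumptions and are not part of MisoTTS.

```python
import wave

TARGET_SECONDS = 10.0     # reference clip length mentioned in the workflow
TOLERANCE_SECONDS = 2.0   # hypothetical slack, not a MisoTTS requirement


def clip_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path: str) -> bool:
    """Check that the clip is within the tolerance of the target length."""
    duration = clip_duration_seconds(path)
    return abs(duration - TARGET_SECONDS) <= TOLERANCE_SECONDS
```

Calling `is_usable_reference("my_voice.wav")` on your own recording (the file name is a placeholder) returns `True` when the clip falls between 8 and 12 seconds under these assumed settings.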

Why It Matters

Open-weights TTS at 8B parameters changes the math for indie creators who run their own narration. Cloud services like ElevenLabs charge per character for the high-emotion tiers and require sending source text plus voice samples to a third party. A self-hosted model with one-shot cloning collapses both costs into a fixed hardware bill, and the default audio watermark gives a defensible provenance trail if a clone is ever misused. Coverage from MarkTechPost frames MisoTTS as a Sesame-CSM descendant, which suggests the architecture has prior art for conversational pacing rather than the flat read-aloud cadence that limited earlier open releases like MOSS Audio 8B.

Key Details

MisoTTS pairs a 7.7B Llama-architecture backbone with a 300M audio decoder that autoregressively predicts higher-order audio codebooks within each frame. The system tokenizes audio with Mimi, using 32 codebooks and a 2,051-entry audio vocabulary, and is organized as a hierarchical residual vector quantization (RVQ) transformer. English is the only supported language at launch. Weights ship under a modified MIT license that Miso Labs posted alongside the model card, with API access listed as "coming soon" on the company blog. The GitHub repository hit 1.2k stars within 24 hours of release. The team credits the watermarking layer to Sony's SilentCipher implementation, which embeds a robust signal that survives common audio transformations.
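These figures allow a rough sizing estimate. The sketch below works out the memory needed for the weights alone at two bytes per bfloat16 parameter, and the number of audio tokens a clip of a given length produces. The 12.5 Hz frame rate is Mimi's published rate and is assumed here to carry over to MisoTTS. The weight figure excludes activations, the KV cache, and the codec, so treat it as a lower bound.

```python
BACKBONE_PARAMS = 7_700_000_000   # 7.7B backbone, from the article
DECODER_PARAMS = 300_000_000      # 300M audio decoder, from the article
TOTAL_PARAMS = BACKBONE_PARAMS + DECODER_PARAMS

BYTES_PER_BF16 = 2                # bfloat16 stores each value in 16 bits
NUM_CODEBOOKS = 32                # Mimi codebooks, from the article
FRAME_RATE_HZ = 12.5              # assumption: Mimi's published frame rate


def weight_memory_gb(params: int, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


def audio_token_budget(seconds: float) -> tuple[int, int]:
    """Return (frames, total codebook tokens) for a clip of this length."""
    frames = int(seconds * FRAME_RATE_HZ)
    return frames, frames * NUM_CODEBOOKS


print(weight_memory_gb(TOTAL_PARAMS))   # 16.0
print(audio_token_budget(10.0))         # (125, 4000)
```

Under these assumptions, the weights occupy about 16 GB before any runtime overhead, and a 10-second reference clip contributes 125 frames, or 4,000 codebook tokens, to the conditioning context.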

What to Do Next

If you need narration for a project this week, pull the repo and test MisoTTS against your current pipeline on three samples: a 60-second intro, a 30-second mid-roll, and an emotional beat that you usually flag for human re-records. Compare the output against your current TTS subscription, including how the SilentCipher watermark interacts with your post-processing chain. For paid-tier creators currently using ElevenLabs Music v2, treat MisoTTS as a narration-only complement, not a replacement: music generation is a separate problem. Watch for the official Miso Labs hosted API to land, which the company said will follow shortly after the open-weights drop.
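One way to keep the three-sample comparison honest is to run every backend on identical scripts and log the results side by side. The harness below is a sketch under stated assumptions: each backend is wrapped in a hypothetical callable that takes the script text and returns raw audio bytes. Neither MisoTTS nor any cloud service is known to expose that exact interface, so you would write the thin wrappers yourself.

```python
import time
from typing import Callable, Dict, List

# Hypothetical interface: script text in, raw audio bytes out.
Synthesizer = Callable[[str], bytes]


def run_comparison(
    samples: Dict[str, str],
    backends: Dict[str, Synthesizer],
) -> List[dict]:
    """Run every backend on every sample and record timing and output size."""
    results = []
    for sample_name, text in samples.items():
        for backend_name, synthesize in backends.items():
            start = time.perf_counter()
            audio = synthesize(text)
            elapsed = time.perf_counter() - start
            results.append(
                {
                    "sample": sample_name,
                    "backend": backend_name,
                    "generation_seconds": elapsed,
                    "audio_bytes": len(audio),
                }
            )
    return results
```

Timing and byte counts only cover the mechanical side. The emotional beat still needs a listening pass, ideally after the audio has gone through your usual post-processing chain.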