ScenemaAI released Scenema Audio on Hugging Face and GitHub this week, an open-weights expressive text-to-speech and zero-shot voice cloning model built on the audio half of Lightricks' LTX-2 audiovisual stack. The release ships an audio diffusion transformer with a Gemma 3 12B text encoder, supports 13 languages, and runs at roughly real-time speed on a 24 GB consumer GPU. The inference code is MIT-licensed; the model weights are under the LTX-2 Community License.
Try it: Clone a Voice in Under an Hour
The fastest path is the Docker setup in the scenema-audio repo: clone, accept the Gemma 3 gate on Hugging Face, run the container, and the first start pulls about 38 GB of weights into a Docker volume. Once it is warm, point the inference server at a 10-20 second reference clip of any voice (your own narration, a public-domain reading, a licensed VO clip) and prompt the model with both the line and a stage direction, for example "whispered, exhausted, near tears." The model treats the prompt as a scene description, not just a script, so you get pacing, breath, and emotional arcs in a single pass instead of stitching takes together in a DAW.
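Before sending a reference clip to the server, it can help to confirm that it falls inside the 10-20 second window mentioned above. The following is a minimal sketch using only Python's standard library. It assumes the clip is an uncompressed WAV file, and the helper names (`clip_duration_seconds`, `is_usable_reference`) are hypothetical: they are not part of the scenema-audio repo.

```python
import wave

# Bounds taken from the 10-20 second guidance in the article.
MIN_SECONDS = 10.0
MAX_SECONDS = 20.0


def clip_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path: str,
                        min_s: float = MIN_SECONDS,
                        max_s: float = MAX_SECONDS) -> bool:
    """Hypothetical pre-flight check: True if the clip length is in range."""
    return min_s <= clip_duration_seconds(path) <= max_s
```

Calling `is_usable_reference("my_voice.wav")` on your own recording gives a quick yes or no before you spend a generation on a clip that is too short or too long. Other formats such as MP3 or FLAC would need a different reader.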
Why It Matters
The open-weights TTS shelf has been crowded since OpenVoice and F5-TTS, but most options either lack emotion control or require pitch-shift hacks for child voices and scene ambience. Scenema Audio is the first release with MIT-licensed inference code to offer stage-direction prompts and native scene-aware audio (rain, crowds, thunder mixed inline) at usable real-time factors. The project demo page shows the model switching emotion within a single generation: angry to laughing in one render, with no segment splicing. For solo creators producing audiobook narration, game dialogue, or YouTube voice-over without a vocal booth, the practical difference is that a dub no longer has to go through ElevenLabs' paid API to sound directed.
Key Details
Hardware: 16 GB VRAM gets you INT8 weights with the Gemma encoder streaming on CPU (slow, about 7 seconds per chunk). 24 GB is the default config (Gemma NF4 on GPU, 0.2 seconds per chunk). 48 GB unlocks bf16 quality. On an RTX 4090 the real-time factor is 0.66x to 1.57x depending on chunk size. Supported languages: English, German, French, Spanish, Italian, Portuguese, Japanese, Chinese, Korean, Russian, Arabic, Hindi, Swahili. The 22B parent model is Lightricks' LTX-2, the same stack now used for open-source 4K video plus audio generation. A reference YouTube walkthrough is up at youtu.be/VnEQ_ImOaAc.
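The hardware tiers above can be sketched as a small lookup. This is an illustration only: the `Tier` class, `pick_tier`, and `total_chunk_seconds` are hypothetical names, not settings or functions from the scenema-audio repo. The per-chunk timings are the figures quoted in this article, and the 48 GB tier is left without a timing because none is given.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    min_vram_gb: int
    description: str
    per_chunk_seconds: Optional[float]  # None where the article quotes no figure


# Ordered from most to least VRAM so the first match is the best tier.
TIERS = [
    Tier(48, "bf16 weights", None),
    Tier(24, "default: Gemma NF4 on GPU", 0.2),
    Tier(16, "INT8 weights, Gemma encoder on CPU", 7.0),
]


def pick_tier(vram_gb: float) -> Optional[Tier]:
    """Return the highest tier the given VRAM supports, or None below 16 GB."""
    for tier in TIERS:
        if vram_gb >= tier.min_vram_gb:
            return tier
    return None


def total_chunk_seconds(tier: Tier, n_chunks: int) -> Optional[float]:
    """Multiply the quoted per-chunk figure by the number of chunks."""
    if tier.per_chunk_seconds is None:
        return None
    return tier.per_chunk_seconds * n_chunks
```

With these numbers, a 10-chunk job on the 16 GB tier accumulates about 70 seconds of the quoted per-chunk cost, against about 2 seconds on the 24 GB default. That gap is why the 16 GB configuration is described as slow.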
What to Do Next
If you already have a ComfyUI or other local-inference rig with 24 GB+ VRAM, swap Scenema Audio into your narration step and A/B it against your current voice-clone tool on one paid clip. If you do not have local hardware, watch for ComfyUI nodes and Replicate endpoints to land in the next two weeks (the LTX-2 video model got both within ten days). For broader context on the expressive-TTS race, our 2026 AI music and audio guide and Inworld TTS 2 deep dive cover the closed-source benchmarks Scenema Audio is now competing with.