Inworld AI released Realtime TTS-2 on May 5 as a research preview, claiming sub-200 millisecond time-to-first-audio, voice cloning from 5 to 15 second clips, and a single voice identity that holds across more than 100 languages. The model accepts plain-English direction like "tired but warm after a long day" instead of preset emotion tags, and ranks above Google and ElevenLabs on the Artificial Analysis Speech Arena leaderboard. For voice creators, the release collapses a tradeoff that has split the text-to-speech market for two years: sub-200 millisecond latency for live agents on one side, frontier-grade quality for narration on the other. TTS-2 is the first hosted model to claim wins on both axes at the same time.

Background

Inworld built its early reputation on character voices for game studios. The company spent 2023 and 2024 quietly building a research team, shipped TTS 1.5 in late 2024, then pushed past Google, ElevenLabs, OpenAI, and Cartesia on the Speech Arena leaderboard with a follow-on update in early 2025. By the spring of this year Inworld held three of the top five Speech Arena slots simultaneously, a result the rest of the field treated as a benchmark anomaly. The May 5 release of Realtime TTS-2 is the company's argument that the leaderboard position is not noise. The model ships as a research preview, but the surface area is large: a streaming REST endpoint, a realtime WebSocket endpoint that speaks the OpenAI Realtime protocol with Inworld extensions, first-party Node and Python SDKs, and integrations with LiveKit Agents, Pipecat, Cloudflare, DeepInfra, GMI Cloud, Stream, and VoiceRun, per Testing Catalog's launch coverage.

Deep Analysis

One model that wins both narration and live agents

The TTS market through April 2026 was a two-vendor split. ElevenLabs owned narration: audiobook houses, indie podcast producers, and YouTube creators paid $11 to $99 a month for voices that sounded recorded rather than generated, and tolerated several seconds of latency to get them. Cartesia owned conversational agents: customer support bots, voice-first apps, and live game NPCs paid for sub-300 millisecond first-audio latency and lived with a quality ceiling that fell short of ElevenLabs in side-by-side listening tests. Every creator team building both a podcast and a chat agent had to maintain two SDKs, two billing relationships, and two distinct prompt styles.

TTS-2 collapses the split. The sub-200 millisecond time-to-first-audio claim sits below Cartesia's published numbers and is fast enough for full-duplex live agents that ping-pong with a user every 700 milliseconds. The Speech Arena ranking sits above ElevenLabs and Google in blind A/B tests, which is the qualitative bar narration buyers care about. The same model produces both modes, which means a creator running an audiobook line and a voice agent line on top of the same character can use one voice identity, one cloning sample, and one set of stability presets. The technical achievement is significant; the business consequence is bigger: a single API call replaces what used to be a vendor decision.
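
What that consolidation could look like in practice: a minimal sketch in which one voice identity serves both a narration render and a low-latency stream. The endpoint paths, field names, and preset keys below are assumptions for illustration; only the one-API, one-voice shape comes from the launch material.

```python
# Sketch: one voice identity serving both narration and a live agent.
# Endpoint paths, field names, and the "stability" preset key are assumed
# for illustration; Inworld has not published the full preview schema here.
import requests

API_KEY = "YOUR_INWORLD_KEY"        # placeholder credential
BASE = "https://api.inworld.ai"     # assumed base URL
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
VOICE_ID = "narrator-elena"         # hypothetical cloned-voice ID

# Narration: render a full line in one call, favoring reproducibility.
narration = requests.post(
    f"{BASE}/tts/v1/voice",         # assumed synthesis path
    headers=HEADERS,
    json={
        "voiceId": VOICE_ID,
        "text": "Chapter one. The harbor was quiet that morning.",
        "stability": "stable",      # hypothetical preset name
    },
)

# Live agent: stream the same voice, chunk by chunk, for low first-byte latency.
with requests.post(
    f"{BASE}/tts/v1/voice:stream",  # assumed streaming path
    headers=HEADERS,
    json={"voiceId": VOICE_ID, "text": "Sure, I can help with that."},
    stream=True,
) as resp:
    for chunk in resp.iter_content(chunk_size=4096):
        pass  # feed each audio chunk straight into the playback buffer
```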

Voice direction as a creator interface

The interface change is as load-bearing as the latency claim. ElevenLabs and most competing TTS systems expose emotion as a preset list: Happy, Sad, Angry, Neutral, with a few sliders for intensity. Voice actors who use these tools spend hours nudging sliders to coax a delivery that sounds natural rather than cartoonish. TTS-2 replaces the preset list with natural-language direction. A producer can write "tired but warm after a long day, ending the line on an exhale" directly in the request payload, and the model conditions delivery on the prose. The same approach drives advanced voice design, which generates an entirely new voice from a prose description ("a forty-year-old radio DJ from Brooklyn with a small smoker's rasp") without any reference audio at all.
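A sketch of both prose-driven modes follows, with hypothetical field names (`description`, `voiceDescription`) standing in for whatever the production schema actually calls them; the prose-as-control-surface idea is the documented part.

```python
# Sketch of prose-driven direction and prose-driven voice design.
# Endpoint paths and JSON field names are assumptions for illustration.
import requests

HEADERS = {"Authorization": "Bearer YOUR_INWORLD_KEY"}

# Direct an existing voice with a plain-English session note.
directed = requests.post(
    "https://api.inworld.ai/tts/v1/voice",           # assumed path
    headers=HEADERS,
    json={
        "voiceId": "narrator-elena",                  # hypothetical voice
        "text": "I'll lock up. You go on home.",
        "description": "tired but warm after a long day, "
                       "ending the line on an exhale",
    },
)

# Design a brand-new voice from prose alone, no reference audio.
designed = requests.post(
    "https://api.inworld.ai/tts/v1/voices:design",    # assumed path
    headers=HEADERS,
    json={
        "voiceDescription": "a forty-year-old radio DJ from Brooklyn "
                            "with a small smoker's rasp",
    },
)
```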

For creators with a writing-room background, this is the same shift that prompt-driven image generation imposed on art directors three years ago. The skill that gets paid is the ability to write specific, concrete sensory direction. The skill that gets devalued is the ability to operate a vendor-specific control surface. Three stability modes (Expressive, Balanced, Stable) sit on top of the prose interface as a coarse-grained dial: pick Expressive when creative variance is the point, Stable when the line is going into a published audiobook and reproducibility matters, and Balanced when neither extreme is right. Inline non-verbal markers (whispers, sighs, laughter) drop into the text at exact timestamps so a director can place a sigh at 1.2 seconds without rendering, listening, and re-prompting.
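
The marker syntax has not been published in detail, so the bracketed `[sigh@1.2s]` token below is invented for illustration; the point is that timestamped placement lives in the text itself, next to the stability preset.

```python
# Sketch of coarse-grained stability plus an inline non-verbal marker.
# The bracketed marker syntax is invented for illustration; the launch
# materials describe timestamped markers but not their exact notation.
import requests

payload = {
    "voiceId": "narrator-elena",   # hypothetical voice
    # A sigh placed at 1.2 s, no render-listen-reprompt loop required.
    "text": "I thought you'd left. [sigh@1.2s] I really did.",
    "stability": "expressive",     # assumed keys: expressive | balanced | stable
}
resp = requests.post(
    "https://api.inworld.ai/tts/v1/voice",   # assumed path
    headers={"Authorization": "Bearer YOUR_INWORLD_KEY"},
    json=payload,
)
```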

Conversational awareness, voice cloning, and the live-agent stack

The architectural feature that makes the live-agent claim credible is conversational awareness. Most TTS systems treat each line in isolation: the model sees a string and emits audio, with no memory of what the user just said or how the previous line landed. TTS-2 conditions on prior audio context, so when a user gets quiet or asks a clarifying question, the model's response shifts tone automatically. A character that was upbeat in line three lands the next line with appropriate concern when the user's reply turned somber. This is not a parameter tweak. It is a different inference graph that takes audio plus text instead of text alone, and it is the feature that lets a single TTS model produce believable two-way conversations rather than well-narrated monologues.
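
A sketch of what that conditioning implies at the API level: the synthesis request carries the user's last audio turn alongside the text. The `context` field name is an assumption; the audio-plus-text inference shape is what the launch material describes.

```python
# Sketch of context-conditioned synthesis: the request carries recent audio
# (the user's last turn) alongside the text to synthesize. Field names are
# assumptions for illustration.
import base64
import requests

with open("user_last_turn.wav", "rb") as f:
    user_audio_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "https://api.inworld.ai/tts/v1/voice",   # assumed path
    headers={"Authorization": "Bearer YOUR_INWORLD_KEY"},
    json={
        "voiceId": "support-kai",             # hypothetical voice
        "text": "Of course. Take all the time you need.",
        # Prior audio lets the model hear that the user's tone turned somber
        # and soften its delivery without an explicit emotion tag.
        "context": {"audio": user_audio_b64, "encoding": "wav"},
    },
)
```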

Voice cloning rounds out the live-agent toolkit. A 5 to 15 second clip uploaded via a single POST to /voices/v1/voices:clone produces a custom voice that the model treats as a first-class identity. Cloned voices inherit the same crosslingual coverage as built-in voices, so a creator can record one English clip and use the resulting voice for Japanese, Spanish, Arabic, and Portuguese deliveries with mid-utterance language switching. Inworld's cloning sample length lands between Cartesia's roughly three seconds and ElevenLabs' minute-or-more requirement, which positions TTS-2 as a realistic option for creators who can capture short recordings but cannot reasonably ask talent for studio sessions. And because the realtime WebSocket endpoint is wire-compatible with the OpenAI Realtime protocol, any client built against OpenAI voice agents can swap providers in minutes without restructuring the audio plumbing.
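
Putting the cloning flow together, using the /voices/v1/voices:clone path from the launch coverage; the request body and response fields are assumptions.

```python
# Minimal cloning sketch. The /voices/v1/voices:clone path comes from the
# launch coverage; the request body and response shape are assumptions.
import base64
import requests

HEADERS = {"Authorization": "Bearer YOUR_INWORLD_KEY"}

with open("host_sample_10s.wav", "rb") as f:   # a 5 to 15 s clean recording
    sample_b64 = base64.b64encode(f.read()).decode()

clone = requests.post(
    "https://api.inworld.ai/voices/v1/voices:clone",
    headers=HEADERS,
    json={"displayName": "podcast-host", "audio": sample_b64},  # assumed fields
)
voice_id = clone.json()["voiceId"]             # assumed response field

# The clone inherits crosslingual coverage: same identity, Japanese delivery.
resp = requests.post(
    "https://api.inworld.ai/tts/v1/voice",     # assumed path
    headers=HEADERS,
    json={"voiceId": voice_id, "text": "ようこそ、今週のエピソードへ。"},
)
```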

Pricing, licensing, and the creator-tier picture

Inworld is metering by audio output time on a pay-as-you-go schedule with volume tiers, with no model-side price change for customers upgrading from TTS 1.5. The company has not published a per-character or per-minute rate sheet for TTS-2 specifically, which is the one piece of the launch that requires a sales conversation rather than a self-service signup. For comparison, ElevenLabs charges $0.015 per 1,000 characters on the Creator tier and $0.18 per 1,000 characters on the Business tier, Cartesia is closer to $0.10 per 1,000 characters on its production tier, and OpenAI Realtime bills by audio token at roughly $0.06 per minute. The missing rate sheet is a research-preview gap Inworld will need to close within weeks; the technical substance is already shipping in production for early-access teams.
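
A back-of-envelope conversion puts those quoted rates on one axis. The ~900 characters-per-minute figure (roughly 150 words per minute at six characters per word, spaces included) is an assumption, and TTS-2 itself has no published rate to plug in yet.

```python
# Convert the per-1,000-character rates quoted above into per-minute costs.
# CHARS_PER_MIN is an assumed speech-rate conversion, not a vendor figure.
CHARS_PER_MIN = 900

per_1k_chars = {
    "ElevenLabs Creator": 0.015,
    "ElevenLabs Business": 0.18,
    "Cartesia production": 0.10,
}

for vendor, rate in per_1k_chars.items():
    per_min = rate * CHARS_PER_MIN / 1000
    print(f"{vendor}: ~${per_min:.4f}/min")
print("OpenAI Realtime: ~$0.0600/min (billed by audio token)")

# Prints roughly $0.0135, $0.1620, and $0.0900 per minute respectively:
# the spread a future TTS-2 rate sheet will be judged against.
```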

| Capability | Inworld TTS-2 | ElevenLabs v3 | Cartesia Sonic 2 | OpenAI Realtime | Hume Octave |
| --- | --- | --- | --- | --- | --- |
| Time-to-first-audio | Sub-200 ms | 500 ms to 2 s | ~290 ms | ~320 ms | ~500 ms |
| Voice cloning sample | 5 to 15 s | 1+ minute | ~3 s | Not available | ~30 s |
| Languages with one voice | 100+ | 32 | 15 | ~50 | 11 |
| Conversational awareness | Yes | No | Limited | Yes | Yes |
| Voice design from prose | Yes | No | No | No | Yes |
| Plain-English direction | Yes | Limited (v3) | No | Limited | Yes |

Impact on Creators

For audiobook narrators, podcast producers, and indie audio drama teams, the practical change is that one TTS model is now plausibly the only TTS dependency in the stack. A team running an interview podcast can clone the host's voice for cold-open intros, swap to a Stable mode for sponsor reads, and use the same voice in Spanish and Japanese for international cuts without licensing a second model. For game studios building voiced NPCs, the conversational awareness feature lifts the realism ceiling for systemic dialogue, which is the genre most punished by isolated-line TTS. For voice agent builders working with LiveKit Agents or Pipecat, the OpenAI Realtime protocol compatibility means the migration cost is a configuration change rather than a refactor.
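
A sketch of that configuration change for a client that already speaks the OpenAI Realtime protocol directly over a WebSocket; the Inworld URL and the `voiceId` extension key are assumptions.

```python
# Sketch of the "config change, not refactor" migration: a client already
# speaking the OpenAI Realtime protocol points its WebSocket at Inworld.
# The Inworld URL and the "voiceId" extension field are assumptions.
import asyncio
import json
import websockets  # pip install websockets

async def main():
    url = "wss://api.inworld.ai/realtime"   # assumed; was wss://api.openai.com/v1/realtime
    headers = {"Authorization": "Bearer YOUR_INWORLD_KEY"}
    # websockets>=14 uses additional_headers; older releases call it extra_headers.
    async with websockets.connect(url, additional_headers=headers) as ws:
        # Standard OpenAI Realtime session.update event; "voiceId" stands in
        # for Inworld's protocol extensions.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"voiceId": "support-kai"},
        }))
        async for message in ws:
            event = json.loads(message)
            if event.get("type", "").endswith("audio.delta"):
                pass  # base64 audio chunk -> playback buffer

asyncio.run(main())
```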

The piece creators should not skip is the voice direction interface. Treating prose direction as a craft skill, with the same care a director would bring to a session note for a human voice actor, separates teams that produce flat reads from teams that produce performances. The model rewards specificity. "Tired but warm" is better than "warm." "Tired but warm, ending the line on an exhale, with a quarter-second pause before the final word" is better still. The teams that ship the best-sounding TTS-2 work in the first six months will be the ones that reuse the muscle memory from prompt-driven image generation rather than the muscle memory from emotion-slider TTS tools.

Key Takeaways

  • Inworld TTS-2 collapses the narration-versus-conversation split that defined the TTS market through April 2026, claiming sub-200 ms latency and Speech Arena top-rank quality from one model.
  • Plain-English direction replaces preset emotion tags, which rewards creators who can write specific sensory prose and devalues vendor-specific slider expertise.
  • Conversational awareness conditions delivery on prior audio, which is what makes the live-agent claim credible for full two-way dialogue rather than well-narrated monologues.
  • Voice cloning from a 5 to 15 second clip plus crosslingual coverage across 100+ languages reframes the localization economics for creator-tier teams.
  • OpenAI Realtime protocol compatibility makes provider swaps inside LiveKit Agents or Pipecat a configuration change rather than a refactor.

What to Watch

Three things will decide whether TTS-2 holds the lead it is claiming. First, the public price sheet. Research-preview status is acceptable for two or three weeks, but the creator economy moves on per-character and per-minute rates that fit into a Stripe metering line. The vendors that lose creator share are the ones that force a sales call. Second, the open-weights response. Voicebox-class projects are closing the quality gap with hosted incumbents at a faster cadence than the image-generation timeline did, and a strong open release before the end of Q3 would compress TTS-2's window to lock in audio-creator workflows. Our Voicebox open-source voice studio guide is the closest local-first comparison point today. Third, the ElevenLabs counterstrike. The company's recent platform revamp set the pricing and feature bar TTS-2 is now claiming to clear; a sub-200 millisecond v4 release with conversational awareness would put the market back into a two-vendor war rather than a one-model coronation. The next 60 days are the window in which all three start to resolve.