Cartesia Sonic-3.5 and Ink-2: #1 Streaming Voice AI

Cartesia launched Sonic-3.5 and Ink-2 on June 15, 2026, a paired text-to-speech and speech-to-text release that the company says ranks first on both the Artificial Analysis streaming leaderboards for naturalness and accuracy. Founder Karan Goel framed it plainly: Cartesia is now the only provider holding the number one streaming model for both speaking and listening at the same time. For anyone building a real-time voice product, that combination, low latency on both ends from a single vendor, is the actual story.

What Cartesia Launched

The release is two models that are meant to be used together. Sonic-3.5 is the text-to-speech side: it turns text into speech with what Cartesia describes as more natural pacing, rhythm, and emotional range than its previous Sonic-3 model, plus cleaner read-out of codes, IDs, and alphanumeric strings. Ink-2 is the speech-to-text side: a streaming transcription model with built-in turn detection and semantic endpointing, tuned to stay accurate in noisy real-world audio.

Both are built on Cartesia's state space model architecture rather than the transformer stack most voice models use, which the company credits for its low latency. Cartesia reports a sub-300ms median time to first token for Sonic-3.5, low enough that a back-and-forth conversation does not feel like it is buffering. The models are generally available now through Cartesia's WebSocket and REST APIs and a public playground, with SDKs for Python and TypeScript and direct integrations for the Pipecat and LiveKit voice-agent frameworks.
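A vendor-reported median is worth reproducing on your own network before you design around it. The sketch below times how long any streaming source takes to yield its first chunk and reports the median over several runs. Note that `fake_tts_stream` is a hypothetical stand-in that sleeps and then yields silent bytes; it is not Cartesia's SDK, and you would swap in your real streaming call to measure anything meaningful.

```python
import statistics
import time
from typing import Callable, Iterable, Iterator


def time_to_first_chunk(open_stream: Callable[[], Iterable[bytes]]) -> float:
    """Return seconds from opening a stream until its first chunk arrives."""
    start = time.perf_counter()
    stream = iter(open_stream())
    next(stream)  # blocks until the first chunk is produced
    return time.perf_counter() - start


def median_ttfc_ms(open_stream: Callable[[], Iterable[bytes]], runs: int = 5) -> float:
    """Median time to first chunk, in milliseconds, over several runs."""
    samples = [time_to_first_chunk(open_stream) for _ in range(runs)]
    return statistics.median(samples) * 1000.0


def fake_tts_stream(delay_s: float = 0.02) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response (not a real API)."""
    time.sleep(delay_s)      # simulated model and network delay
    yield b"\x00" * 320      # first chunk of fake audio
    yield b"\x00" * 320      # later chunks do not affect the measurement


if __name__ == "__main__":
    print(f"median time to first chunk: {median_ttfc_ms(fake_tts_stream):.1f} ms")
```

Measure from the machine and region your agent will actually run in, since the network leg is part of what the user hears as delay.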

Image: Cartesia now owns both ends of the voice pipeline, with Sonic-3.5 speaking and Ink-2 listening.

Why Holding Both Ends Matters

A live voice agent is a loop: it listens, transcribes, thinks, and speaks, then does it again. Every link in that loop adds delay, and delay is what makes synthetic conversation feel robotic. Most teams assemble this loop from separate vendors, one company's transcription feeding another company's reasoning model feeding a third company's voice. Each handoff is a network hop and a place where turn-taking breaks down, where the agent talks over the user or sits in dead silence.
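To make the cost of handoffs concrete, here is a small latency-budget sketch for one conversational turn. Every number in it is an illustrative placeholder, not a measurement of any vendor; the point is only that each stage contributes processing time plus a hop to the next stage, and the hops add up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    processing_ms: float  # time spent inside the stage itself
    handoff_ms: float     # network hop to reach the next stage


def turn_latency_ms(stages: list[Stage]) -> float:
    """Total delay for one turn: processing plus handoffs across all stages."""
    return sum(s.processing_ms + s.handoff_ms for s in stages)


def handoff_share(stages: list[Stage]) -> float:
    """Fraction of the turn's delay spent moving between stages."""
    return sum(s.handoff_ms for s in stages) / turn_latency_ms(stages)


# Placeholder values for a three-vendor loop; replace with your own measurements.
pipeline = [
    Stage("transcribe", processing_ms=150, handoff_ms=40),
    Stage("reason", processing_ms=400, handoff_ms=40),
    Stage("speak", processing_ms=250, handoff_ms=40),
]

if __name__ == "__main__":
    print(f"turn latency: {turn_latency_ms(pipeline):.0f} ms")
    print(f"spent in handoffs: {handoff_share(pipeline):.0%}")
```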

Owning both the listening model and the speaking model lets Cartesia tune the handoff itself. Ink-2's turn detection knows when a speaker has actually finished a thought rather than just paused, and that signal can trigger Sonic-3.5 to start talking at the right moment. Cartesia's launch page says Sonic-3.5 spans more than 40 languages, though the company's own changelog notes that language coverage is still expanding from an English-first base, so multilingual teams should test their specific languages before committing. The point of the dual release is not a single headline number. It is that the two halves of the conversation come from one place and are optimized to work as a pair.

How Cartesia Compares to the Voice AI Landscape

The voice AI market has split into camps, and no single tool wins every job. Cartesia is making a focused bet on real-time conversational agents, where latency is the whole game. That is a different target from expressive long-form narration or self-hosted open weights. Here is how the major options line up for creators and builders choosing a voice stack in mid-2026.

Where each voice AI option fits in 2026
Tool	Type	Best for	Access
Cartesia Sonic-3.5 plus Ink-2	Streaming TTS and STT	Real-time voice agents, low-latency back-and-forth	Closed API and playground
ElevenLabs	Expressive TTS, dubbing, music	Narration, audiobooks, character voices, localization	Closed API
Higgs Audio v3	Open-weights TTS	Self-hosting, voice cloning, 100-plus languages	Open weights, 4B parameters
NVIDIA Nemotron 3.5 ASR	Open transcription	Fast multilingual speech-to-text on your own hardware	Open, 40 languages

If your product is a phone agent, a live tutor, or an in-app assistant that has to respond in the rhythm of human speech, Cartesia's pitch is aimed squarely at you. If you are producing voiceover, dubbing a video, or generating character performances, ElevenLabs and its expressive long-form voices remain the stronger fit. And if you need to run everything yourself for cost or privacy reasons, the open-weights route through Higgs Audio v3 or an open ASR model like NVIDIA Nemotron 3.5 gives you control at the cost of doing the latency engineering yourself.

Image: Cartesia competes with ElevenLabs and others on latency and owning the full loop.

The Workflow: Wiring a Real-Time Voice Agent

The reason Cartesia ships integrations for Pipecat and LiveKit is that almost nobody builds a voice agent from raw audio sockets anymore. Here is the shape of a typical build using the new models.

1. Capture and transcribe. Stream microphone audio into Ink-2 over a WebSocket. It returns a live transcript and, critically, an endpoint signal when the user has finished speaking, so you are not guessing with a fixed silence timer.

2. Reason. Pass the finished transcript to your language model of choice for the actual logic, the answer, the tool call, the lookup. This is the one piece Cartesia does not provide, and it is deliberately model-agnostic.

3. Speak. Stream the model's text response into Sonic-3.5 and play the audio back as it generates, rather than waiting for the full reply. Because time to first token is low, the agent starts speaking almost immediately.

4. Orchestrate. Wrap the loop in Pipecat or LiveKit, which handle the audio plumbing, interruption handling, and barge-in (letting the user cut the agent off mid-sentence) so you are not rebuilding that yourself.
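The four steps above can be sketched as a chain of async streams. This is a shape-only illustration: `microphone`, `transcribe`, `reason`, and `speak` are hypothetical stubs that stand in for audio capture, Ink-2, your language model, and Sonic-3.5 respectively, and none of them call a real API. In a real build, step 4 means Pipecat or LiveKit would replace `run_turn` and add interruption handling.

```python
import asyncio
from typing import AsyncIterator


async def microphone() -> AsyncIterator[bytes]:
    """Stub audio source: three silent frames."""
    for _ in range(3):
        yield b"\x00" * 320


async def transcribe(audio: AsyncIterator[bytes]) -> str:
    """Step 1 stub. A real build streams frames out and stops on the endpoint signal."""
    frames = [frame async for frame in audio]
    return f"{len(frames)} frames heard"


async def reason(transcript: str) -> AsyncIterator[str]:
    """Step 2 stub: yields the reply piece by piece, as a streaming LLM would."""
    for word in f"You said: {transcript}".split():
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word


async def speak(text: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Step 3 stub: emits one 'audio' chunk per text piece without waiting for the full reply."""
    async for piece in text:
        yield piece.encode("utf-8")


async def run_turn() -> list[bytes]:
    """One listen, reason, speak pass; an orchestration framework would own this loop."""
    transcript = await transcribe(microphone())
    played: list[bytes] = []
    async for chunk in speak(reason(transcript)):
        played.append(chunk)  # a real agent would write this to the speaker
    return played


if __name__ == "__main__":
    print(asyncio.run(run_turn()))
```

Because `speak` consumes `reason` as it yields, the first audio chunk is produced after the first text piece, which is the property that keeps the agent from waiting on the full reply.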

The design choice that makes this practical is that turn detection lives in the transcription model. In a multi-vendor stack you often bolt on a separate voice-activity-detection step; here it is native, which removes one of the most common sources of awkward timing.
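The difference between a fixed silence timer and an endpoint signal is easy to see in a toy example. The first function below ends the turn after a set amount of silence, so a long mid-thought pause cuts the user off; the second simply waits for the recognizer to say the turn is over. The event dictionaries and their `type` and `text` keys are a hypothetical shape chosen for illustration, not Ink-2's actual message format.

```python
from typing import Iterable, Optional

FRAME_MS = 20  # assumed duration of one audio frame


def fixed_timer_endpoint(voiced: Iterable[bool], silence_ms: int) -> Optional[int]:
    """Return the frame index at which a fixed silence timer ends the turn."""
    needed = silence_ms // FRAME_MS
    quiet = 0
    heard_speech = False
    for i, is_voiced in enumerate(voiced):
        if is_voiced:
            heard_speech = True
            quiet = 0
        elif heard_speech:
            quiet += 1
            if quiet >= needed:
                return i
    return None


def signalled_endpoint(events: Iterable[dict]) -> Optional[str]:
    """Collect transcript text until an endpoint event arrives."""
    parts: list[str] = []
    for event in events:
        if event["type"] == "transcript":
            parts.append(event["text"])
        elif event["type"] == "endpoint":
            return " ".join(parts)
    return None


# Speech, a 600 ms thinking pause, more speech, then real silence.
frames = [True] * 10 + [False] * 30 + [True] * 10 + [False] * 40

if __name__ == "__main__":
    print(fixed_timer_endpoint(frames, silence_ms=500))  # fires during the pause
    print(fixed_timer_endpoint(frames, silence_ms=700))  # waits, but adds delay to every turn
```

Raising the timer avoids the early cutoff but makes every reply slower, which is the trade-off a semantic endpoint signal is meant to remove.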

Image: A real-time agent loops Ink-2 transcription, an LLM, and Sonic-3.5 speech.

Limitations and What to Watch

The leaderboard claims rest on Artificial Analysis data, and benchmark rankings move as competitors ship. ElevenLabs and the open-weights camp are not standing still, and a number one streaming score from May data is a snapshot, not a permanent title. Language coverage is the other open question: the gap between the launch page's 40-plus-languages claim and the changelog's English-first framing means multilingual teams should verify their exact languages in the playground before building on them. Pricing is usage-based and tier-dependent, and Cartesia leads with a promotional offer rather than transparent public per-minute rates, so model your real conversation volume before committing.
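Modeling conversation volume can start as simple arithmetic. The sketch below assumes per-minute billing, with transcription charged on the whole conversation and speech charged only on the share of time the agent is talking. Both the billing model and the rates are placeholders made up for illustration; substitute the figures from your actual quote.

```python
def monthly_minutes(conversations_per_day: float, avg_minutes: float, days: int = 30) -> float:
    """Total conversation minutes in a billing month."""
    return conversations_per_day * avg_minutes * days


def monthly_cost(
    minutes: float,
    stt_rate_per_min: float,
    tts_rate_per_min: float,
    agent_talk_share: float,
) -> float:
    """Estimated spend: STT on every minute, TTS only on the agent's speaking share."""
    return minutes * (stt_rate_per_min + tts_rate_per_min * agent_talk_share)


if __name__ == "__main__":
    minutes = monthly_minutes(conversations_per_day=1000, avg_minutes=3)
    # Placeholder rates, not Cartesia's pricing.
    cost = monthly_cost(minutes, stt_rate_per_min=0.01, tts_rate_per_min=0.02, agent_talk_share=0.5)
    print(f"{minutes:.0f} minutes per month, estimated cost {cost:.2f}")
```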

The deeper thing to watch is whether single-vendor voice stacks become the default. For two years the pattern has been to stitch best-of-breed pieces together. Cartesia is betting that for real-time agents, an integrated listen-and-speak pair beats a stitched one. If the latency advantage holds up under load, that bet reshapes how voice products get built.

How to Try It

Start in Cartesia's playground to hear Sonic-3.5 and test Ink-2 on your own audio before writing any code. If the quality holds for your use case, the fastest path to a working agent is the Pipecat or LiveKit starter, which wires both models into a runnable loop in well under an hour. Teams already using the older Sonic-3 should note that professional voice clones do not carry over to the new model automatically and will behave as standard clones, so plan a re-clone pass.

Frequently Asked Questions

What are Cartesia Sonic-3.5 and Ink-2?

Sonic-3.5 is Cartesia's text-to-speech model and Ink-2 is its speech-to-text model, both launched June 15, 2026 as a paired streaming voice stack. Cartesia says each ranks first on the Artificial Analysis streaming leaderboards, which would make Cartesia the only provider with the top model for both generating and transcribing speech at once.

What makes Cartesia different from ElevenLabs?

Cartesia is optimized for real-time, low-latency voice agents where response speed and turn-taking matter most, and it provides both the transcription and the speech models. ElevenLabs leads on expressive long-form narration, dubbing, and character voices. The right choice depends on whether you are building a live conversational agent or producing recorded voiceover.

How low is the latency?

Cartesia reports a sub-300ms median time to first token for Sonic-3.5, fast enough for natural back-and-forth conversation. Ink-2 adds streaming transcription with native turn detection so the agent can tell when a user has actually finished a thought rather than just paused.

Can I self-host these models?

No. Sonic-3.5 and Ink-2 are closed models available through Cartesia's WebSocket and REST APIs and playground. If you need to run voice models on your own hardware, look at open-weights options such as Higgs Audio v3 for text-to-speech or NVIDIA Nemotron 3.5 for transcription instead.

How do I build a voice agent with them?

Stream audio into Ink-2 for transcription and turn detection, pass the finished transcript to any language model for reasoning, then stream the response into Sonic-3.5 for playback. Cartesia ships Python and TypeScript SDKs plus direct integrations for the Pipecat and LiveKit frameworks, which handle the audio orchestration and interruption logic.

What languages do they support?

Cartesia's launch page lists more than 40 languages for Sonic-3.5, but its own changelog describes language support as expanding from an English-first base. Multilingual teams should test their specific target languages in the playground before building production features on them.