Mira Murati's Thinking Machines Lab shipped its first model on May 11, 2026: TML-Interaction-Small, a 276-billion-parameter mixture-of-experts system with 12B active parameters, built around a full-duplex architecture that listens, watches, thinks, and speaks at the same time. The research preview kills the turn-taking dance that defines today's voice agents, and the benchmark gap against OpenAI and Google is wider than the launch deck suggests.

For creators building voice apps, real-time tutors, livestream copilots, and interactive video experiences, this is the first model that treats conversation as a continuous stream rather than a ping-pong of complete utterances. Below is what was actually announced, how it compares to OpenAI's GPT-Realtime-2 and Google's Gemini 3.1 Flash Live, and how the workflow shifts once interaction models hit general availability.

What Happened

Thinking Machines published the "Interaction Models: A Scalable Approach to Human-AI Collaboration" paper and demo reel on May 11, 2026. The headline model is TML-Interaction-Small at 276B total parameters with 12B active, paired with a slower asynchronous background model that handles complex reasoning, web search, and tool calls while the foreground model keeps the conversation flowing.

[Image: full-duplex voice, with the Listen and Speak waveforms active simultaneously]

The core architectural bet is what the team calls "encoder-free early fusion." Instead of running audio and video through heavy separate encoders before fusing them with text tokens, raw signals are processed in 200-millisecond chunks and folded into the model directly. TechCrunch's coverage frames it as turning conversation "into a phone call rather than a text chain," which is the right intuition: input and output tokens are treated as streams, not as user-turn followed by assistant-turn.
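To make the stream-versus-turn distinction concrete, here is a minimal sketch of a 200ms micro-turn loop. It assumes a hypothetical client exposing stream_in and stream_out methods plus simple mic, camera, and speaker wrappers; Thinking Machines has not published an API, so all of the names are illustrative.

```python
import time

MICRO_TURN_S = 0.2  # 200-millisecond chunks, per the paper's description

def duplex_loop(model, mic, camera, speaker):
    """Interleave input and output every micro-turn instead of alternating full turns."""
    while True:
        t0 = time.monotonic()
        audio_chunk = mic.read(MICRO_TURN_S)        # raw audio from the last 200 ms
        video_frame = camera.latest_frame()         # most recent camera frame
        model.stream_in(audio=audio_chunk, video=video_frame)  # raw signals folded in directly
        out = model.stream_out()                    # may be silence, a backchannel, or speech
        if out is not None:
            speaker.play(out)                       # playback overlaps the next input chunk
        time.sleep(max(0.0, MICRO_TURN_S - (time.monotonic() - t0)))  # hold the 200 ms cadence
```

The point of the sketch is the shape of the loop: nothing waits for a completed utterance, so listening and speaking can overlap within the same cadence.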

Access is gated. The research preview is open to roughly 19 partners, with a wider public release "later in the year," according to SiliconAngle's launch writeup. No pricing has been disclosed. Serving runs on SGLang.

The Numbers That Matter

Three benchmarks anchor the comparison. FD-bench V1.5 measures interaction quality on a 0-100 scale, weighing whether the model interrupts gracefully, holds the floor when needed, and recovers from overlapping speech. Audio MultiChallenge measures cross-domain audio intelligence (math while listening to a song, code-switching, instruction-following under audio noise). Turn-taking latency measures how long the model waits before starting to reply once the user falls silent.

[Image: TML 276B model performance gauge in the high zone]

| Metric | TML-Interaction-Small (276B/12B) | GPT-Realtime-2 (minimal) | Gemini 3.1 Flash Live |
| --- | --- | --- | --- |
| FD-bench V1.5 (interaction quality) | 77.8 | 46.8 | 54.3 |
| Audio MultiChallenge (intelligence) | 43.4% | 37.6% | 26.8% |
| Turn-taking latency | 0.40s | 1.18s | 0.57s |
| Architecture | Encoder-free MoE, 200ms micro-turns | Cascade (ASR + LLM + TTS) | Native multimodal, encoder-fused |
| Access | Research preview, ~19 partners | GA via OpenAI Realtime API | GA via Gemini Live API |
Source: Thinking Machines blog (May 11, 2026) and partner-reported benchmarks aggregated by Latent Space's AINews.

The 0.40s turn-taking number is the headline a creator should care about. Below 0.5s is the threshold where conversation stops feeling like a transaction. OpenAI's Realtime API sits at 1.18s in minimal mode, which is why most voice agents built on it still get described as "fast assistants" rather than "conversation partners." Gemini Live closes the gap to 0.57s. TML trims roughly another 30% off that, down to 0.40s.

What's Actually New: Visual Proactivity and Time Awareness

Most "realtime" voice models today are still passive: the user speaks, the model responds. TML-Interaction adds two capabilities that change what creators can build.

[Image: a camera lens connected to a clock icon, representing visual proactivity]

The first is visual proactivity. The model continuously processes video frames and can speak unprompted when something visually meaningful happens. In the demo reel, the model counts pushup reps without being asked, calls out when a user's running form breaks, and answers questions the moment the relevant visual evidence enters the frame. Thinking Machines built five custom benchmarks (TimeSpeak, CueSpeak, RepCount-A, ProactiveVideoQA, Charades) specifically to measure this, because no existing benchmark covers "speak at the right moment in a continuous stream."
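For a sense of what this replaces, here is a sketch of the kind of hand-wired trigger a fitness app needs today: counting reps from a per-frame pose signal and speaking unprompted when a rep completes. The hip_heights signal and the thresholds are assumptions about a typical pose-estimation stack, not anything from the TML paper.

```python
def count_reps(hip_heights, down_thresh=0.4, up_thresh=0.6):
    """Count full down-up cycles from a normalized vertical-position signal (one value per frame)."""
    reps, is_down = 0, False
    for h in hip_heights:
        if not is_down and h < down_thresh:
            is_down = True                 # descended into the rep
        elif is_down and h > up_thresh:
            is_down = False
            reps += 1                      # rep completed: this is the "speak now" moment
            print(f"agent: that's {reps}!")
    return reps
```

A visually proactive model collapses this trigger logic into the model itself, which is what RepCount-A and ProactiveVideoQA are trying to measure.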

The second is internal time awareness. The model knows how much wall-clock time has passed and can act on it. "Remind me to drink water every four seconds" works. "Tell me when 90 seconds have passed" works. The cascade architectures used by GPT-Realtime and most production voice agents can't do this cleanly because they discard time information at the ASR boundary.
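By contrast, a cascade-based agent has to bolt timed behavior on from application code, because wall-clock information is gone by the time text reaches the LLM. A rough sketch of that workaround, with illustrative names rather than any real voice-agent API:

```python
import threading

def schedule_reminder(speak, message, every_s, repeats):
    """Fire a spoken reminder on a wall-clock timer, outside the model."""
    def fire(remaining):
        if remaining == 0:
            return
        speak(message)                                          # push text into the TTS stage
        timer = threading.Timer(every_s, fire, args=(remaining - 1,))
        timer.daemon = True
        timer.start()
    fire(repeats)

# e.g. schedule_reminder(tts.speak, "Time to drink water.", every_s=4.0, repeats=10)
```

An interaction model with internal time awareness handles the same request in-band, with no external scheduler to keep in sync with the conversation state.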

For comparison context, Inworld's Realtime TTS-2 launch on May 5 also targeted sub-second latency, but it's a TTS layer paired with a separate LLM. TML-Interaction is the first model where the voice, vision, and reasoning live in a single forward pass at conversational speed.

Workflow Shift: What Creators Should Plan For

The research preview is partner-only today, so no one is shipping a TML-powered app this week. The right move is to plan the workflow shift now so the moment public access opens, the production pipeline is ready.

Three creator-facing workflows shift once interaction models reach GA:

  1. Voice tutoring and coaching apps. Today's apps using OpenAI Realtime or Gemini Live build a "wait for user to finish, transcribe, reason, speak" loop. With TML-Interaction, the coach can correct form mid-rep, interrupt a wrong answer the instant it's spoken, and hold the floor through a verbal hesitation without losing context. The conversational design budget moves from "what does the agent say back" to "when does the agent say anything at all" (a minimal floor-management sketch follows this list).
  2. Livestream copilots and creator companions. A streamer can now have a copilot that watches the gameplay feed, listens to chat, and speaks into the broadcast at the moment something interesting happens, without explicit commands. The visual proactivity benchmark is doing the work that today requires custom event triggers wired by hand.
  3. Interactive video and roleplay characters. NPCs and AI actors that hold a continuous conversation, react to player gestures captured on webcam, and break the fourth wall when something visually surprising happens. Today's roleplay platforms ship turn-based chat with optional voice. Interaction models make the roleplay-as-conversation experience plausible.
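Here is that floor-management sketch: a minimal policy for the "when does the agent say anything at all" decision, assuming per-micro-turn signals from a VAD and the model. The field names and thresholds are illustrative, not part of any published API.

```python
from dataclasses import dataclass

@dataclass
class TurnState:
    user_speaking: bool       # voice activity detected this micro-turn
    agent_speaking: bool      # agent audio currently playing
    silence_ms: int           # how long the user has been silent
    correction_needed: bool   # e.g. the user just said something wrong mid-answer

def should_speak(state: TurnState) -> str:
    """Return 'interrupt', 'hold', 'backchannel', or 'reply' for this micro-turn."""
    if state.correction_needed and state.user_speaking:
        return "interrupt"        # cut in the instant the error is spoken
    if state.user_speaking and state.agent_speaking:
        return "hold"             # yield the floor; recover the thread afterwards
    if state.user_speaking:
        return "backchannel"      # a short "mm-hm" without taking the floor
    if state.silence_ms < 400:
        return "hold"             # treat a brief pause as hesitation, not end of turn
    return "reply"
```

With a turn-based API, only the last branch exists; the first three are the new design surface.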

The unlock is not raw model intelligence (GPT-5.5 and Gemini 3.1 Flash are stronger reasoners). The unlock is that the interaction surface no longer feels like talking to a transcript.

What to Do Next

If you're shipping a voice product today, three concrete steps make sense this week. First, request access to the Thinking Machines research preview through the company site; the partner list is small but the application window is open. Second, prototype your current voice flow against Gemini Live (see the FAQ below) to get a feel for sub-second latency, then design your conversation model around interrupt-and-recover patterns rather than turn-based dialog. Third, instrument your current voice analytics to measure interaction quality, not just transcription accuracy: capture mean turn-taking latency, interrupt-recovery rate, and silence-misclassification rate as baselines you can compare against once TML-Interaction opens up.
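A sketch of how those baselines could be computed from an event log; the event schema here is an assumption, so adapt the field names to whatever your analytics pipeline already emits.

```python
from statistics import mean

def interaction_baselines(events):
    """events: chronologically ordered dicts with a "type" in
    {"user_stopped", "agent_started", "interrupt", "recovered", "false_endpoint"}
    and a timestamp "t" in seconds."""
    latencies, interrupts, recoveries, false_endpoints, user_stops = [], 0, 0, 0, 0
    last_user_stop = None
    for e in events:
        if e["type"] == "user_stopped":
            user_stops += 1
            last_user_stop = e["t"]
        elif e["type"] == "agent_started" and last_user_stop is not None:
            latencies.append(e["t"] - last_user_stop)      # turn-taking latency sample
            last_user_stop = None
        elif e["type"] == "interrupt":
            interrupts += 1
        elif e["type"] == "recovered":
            recoveries += 1
        elif e["type"] == "false_endpoint":                # agent barged in during a pause
            false_endpoints += 1
    return {
        "mean_turn_taking_latency_s": mean(latencies) if latencies else None,
        "interrupt_recovery_rate": recoveries / interrupts if interrupts else None,
        "silence_misclassification_rate": false_endpoints / user_stops if user_stops else None,
    }
```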

Frequently Asked Questions

Is TML-Interaction-Small open weights?

No. The research preview is API access to a small group of partners, and Thinking Machines has not committed to releasing weights. The blog post calls this a "research preview" and signals a hosted-API path for general availability later in 2026.

How is this different from OpenAI's Realtime API?

OpenAI's Realtime API is a cascade: speech recognition feeds a text LLM, which feeds a text-to-speech model. The pipeline is faster than older approaches but still introduces sequential latency at each stage. TML-Interaction is a single model that processes audio, video, and text natively in 200ms micro-turns, so input and output tokens stream concurrently rather than alternating.

What does "full-duplex" mean in practice?

Full-duplex means the model can listen and speak at the same time, like a phone call. Half-duplex systems (most current voice agents) require one party to finish before the other starts, like a walkie-talkie. With full-duplex, the model can backchannel ("uh huh"), interrupt incorrect answers, and recover from being interrupted itself without losing the conversation thread.

Can I build production apps on it today?

Only if your team is one of the ~19 research preview partners. Public API access is targeted for later in 2026, with no firm date. For production work today, OpenAI's Realtime API and Google's Gemini Live API are the realistic options.

Does it handle video as input?

Yes. Video is processed in the same 200ms micro-turn stream as audio. The visual proactivity benchmarks (RepCount-A, ProactiveVideoQA, Charades) measure the model's ability to act on continuous visual input without explicit user prompts. This is the capability that opens up livestream copilots, interactive video, and gesture-aware roleplay characters.

What happens to background reasoning and tool calls?

The system pairs TML-Interaction-Small with an asynchronous background model. The foreground model keeps the conversation flowing in real time. The background model handles slower work (web search, code execution, multi-step reasoning) and feeds results back when they're ready. This split is closer to how humans think while talking than the all-in-one architecture of GPT-Realtime or Gemini Live.
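The orchestration layer has not been published, but the shape of the split is easy to sketch with asyncio: a fast foreground loop keeps responding while a slow background task finishes and drops its result into a queue. Everything below is illustrative, not TML's actual implementation.

```python
import asyncio

async def background_work(query, results):
    """Slow path: web search, code execution, multi-step reasoning."""
    await asyncio.sleep(2.0)                       # stand-in for a real tool call
    await results.put(f"(background) answer for: {query}")

async def foreground_loop(results):
    """Fast path: keep the conversation flowing while the slow path runs."""
    for _ in range(5):
        try:
            answer = results.get_nowait()          # fold results in as soon as they're ready
            print(f"agent: by the way, {answer}")
        except asyncio.QueueEmpty:
            print("agent: (keeps chatting at conversational latency)")
        await asyncio.sleep(0.5)

async def main():
    results = asyncio.Queue()
    await asyncio.gather(
        background_work("compare those two running shoes", results),
        foreground_loop(results),
    )

asyncio.run(main())
```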

How does pricing compare to OpenAI and Google's voice APIs?

Thinking Machines has not published pricing. OpenAI's Realtime API runs around $100 per million audio output tokens at the GPT-Realtime tier; Gemini Live is in the same ballpark, with discounts for batched access. A 276B-parameter MoE with 12B active sits in the bracket where per-minute hosted pricing is plausible but not guaranteed; expect Thinking Machines to launch with a usage-tier model similar to OpenAI's.