OpenAI released three new voice intelligence models through its Realtime API on May 7, 2026: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. For creators who produce podcasts, videos, or multilingual content, the update introduces practical workflow tools that were previously only available through third-party services or asynchronous batch processing.
The biggest shift is the availability of streaming speech-to-text at $0.017 per minute and live 70-language-to-13-language translation at $0.034 per minute, both accessible through the same Realtime API that already powers live voice applications. Creators who have been cobbling together Whisper batch jobs and external translation services now have a single, low-latency API endpoint for both tasks.
The Three New Models

GPT-Realtime-Whisper is a streaming speech-to-text model built for ultra-low latency. Unlike the original Whisper, which processes audio in chunks and returns results after the fact, GPT-Realtime-Whisper transcribes speech as it happens. That difference matters for live workflows: podcast recording with rolling transcripts, live captions during video streams, and real-time meeting notes that appear as you speak.
GPT-Realtime-Translate handles live speech translation from more than 70 input languages into 13 output languages. The use case for creators is direct: record a video in English and get a simultaneous text translation in Spanish, French, Japanese, or German, without post-production delay. According to 9to5Mac coverage, the model keeps pace with a natural speaking rate.
GPT-Realtime-2 is OpenAI's first voice model with GPT-5-class reasoning built in. Its context window expanded from 32K to 128K tokens, enabling longer conversations and multi-step task execution via voice commands. For creators, this is the backbone of voice-controlled creative tools: imagine directing an editing workflow through spoken commands that the model actually understands and executes.
Pricing Breakdown

| Model | What it does | Pricing | Best for |
|---|---|---|---|
| GPT-Realtime-Whisper | Streaming speech-to-text | $0.017/min | Podcasters, streamers, video producers |
| GPT-Realtime-Translate | Live speech translation (70+ languages in, 13 out) | $0.034/min | Multilingual content creators |
| GPT-Realtime-2 | GPT-5-class voice reasoning (128K context) | Per audio token (flat rate not published) | Voice-controlled app builders, creative tooling |
For a one-hour podcast episode, GPT-Realtime-Whisper transcription costs $1.02 (60 minutes × $0.017): less than most human transcription services and faster than waiting on a batch-processing pipeline.
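Because both per-minute models bill linearly, episode budgeting is simple multiplication. Here is a quick Python helper using the published per-minute rates above; GPT-Realtime-2 is omitted because its per-token rate is not a flat figure:

```python
# Back-of-the-envelope cost estimates for the per-minute models.
RATES_PER_MIN = {
    "gpt-realtime-whisper": 0.017,    # streaming speech-to-text
    "gpt-realtime-translate": 0.034,  # live speech translation
}

def estimate_cost(model: str, minutes: float) -> float:
    """Estimated USD cost for `minutes` of audio through `model`."""
    return round(RATES_PER_MIN[model] * minutes, 2)

print(estimate_cost("gpt-realtime-whisper", 60))    # 1.02
print(estimate_cost("gpt-realtime-translate", 60))  # 2.04
```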
How This Compares to Existing Tools
| Tool | Strengths | Gaps vs. OpenAI Realtime |
|---|---|---|
| Original OpenAI Whisper (batch) | High accuracy, free to self-host | Not real-time; requires post-processing |
| ElevenLabs Flows | Multi-modal pipeline, TTS and SFX | No live transcription; no translation |
| Inworld Realtime TTS 2 | Voice direction for character audio | Designed for characters, not creator workflows |
| Google Cloud Speech-to-Text | High accuracy, enterprise-grade | Separate product from translation; no voice reasoning layer |
| GPT-Realtime suite (new) | Transcription + translation + reasoning in one API | Developer integration required; no native app UI yet |
Creator Workflow: Live Podcast Transcription Setup
Here is how to add GPT-Realtime-Whisper to a podcast recording workflow using the OpenAI Realtime API (a code sketch follows the steps):
- Get API access: Log into your OpenAI account and navigate to the Playground to test GPT-Realtime-Whisper before integrating it into your setup. The Playground supports all three new models.
- Open an audio stream: Use the Realtime API WebSocket connection to send raw audio from your recording software (Audacity, Adobe Audition, etc.) as a live stream.
- Receive rolling transcripts: The API returns partial transcripts as you speak, which you can pipe into a text file, a teleprompter display, or a show notes template in real time.
- Post-process with GPT-Realtime-2: After recording, pass your transcript to GPT-Realtime-2 (or a standard GPT-5 call) to generate chapter markers, highlight timestamps, or extract key quotes for social media (see the post-processing sketch at the end of this section).
- Export the transcript: Clean the rolling transcript into a formatted document to publish alongside your episode; one step delivers SEO value, accessibility, and repurposing material.
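Here is a minimal Python sketch of the streaming step. The `websockets` package and the `input_audio_buffer.append` event follow the conventions of OpenAI's existing Realtime API; the `gpt-realtime-whisper` model string and the exact transcription event names are assumptions based on the announcement, so verify them against the current API reference before building:

```python
import asyncio, base64, json, os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-whisper"  # model name assumed
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

async def transcribe(pcm_chunks):
    """pcm_chunks: async iterator yielding raw 16-bit PCM audio chunks."""
    # `additional_headers` on websockets >= 14; `extra_headers` on older versions.
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:

        async def send_audio():
            async for chunk in pcm_chunks:
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",  # existing Realtime API event
                    "audio": base64.b64encode(chunk).decode(),
                }))

        async def read_transcripts():
            async for raw in ws:
                event = json.loads(raw)
                # Partial transcripts arrive as delta events; pipe them to a
                # text file, teleprompter display, or show-notes template.
                if event.get("type", "").endswith("transcription.delta"):
                    print(event.get("delta", ""), end="", flush=True)

        await asyncio.gather(send_audio(), read_transcripts())
```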
The per-minute pricing means a 60-minute episode costs $1.02 to transcribe live. See how ElevenLabs Flows handles multi-modal pipeline production for a different approach to the same workflow challenge.
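For the post-processing step, any text-capable model works once the transcript exists. A sketch using the official `openai` Python client; the `gpt-5` model string is a placeholder inferred from the article's "GPT-5-class" description, not a confirmed identifier:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def make_show_notes(transcript: str) -> str:
    """Generate chapter markers and pull quotes from a finished transcript."""
    response = client.chat.completions.create(
        model="gpt-5",  # placeholder model name, not confirmed by the announcement
        messages=[
            {"role": "system",
             "content": "You turn podcast transcripts into show notes."},
            {"role": "user",
             "content": "Generate chapter markers with timestamps and three "
                        "pull quotes for social media:\n\n" + transcript},
        ],
    )
    return response.choices[0].message.content

with open("episode_transcript.txt") as f:
    print(make_show_notes(f.read()))
```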
Creator Workflow: Multilingual Video Dubbing

GPT-Realtime-Translate is the more ambitious offering for creators with international audiences. The 70-input-to-13-output language matrix covers most major creator markets. A practical workflow (a code sketch follows the steps):
- Record your video in your primary language (English, Spanish, etc.).
- Stream the audio through GPT-Realtime-Translate during recording or in a post-production pass.
- Receive a live text translation, which you can use for subtitles and closed captions or feed into a TTS model for dubbed audio tracks.
- Pair with a voice cloning tool (ElevenLabs, xAI Grok Voices) to generate a dubbed track in your voice in the target language.
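Here is a sketch of the post-production variant of step 2: read a finished WAV recording and stream it through the translate model in roughly real-time-sized chunks, collecting the translated text for subtitles. As with the transcription sketch above, the model string and event names are assumptions patterned on the existing Realtime API protocol:

```python
import asyncio, base64, json, os, wave

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-translate"  # model name assumed
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

async def translate_file(path: str, chunk_ms: int = 100) -> str:
    """Stream a mono 16-bit WAV file through the translate model."""
    translated = []
    with wave.open(path, "rb") as wav:
        frames_per_chunk = wav.getframerate() * chunk_ms // 1000
        async with websockets.connect(URL, additional_headers=HEADERS) as ws:
            # Send the recording in ~100 ms chunks, as a live client would.
            while chunk := wav.readframes(frames_per_chunk):
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(chunk).decode(),
                }))
            # Collect translation deltas. A production client would read
            # concurrently with sending; event names here are assumed.
            async for raw in ws:
                event = json.loads(raw)
                if event.get("type", "").endswith(".delta"):
                    translated.append(event.get("delta", ""))
                elif event.get("type", "").endswith(".done"):
                    break
    return "".join(translated)

print(asyncio.run(translate_file("episode.wav")))
```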
The live translation step was the missing link in that chain. Previously, creators needed to export, upload to a translation service, wait for results, and re-sync. According to TechCrunch's launch coverage, OpenAI is positioning this for education and creator platforms serving global audiences. The supported output languages include the dominant creator languages: English, Spanish, French, German, Japanese, Korean, and Portuguese.
What Is Not Here Yet
These are API models, not consumer apps. Creators who are not developers need to either wait for platforms to integrate these models or work with someone who can build the integration. There is no native Descript, Riverside.fm, or CapCut integration announced yet.
GPT-Realtime-2's pricing is per audio token rather than a flat rate, and the rate has not been published, which makes budgeting for longer voice sessions difficult. The Translate and Whisper models, billed per minute, are far easier to predict.
The 13 output languages for translation cover major markets but exclude several high-growth creator markets. OpenAI has not published the full list. Review the StartupHub AI coverage for early developer notes as the community maps out which language pairs perform best.
Related deep dives:
- ElevenLabs at $500M ARR: What It Means for Voice AI in Creator Workflows
- Inworld Realtime TTS 2: Voice Direction for AI Characters
- xAI Grok 4.3 API: Custom Voices and What Creators Can Build
What to Do Next
If you produce podcasts or record video content with spoken narration, GPT-Realtime-Whisper is the immediate action item. Test the model through the OpenAI Playground before building an integration: the Playground supports all three new models and provides a no-code test environment. For multilingual reach, map your target audience languages against the 13 supported output languages and check whether your primary markets are covered before investing in the translation pipeline.
Developers looking to integrate voice reasoning should review the blockchain.news summary of the Realtime API update for technical context on the expanded 128K context window in GPT-Realtime-2 and how it changes what voice applications can do end-to-end. The Forasoft production guide for the Realtime API covers WebSocket connection setup for those ready to start building.
Frequently Asked Questions
What is GPT-Realtime-Whisper and how is it different from the original Whisper?
GPT-Realtime-Whisper is a streaming speech-to-text model that transcribes audio in real time as you speak. The original Whisper processes audio in batches after recording ends. The new model delivers rolling transcripts during recording, making it useful for live captions, podcast session notes, and real-time editing workflows. It costs $0.017 per minute via the OpenAI Realtime API.
What languages does GPT-Realtime-Translate support?
GPT-Realtime-Translate accepts speech input in more than 70 languages and translates into 13 output languages. OpenAI has not published the complete list, but coverage includes the major creator markets: English, Spanish, French, German, Japanese, Korean, and Portuguese. It costs $0.034 per minute.
Do I need to be a developer to use these models?
Yes, for now. These are API-level models accessed through the OpenAI Realtime API WebSocket connection. Creators who are not developers need to wait for tools like Riverside.fm, Descript, or CapCut to integrate these models, or work with a developer to build a custom integration. You can test without coding in the OpenAI Playground.
How much does it cost to transcribe a one-hour podcast?
GPT-Realtime-Whisper costs $0.017 per minute, so 60 minutes of audio costs $1.02. That is price-competitive with batch transcription services, with the transcript delivered in real time during recording rather than after the session ends.
What is the difference between GPT-Realtime-2 and the other two models?
GPT-Realtime-2 is a reasoning model: it handles voice conversations that require complex decision-making, multi-step task execution, and maintaining long context (128K tokens). GPT-Realtime-Whisper and GPT-Realtime-Translate are specialized input/output tools (transcription and translation). For most creators, Whisper and Translate are the immediate workflow additions; GPT-Realtime-2 is the backend for voice-controlled applications built by developers.
When will these models appear in tools I already use?
No integration partnerships have been announced. The models are live in the OpenAI Realtime API and Playground as of May 7, 2026. Expect third-party tools to announce support over the next several months as developers build integrations.