xAI launched the Grok Voice Agent API on April 18, 2026, adding standalone speech-to-text and text-to-speech endpoints priced at $0.10 per hour for batch STT and $4.20 per million characters for TTS. The release gives developers the same audio stack that powers Grok Voice, Tesla in-car assistance, and Starlink support, with five voices, tool calling, and drop-in compatibility with the OpenAI Realtime API.

For the broader landscape, see our complete producer guide to AI music and audio in 2026.

What Happened

xAI opened public access to three audio APIs on Saturday: a Voice Agent API for building real-time voice bots, a standalone speech-to-text endpoint, and a text-to-speech endpoint. The Voice Agent API routes audio through a single voice-to-voice pipeline with sub-700 millisecond response latency and handles interruption detection server-side using voice activity detection. Developers can plug the models into LiveKit, Twilio, WebRTC, Voximplant, or Pipecat through ready-made templates, or consume them directly via the xAI HTTP API.

The product is documented at docs.x.ai/developers/model-capabilities/audio/voice. It launches with five voices: two female presets (Eve, Ara) and three male presets (Rex, Sal, Leo). Expressive speech tags let creators insert laughter, whispers, and pauses inside a single TTS prompt.

Why It Matters

The pricing undercuts ElevenLabs and matches Deepgram on the STT side while beating both on TTS for high-volume narration. At $4.20 per million characters, a 10,000-word podcast script, which runs to roughly 60,000 characters at about six characters per word including spaces, costs around 25 cents to voice. For creators building voice-first products, custom chatbots, or automated dubbing pipelines, xAI now sits next to ElevenLabs on-device voice and the larger voice-cloning field as a serious production option.
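To turn the quoted rates into a budget, a rough estimator like the one below can help. It is a minimal sketch that assumes the prices in this article apply linearly, with no minimum charge, rounding increments, or volume tiers, none of which the announcement specifies. The function and constant names are illustrative and are not part of any xAI SDK.

```python
from decimal import Decimal

# Rates as quoted in this article (USD). Verify against current pricing.
TTS_USD_PER_MILLION_CHARS = Decimal("4.20")
STT_USD_PER_HOUR = {
    "batch": Decimal("0.10"),
    "streaming": Decimal("0.20"),
}


def tts_cost_usd(characters: int) -> Decimal:
    """Estimated TTS cost for a given number of input characters."""
    if characters < 0:
        raise ValueError("characters must be non-negative")
    return TTS_USD_PER_MILLION_CHARS * characters / Decimal(1_000_000)


def stt_cost_usd(audio_hours: float, mode: str = "batch") -> Decimal:
    """Estimated STT cost for audio_hours of audio in 'batch' or 'streaming' mode."""
    if mode not in STT_USD_PER_HOUR:
        raise ValueError(f"unknown mode: {mode!r}")
    if audio_hours < 0:
        raise ValueError("audio_hours must be non-negative")
    # Convert via str() so float inputs such as 1.5 stay exact in Decimal.
    return STT_USD_PER_HOUR[mode] * Decimal(str(audio_hours))


if __name__ == "__main__":
    print("1M characters of TTS:", tts_cost_usd(1_000_000))
    print("100 hours of batch STT:", stt_cost_usd(100, "batch"))
    print("100 hours of streaming STT:", stt_cost_usd(100, "streaming"))
```

Decimal arithmetic is used so that small per-unit prices do not pick up floating-point noise when multiplied across large volumes.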

The OpenAI Realtime API spec compatibility is the bigger unlock. Teams already shipping on GPT-4o voice can swap the Grok endpoint in with minimal code changes, which creates genuine pricing pressure on the incumbents.

Key Details

STT: $0.10 per hour for batch transcription, $0.20 per hour for streaming. Word-level timestamps, multispeaker diarization, multichannel audio, and text formatting. Supports 25 languages.

TTS: $4.20 per million characters. 20-plus languages. Five voices. Expressive speech tags for laughter, whispers, and pauses let creators script emotional delivery inside the prompt.

Voice Agent API: Sub-700 millisecond end-to-end latency, function calling for CRMs, calendars, and REST or GraphQL endpoints, plus prebuilt tools for realtime web search and X post lookups. Server-side voice activity detection handles turn-taking and barge-ins without client logic.

Integrations: LiveKit plugin for Python, with Node support planned. Ready-made templates for Twilio, WebRTC, Voximplant, and Pipecat. OpenAI Realtime API-compatible wire format.

Compliance: SOC 2 Type II, HIPAA eligible, GDPR compliant. Multi-region infrastructure with custom SLAs for enterprise customers.

Benchmarks: Internal benchmarks cited by xAI show the models edging past ElevenLabs and Deepgram on accuracy and latency, though independent comparisons are not yet public.
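The function-calling flow listed under the Voice Agent API can be sketched on the client side as a tool declaration plus a small dispatcher. This is an illustration only: the tool schema and the reply event follow the OpenAI Realtime conventions, on the assumption that xAI's compatible format accepts the same shapes. The `lookup_order` tool, its stubbed result, and the simplified incoming call dictionary are hypothetical.

```python
import json
from typing import Any, Callable, Dict

# Tool declaration in the OpenAI Realtime style (assumed to carry over).
LOOKUP_ORDER_TOOL = {
    "type": "function",
    "name": "lookup_order",
    "description": "Look up the status of a customer order by ID.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}


def lookup_order(order_id: str) -> Dict[str, Any]:
    """Hypothetical backend call; a real agent would query a CRM here."""
    return {"order_id": order_id, "status": "shipped"}


# Maps tool names announced to the model onto local Python callables.
HANDLERS: Dict[str, Callable[..., Any]] = {"lookup_order": lookup_order}


def handle_function_call(call: Dict[str, Any]) -> Dict[str, Any]:
    """Run the requested tool and build the event that returns its result.

    `call` is a simplified, hypothetical shape with 'call_id', 'name',
    and 'arguments' (a JSON string), as a realtime server might emit.
    """
    handler = HANDLERS.get(call["name"])
    if handler is None:
        result: Dict[str, Any] = {"error": f"unknown tool: {call['name']}"}
    else:
        args = json.loads(call["arguments"] or "{}")
        result = handler(**args)
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": call["call_id"],
            "output": json.dumps(result),
        },
    }
```

In a live session the returned dictionary would be serialized and sent back over the socket, after which the agent can speak an answer that uses the tool result.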

What to Do Next

Creators running paid voice stacks for narration, YouTube dubs, audiobook production, or customer support bots should run a side-by-side test before the next billing cycle. Because of the OpenAI Realtime API compatibility, swapping Grok into an existing agent can be as small as a base URL and API key change, though voice names and any provider-specific settings will still need checking. For new projects, LiveKit's Python plugin is the fastest path to a working prototype, and the expressive speech tags in TTS are worth experimenting with on podcast intros or AI character voice work.
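One way to keep such a side-by-side test cheap is to isolate the provider-specific parts of an agent in a single config object. The sketch below assumes both services accept the same Bearer-token header and a `session.update` event in the shape of the OpenAI Realtime beta API. The xAI WebSocket URL and the environment variable names are placeholders to confirm against the official docs, and the helper names are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RealtimeProvider:
    name: str
    ws_url: str       # WebSocket endpoint for the realtime session
    api_key_env: str  # environment variable expected to hold the API key


OPENAI = RealtimeProvider("openai", "wss://api.openai.com/v1/realtime", "OPENAI_API_KEY")
# Assumption: placeholder endpoint; confirm the real URL in the xAI docs.
XAI = RealtimeProvider("xai", "wss://api.x.ai/v1/realtime", "XAI_API_KEY")


def connection_settings(provider: RealtimeProvider, api_key: str) -> Dict[str, object]:
    """Everything a WebSocket client needs that differs per provider."""
    return {
        "url": provider.ws_url,
        "headers": {"Authorization": f"Bearer {api_key}"},
    }


def session_update_event(voice: str, instructions: str) -> str:
    """Provider-independent session config with server-side VAD turn detection."""
    event = {
        "type": "session.update",
        "session": {
            "voice": voice,
            "instructions": instructions,
            "turn_detection": {"type": "server_vad"},
        },
    }
    return json.dumps(event)
```

With this split, the test harness passes `OPENAI` or `XAI` to `connection_settings` and leaves the rest of the agent code untouched. Only the voice name (for example "Ara" on the Grok side) has to change per provider.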

The audio capabilities docs are live now. Usage-based billing applies from launch, and API keys work immediately with no waitlist.