OpenAI Ships Voice Models That Reason in Real Time

GPT-Realtime-2 brings GPT-5-class reasoning into live voice for the first time

On May 7, OpenAI launched three audio models through its Realtime API, moving real-time voice from simple call-and-response toward something that can reason, translate, and take action as a conversation unfolds.

The flagship model, GPT-Realtime-2, is OpenAI's first voice model to run GPT-5-class reasoning. Its context window expands from 32K to 128K tokens, making longer sessions and multi-step agentic workflows viable without external memory scaffolding. Developers can configure reasoning effort across five levels from minimal to xhigh, trading latency for depth depending on the task. The model supports parallel tool calls mid-conversation, can verbalize what it is doing ("checking your calendar," "looking that up now"), and recovers from failures with natural fallback language rather than breaking the conversation.

On Big Bench Audio, a benchmark for audio intelligence, GPT-Realtime-2 at high reasoning effort scores 15.2% higher than its predecessor, GPT-Realtime-1.5. On Audio MultiChallenge, which evaluates multi-turn conversational intelligence including instruction following and context integration, GPT-Realtime-2 at xhigh reasoning effort scores 13.8% higher than GPT-Realtime-1.5.

Early production results from enterprise partners are more concrete. Zillow tested GPT-Realtime-2 on a voice assistant for property searches and reported a 26-point lift in call success rate on its hardest adversarial benchmark, from 69% to 95%. BolnaAI, which builds voice agents for the Indian market, reported word error rates for Hindi, Tamil, and Telugu that were 12.5% lower than with any other model it tested.
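A "26-point lift" is an absolute gain in percentage points, which is not the same as a relative improvement. The short Python sketch below works through both readings using only the Zillow figures quoted above. The function names are illustrative and do not come from either company.

```python
def point_lift(before_pct: float, after_pct: float) -> float:
    """Absolute change, in percentage points."""
    return after_pct - before_pct


def relative_lift(before_pct: float, after_pct: float) -> float:
    """Change expressed as a percentage of the starting value."""
    return (after_pct - before_pct) / before_pct * 100


# Zillow's reported call success rates, as quoted in the article.
before, after = 69.0, 95.0

print(f"Absolute lift: {point_lift(before, after):.0f} points")
print(f"Relative lift: {relative_lift(before, after):.1f}%")
```

On these numbers, the 26-point gain works out to a relative improvement of roughly 38% over the earlier success rate. The benchmark percentages in the article do not specify which of the two readings they use.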

The Realtime API goes to production as the enterprise use case locks in

The two companion models address the parts of voice AI that reasoning alone cannot solve. GPT-Realtime-Translate handles live multilingual conversation, supporting 70+ input languages and 13 output languages at conversation speed. Deutsche Telekom is using it to build customer support where callers speak in their preferred language and the model translates in real time. GPT-Realtime-Whisper streams speech-to-text as the speaker talks, aimed at meeting transcription, live captions, and follow-up workflows in customer support, healthcare, and recruiting.

With this release, according to OpenAI's announcement, the Realtime API moves out of beta and into general availability. GPT-Realtime-2 is priced at $32 per million audio input tokens and $64 per million audio output tokens. GPT-Realtime-Translate costs $0.034 per minute, and GPT-Realtime-Whisper costs $0.017 per minute.
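To make the quoted prices concrete, here is a rough cost estimator in Python. It is a sketch built only on the four rates above. The token counts and call lengths in the example are hypothetical, since the article does not say how many audio tokens a minute of speech consumes, and any other charges (text tokens, for instance) are not covered.

```python
from decimal import Decimal

# Rates as quoted in the article, in US dollars.
AUDIO_INPUT_PER_MILLION = Decimal("32")    # GPT-Realtime-2, audio input tokens
AUDIO_OUTPUT_PER_MILLION = Decimal("64")   # GPT-Realtime-2, audio output tokens
TRANSLATE_PER_MINUTE = Decimal("0.034")    # GPT-Realtime-Translate
WHISPER_PER_MINUTE = Decimal("0.017")      # GPT-Realtime-Whisper


def realtime_audio_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Cost of GPT-Realtime-2 audio tokens at the quoted per-million rates."""
    total = (Decimal(input_tokens) * AUDIO_INPUT_PER_MILLION
             + Decimal(output_tokens) * AUDIO_OUTPUT_PER_MILLION)
    return total / Decimal(1_000_000)


def per_minute_cost(minutes: int, rate_per_minute: Decimal) -> Decimal:
    """Cost of a per-minute model for a given number of minutes."""
    return Decimal(minutes) * rate_per_minute


# Hypothetical workload: the token counts below are illustrative only.
print(realtime_audio_cost(input_tokens=1_000_000, output_tokens=500_000))
print(per_minute_cost(60, TRANSLATE_PER_MINUTE))
print(per_minute_cost(60, WHISPER_PER_MINUTE))
```

`Decimal` is used instead of floats so that small per-minute rates add up without rounding drift. At the quoted rates, an hour of translation comes to about $2.04 and an hour of streaming transcription to about $1.02.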

The practical constraint on voice AI has never been whether it could sound natural. It has been whether it could reason. GPT-Realtime-2 is the first signal that the reasoning layer has arrived in the voice stack, and the enterprise deployments going live around it suggest the production use case was waiting for exactly this.

Santage is committed to independent, transparent journalism. This article is produced in accordance with Santage's Editorial Standards and aims to provide accurate and timely information. Readers are encouraged to verify information independently.