
How OpenAI Achieves Low Latency Voice AI at Global Scale

Published May 5, 2026

Updated May 10, 2026

OpenAI Unveils Low-Latency Voice AI Infrastructure

Voice AI only feels natural if the conversation moves at the speed of human speech, yet achieving this at a global scale presents immense technical hurdles. OpenAI recently detailed its rearchitected WebRTC stack, designed to deliver crisp, real-time interactions for over 900 million weekly active users. By moving away from traditional media termination models, the engineering team has successfully minimized the awkward pauses and clipped interruptions that often plague network-dependent AI.

The core of this evolution is a shift from a standard "one-port-per-session" model to a sophisticated split relay and transceiver architecture. In the early days of ChatGPT voice, OpenAI utilized a single Go service built on the Pion library to handle both signaling and media termination. However, as the platform scaled, this approach collided with the realities of Kubernetes environments, where managing tens of thousands of public UDP ports became an operational nightmare.

To solve the "port exhaustion" problem, OpenAI engineers introduced a lightweight relay layer. This relay service, written in Go, acts as a high-performance UDP forwarder that neither decrypts media nor participates in codec negotiation. Instead, it reads just enough of the initial STUN packet to identify the "ufrag" (the ICE username fragment), which serves as a protocol-native routing hook, and then steers the packet to the correct stateful transceiver. This allows the entire fleet to sit behind a small, fixed set of public IP addresses while maintaining stable session ownership.
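The ufrag lookup described above can be sketched in a few lines of Go. This is a minimal illustration, not OpenAI's relay code: it assumes the standard STUN layout (20-byte header, magic cookie 0x2112A442, USERNAME attribute type 0x0006) and the ICE convention that a connectivity check carries a USERNAME of the form "receiverUfrag:senderUfrag". The names serverUfrag and buildBindingRequest, the routes table, and the addresses in it are hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"strings"
)

const (
	stunHeaderLen   = 20
	stunMagicCookie = 0x2112A442
	attrUsername    = 0x0006
)

var (
	errNotSTUN    = errors.New("not a well-formed STUN message")
	errNoUsername = errors.New("no usable USERNAME attribute")
)

// serverUfrag walks the STUN attributes and returns the part of USERNAME
// before the colon, which in an ICE check is the receiving side's ufrag.
// It never touches media payloads or keys.
func serverUfrag(pkt []byte) (string, error) {
	if len(pkt) < stunHeaderLen || pkt[0]&0xC0 != 0 {
		return "", errNotSTUN
	}
	if binary.BigEndian.Uint32(pkt[4:8]) != stunMagicCookie {
		return "", errNotSTUN
	}
	msgLen := int(binary.BigEndian.Uint16(pkt[2:4]))
	if msgLen%4 != 0 || stunHeaderLen+msgLen > len(pkt) {
		return "", errNotSTUN
	}
	attrs := pkt[stunHeaderLen : stunHeaderLen+msgLen]
	for len(attrs) >= 4 {
		typ := binary.BigEndian.Uint16(attrs[0:2])
		n := int(binary.BigEndian.Uint16(attrs[2:4]))
		if 4+n > len(attrs) {
			return "", errNotSTUN
		}
		if typ == attrUsername {
			ufrag, _, ok := strings.Cut(string(attrs[4:4+n]), ":")
			if !ok || ufrag == "" {
				return "", errNoUsername
			}
			return ufrag, nil
		}
		next := 4 + ((n + 3) &^ 3) // attribute values are padded to 4 bytes
		if next > len(attrs) {
			break
		}
		attrs = attrs[next:]
	}
	return "", errNoUsername
}

// buildBindingRequest builds a bare Binding request carrying only USERNAME.
// It exists purely so the sketch can be exercised without a real client.
func buildBindingRequest(username string) []byte {
	padded := (len(username) + 3) &^ 3
	pkt := make([]byte, stunHeaderLen+4+padded)
	binary.BigEndian.PutUint16(pkt[0:2], 0x0001) // Binding request
	binary.BigEndian.PutUint16(pkt[2:4], uint16(4+padded))
	binary.BigEndian.PutUint32(pkt[4:8], stunMagicCookie)
	binary.BigEndian.PutUint16(pkt[20:22], attrUsername)
	binary.BigEndian.PutUint16(pkt[22:24], uint16(len(username)))
	copy(pkt[24:], username)
	return pkt
}

func main() {
	// Hypothetical routing table: ufrag -> transceiver address.
	routes := map[string]string{"srvA": "10.0.0.7:40000"}

	ufrag, err := serverUfrag(buildBindingRequest("srvA:cli9"))
	if err != nil {
		fmt.Println("drop:", err)
		return
	}
	fmt.Println(ufrag, "->", routes[ufrag])
}
```

A real relay would presumably remember the sender's address after this first lookup so that later packets from the same flow can be forwarded without parsing; the article does not spell out that mechanism, so it is left out here.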

Performance was further bolstered through several Linux-level optimizations. The team utilized SO_REUSEPORT to distribute incoming UDP packets across multiple relay workers, avoiding the bottlenecks typically associated with a single read-loop. By pinning goroutines to specific OS threads using runtime.LockOSThread, OpenAI improved cache locality and reduced the overhead of context switching. These efficiency measures allow the system to process massive amounts of global media traffic with a relatively small relay footprint.

Beyond the internal routing logic, OpenAI deployed this pattern globally to tackle "first-hop" latency. By using geo-steering for signaling, a user's request is directed to a nearby transceiver cluster, while the media enters the network at a Global Relay ingress point close to the user's actual location. This reduces jitter and packet loss, ensuring that whether a user is in New York or Tokyo, the AI responds with the immediacy of a face-to-face conversation.
