How Real-Time Voice Translation Works: The Tech Behind Under-1-Second Latency
Most translation apps feel like sending a postcard: you write your message, send it, and wait for a reply. Real-time voice translation is more like a phone call — continuous, responsive, and fast enough that the other person doesn't notice any delay.
Here's how that's technically possible, and why older architectures couldn't do it.
Why traditional translation apps are slow
A conventional translation app on your phone follows a sequential pipeline:
1. You finish speaking and tap a button
2. The audio recording is compressed and sent to a speech recognition service
3. The text transcript is returned
4. The transcript is sent to a translation API
5. The translated text is returned
6. The translated text is sent to a text-to-speech service
7. The audio file is returned and played
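To make the cost concrete, here's a back-of-the-envelope latency budget for that pipeline. The per-step figures below are hypothetical placeholders, not measurements of any particular service:

```python
# Hypothetical per-step latencies for the sequential pipeline above (ms).
steps_ms = {
    "upload recorded audio":    400,
    "speech recognition":      1200,
    "receive transcript":       100,
    "translation API":          800,
    "receive translation":      100,
    "text-to-speech":          1000,
    "download and play audio":  400,
}

# Sequential: every step waits for the previous one, so latencies add up.
total_ms = sum(steps_ms.values())
print(f"total: {total_ms} ms")  # 4000 ms -- inside the 3-8 s range
```

Shave any single step and the total barely moves; the problem is the summing itself.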
Each step adds latency. Depending on your connection and the services involved, this pipeline typically takes 3–8 seconds per phrase. That's fine for translating a written document. For live spoken conversation, it's too slow to feel natural — you're constantly aware of the mechanics.
The WebSocket streaming architecture
Speasy uses Google Gemini's Live API, which works through a persistent WebSocket connection rather than discrete HTTP requests.
A WebSocket is a type of network connection that stays open between your phone and the server — unlike a standard HTTP request, which opens, delivers data, and closes. With a persistent connection, data can flow continuously in both directions without the overhead of establishing a new connection for each exchange.
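The saving is easy to quantify with illustrative numbers. The handshake cost below is a hypothetical figure, and this ignores HTTP keep-alive (which mitigates, but doesn't remove, per-request overhead):

```python
HANDSHAKE_MS = 150  # hypothetical cost of TCP + TLS setup per new connection
EXCHANGES = 20      # round trips in a short stretch of conversation

# A fresh HTTP connection per exchange pays the handshake every time;
# a persistent WebSocket pays it once, then data flows freely both ways.
http_overhead_ms = EXCHANGES * HANDSHAKE_MS
websocket_overhead_ms = HANDSHAKE_MS

print(http_overhead_ms, websocket_overhead_ms)  # 3000 150
```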
In practice, this means audio is streamed to Gemini in real time as you speak, rather than being recorded, batched, and sent after you finish. Gemini processes the stream continuously.
Parallel processing: the key to sub-second latency
The other critical difference is that Google Gemini handles speech recognition, translation, and speech synthesis as a unified pipeline rather than sequential steps.
In a traditional system, step 2 (recognition) must complete before step 4 (translation) can start, which must complete before step 6 (synthesis) can start. Each step waits for the previous one.
Gemini's Live API processes all three stages in parallel as the audio arrives. As speech is being recognised, translation is being generated. As the translation is being generated, synthesis (the AI speaking the result) has already begun. By the time you've finished your sentence, the translated speech is already playing.
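The effect of pipelining can be simulated in a few lines of Python: three stages connected by queues, each forwarding chunks downstream as soon as it finishes them. The stage delays are arbitrary stand-ins, not real model timings:

```python
import asyncio
import time

async def stage(inbox, outbox, delay):
    # Consume chunks as they arrive and forward them downstream immediately.
    while (item := await inbox.get()) is not None:
        await asyncio.sleep(delay)  # simulated per-chunk processing time
        if outbox is not None:
            await outbox.put(item)
    if outbox is not None:
        await outbox.put(None)      # propagate end-of-stream

async def run_pipeline(n_chunks=10, delay=0.01):
    q1, q2, q3 = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    stages = asyncio.gather(
        stage(q1, q2, delay),    # "recognise"
        stage(q2, q3, delay),    # "translate"
        stage(q3, None, delay),  # "synthesise"
    )
    start = time.monotonic()
    for chunk in range(n_chunks):
        await q1.put(chunk)      # audio chunks arriving from the mic
    await q1.put(None)
    await stages
    return time.monotonic() - start

elapsed = asyncio.run(run_pipeline())
sequential = 10 * 3 * 0.01       # what strict step-by-step would cost
print(f"pipelined {elapsed:.2f}s vs sequential {sequential:.2f}s")
```

Because the stages overlap, total time is roughly the length of the audio plus a small pipeline-fill delay, not the sum of all three stages.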
The result is end-to-end latency under 1 second — typically 600–900ms in normal network conditions.
What happens on the phone during this
On the device side, Speasy is continuously capturing PCM audio from the microphone at 16kHz — a standard sampling rate for voice — and encoding it in 100ms chunks. These chunks are base64-encoded and sent over the WebSocket as fast as they're captured.
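The arithmetic works out neatly: at 16kHz with 16-bit samples, a 100ms chunk is 1,600 samples, or 3,200 bytes. A sketch of the capture side (the `ws` and `mic` objects are hypothetical stand-ins for the real socket and microphone, not Speasy's actual code):

```python
import base64

SAMPLE_RATE = 16_000   # Hz
BYTES_PER_SAMPLE = 2   # 16-bit PCM
CHUNK_MS = 100
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000  # 3200

def chunk_pcm(pcm: bytes):
    """Split a raw capture buffer into 100 ms chunks, base64-encoded."""
    for i in range(0, len(pcm), CHUNK_BYTES):
        yield base64.b64encode(pcm[i:i + CHUNK_BYTES]).decode("ascii")

async def stream_audio(ws, mic):
    # ws: an open WebSocket; mic: an async iterator of raw PCM buffers.
    async for buffer in mic:
        for chunk in chunk_pcm(buffer):
            await ws.send(chunk)  # sent as captured -- never batched
```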
When Gemini returns translated audio, it arrives as a stream of PCM audio data which Speasy immediately queues for playback. From your perspective, the translation starts playing almost as soon as you stop speaking.
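On the receiving side, the shape of the code is a simple buffer sitting between the socket and the audio output. Again a sketch under assumed names, not Speasy's actual implementation:

```python
from collections import deque

FRAME_BYTES = 3200  # 100 ms of 16-bit PCM at 16 kHz

class PlaybackQueue:
    """Minimal sketch: buffer incoming PCM chunks for the audio device."""

    def __init__(self):
        self._chunks = deque()

    def on_audio(self, pcm: bytes):
        # Called as translated audio chunks arrive off the WebSocket.
        self._chunks.append(pcm)

    def next_frame(self) -> bytes:
        # Called by the audio output callback; silence if we've run dry.
        return self._chunks.popleft() if self._chunks else b"\x00" * FRAME_BYTES
```

Playback can start as soon as the first chunk lands, which is why the translation begins before the full response has even arrived.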
Meanwhile, the app also receives text transcriptions of both the original speech and the translation, which are displayed on screen. These are generated as a byproduct of the same pipeline — no additional API call required.
How the AI knows which language to translate from
Speasy doesn't require you to manually switch between languages during a conversation. When you set up a session, you specify two languages (e.g. English and Spanish). The Gemini model is instructed to detect which of those two languages is being spoken and translate to the other one automatically.
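Conceptually, the session setup looks something like the following. The field names and model identifier here are illustrative placeholders, not the actual Gemini Live API schema:

```python
# Illustrative session config -- not the real Gemini Live API schema.
session_config = {
    "model": "gemini-live",  # placeholder model identifier
    "system_instruction": (
        "You are a live interpreter between English and Spanish. "
        "Detect which of the two languages is being spoken and respond "
        "with the same content in the other language. Preserve tone "
        "and register; do not add commentary."
    ),
    "input_audio": {"encoding": "pcm16", "sample_rate_hz": 16_000},
}
```

The key point is that the language pair is fixed per session, so the model never has to guess among dozens of candidates mid-conversation.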
Because detection is constrained to the two configured languages, it's fast and accurate across the 42 languages Speasy supports. It handles mid-sentence language switching (common in multilingual families, or when one speaker knows a bit of the other language) without confusion.
What the model is actually doing
Gemini Live processes audio directly rather than converting to text first. This is different from older architectures where speech recognition produced a transcript, which was then translated as text. Processing audio natively means Gemini preserves elements of speech that text doesn't capture: tone, emphasis, speed, hesitation. These cues affect meaning and register — a hesitant question sounds different from a confident statement, even if the words are the same.
The result is translation that's closer to how a human interpreter works: reading not just the words but the delivery.
What limits it
Two things: network quality and background noise.
WebSocket streaming requires a stable connection. On poor mobile data (an EDGE connection, weak hotel Wi-Fi), the stream can lag or drop entirely. Speasy reconnects automatically when the connection recovers, but translation won't work offline.
Background noise affects the input signal. Gemini's speech recognition is robust in normal conditions, but in a very loud environment (stadium, club, heavy machinery), the recognition accuracy drops. For noisy situations, typed input is more reliable.
See it in action — 3 free minutes, no card needed
Speasy uses the Google Gemini Live API on iPhone. 42 languages. Start immediately. Download Free on the App Store