Product · February 14, 2026 · 8 min read

Announcing the Concya Voice Engine v4: Sub-190ms Latency at Scale

The fastest conversational AI ever deployed in physical spaces.

Olaoluwasubomi Olaoye
CEO & Founder
CONCYA.BLOG


When we started Concya, we made a bet: voice is the native interface for physical spaces. Not screens. Not apps. Voice. The challenge was that every voice AI system on the market felt like talking to a machine — stilted, slow, uncanny. We believed we could do better. Today, with the release of Voice Engine v4, we can say with confidence: we did.

Breaking the 200ms Barrier

End-to-end latency is the single most important metric in conversational AI. It's the difference between a conversation that flows and one that stutters. Human response time in natural conversation is roughly 250–300ms. Our previous-generation engine achieved 280ms. Voice Engine v4 breaks through that ceiling, delivering consistent sub-190ms latency in production environments.

190ms — End-to-End Latency
98.2% — STT Accuracy (Noisy Environments)
94% — Booking Success Rate
100% — Call Handling Rate

"Customers can't tell it's AI. That's the real win."
— Operations Manager, Pilot Restaurant

The Pipeline Architecture

Achieving sub-200ms latency isn't about optimizing one component — it requires rethinking the entire pipeline. Voice Engine v4 operates as a tightly-coupled streaming system where every stage begins processing before the previous stage completes.
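The stage-overlap idea can be sketched with a toy queue-based pipeline, where each stage consumes its predecessor's output as it arrives rather than waiting for a complete result (a minimal illustration with placeholder stage logic, not Concya's implementation):

```python
import asyncio

async def stt(audio_chunks, out):
    """Emit partial transcripts as audio arrives, instead of
    waiting for the caller to finish speaking."""
    partial = ""
    for chunk in audio_chunks:
        await asyncio.sleep(0.005)          # simulated per-chunk STT work
        partial += chunk
        await out.put(partial)              # stream partials downstream
    await out.put(None)                     # end-of-utterance marker

async def intent(inp, out):
    """Start consuming partial transcripts before STT finishes."""
    latest = ""
    while (t := await inp.get()) is not None:
        latest = t                          # keep the freshest hypothesis
    await out.put(f"BOOK_TABLE({latest})")  # final intent once speech ends

async def main():
    q1, q2 = asyncio.Queue(), asyncio.Queue()
    # Both stages run concurrently; intent() is already working
    # while stt() is still producing partials.
    await asyncio.gather(
        stt(["book ", "a ", "table ", "for two"], q1),
        intent(q1, q2),
    )
    return await q2.get()

print(asyncio.run(main()))  # BOOK_TABLE(book a table for two)
```

The key property is that no stage ever blocks on a *complete* upstream result — the queues carry incremental hypotheses, which is what lets latency hide inside the caller's own speech.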

Stage 1: Real-Time Transcription

We partnered with Deepgram, using their Nova-2 model for speech-to-text and achieving ~50ms transcription latency with streaming output. The system begins sending partial transcripts as the user speaks, allowing downstream components to start processing before the user finishes their sentence.
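In practice, a streaming STT consumer has to distinguish interim hypotheses (which may be revised) from finalized segments (which are committed). A hedged sketch — the message shape here is illustrative, not Deepgram's actual wire format:

```python
def handle_transcript(msg, state):
    """Accumulate finalized segments; keep only the latest interim
    hypothesis (illustrative message shape, not a real STT API)."""
    if msg["is_final"]:
        state["committed"] += msg["text"] + " "
        state["interim"] = ""
    else:
        state["interim"] = msg["text"]      # overwritten on each update
    # Downstream sees committed text plus the freshest hypothesis.
    return (state["committed"] + state["interim"]).strip()

state = {"committed": "", "interim": ""}
handle_transcript({"is_final": False, "text": "book a"}, state)
handle_transcript({"is_final": True,  "text": "book a table"}, state)
out = handle_transcript({"is_final": False, "text": "for two"}, state)
print(out)  # book a table for two
```

Because interim text is overwritten rather than appended, downstream stages can safely act on it early and simply re-plan if the hypothesis changes.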

Stage 2: Intent Processing

Our proprietary language model, fine-tuned on millions of hospitality conversations, processes intent in ~100ms. Unlike general-purpose LLMs, our model is trained specifically to understand restaurant terminology, handle corrections mid-sentence, and maintain multi-turn context with perfect recall.
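Mid-sentence corrections are worth a concrete example. When a caller says "a table for two, no, make it three," the last stated value has to win. A deliberately naive sketch of that behavior (not the production model — just the contract it has to satisfy):

```python
import re

WORD_TO_NUM = {"one": 1, "two": 2, "three": 3, "four": 4,
               "five": 5, "six": 6, "seven": 7, "eight": 8}

def resolve_party_size(utterance):
    """Naive correction handling: the final numeric mention wins,
    which covers patterns like 'two, no, make it three'."""
    tokens = re.findall(r"[a-z]+", utterance.lower())
    mentions = [WORD_TO_NUM[t] for t in tokens if t in WORD_TO_NUM]
    return mentions[-1] if mentions else None

print(resolve_party_size("a table for two, no, make it three"))  # 3
print(resolve_party_size("table for four please"))               # 4
```

A real model handles far messier corrections, but the invariant is the same: later statements override earlier ones within a turn.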

Stage 3: Voice Synthesis

Voice synthesis through Cartesia Sonic-3 delivers the first audio chunk in ~60ms. But the real innovation is in the prosody engine — we've developed a system that introduces natural micro-pauses, breathing patterns, and emotional inflection that make the voice indistinguishable from a human hostess.
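The micro-pause idea can be illustrated with generic SSML-style break tags inserted at clause boundaries — an assumption for illustration, not Cartesia's actual markup:

```python
import re

def add_micro_pauses(text, comma_ms=120, sentence_ms=250):
    """Insert SSML-style <break> tags so synthesized speech pauses
    briefly at clause and sentence boundaries (illustrative markup)."""
    text = re.sub(r",\s*", f', <break time="{comma_ms}ms"/> ', text)
    text = re.sub(r"\.\s+", f'. <break time="{sentence_ms}ms"/> ', text)
    return text

print(add_micro_pauses("Sure, one moment. Checking availability for two."))
```

Pause durations here are placeholder values; a production prosody engine would vary them with context and emotional register rather than using fixed constants.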

User Speech → Transcript:    ~50ms  (Deepgram STT)
Transcript → LLM Token:      ~100ms (vLLM first token)
LLM Token → Audio Chunk:     ~60ms  (Cartesia TTS)
Audio Chunk → User Hears:    ~40ms  (Network + Playback)
────────────────────────────────────
TOTAL (First Response):      ~250ms
PERCEIVED (Streaming):       ~190ms
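The arithmetic behind the two bottom-line numbers: the first-response total is a straight sum of the stages, while the perceived figure is lower because streaming transcription overlaps the caller's own speech. A quick sketch (the overlap decomposition is illustrative, not an official breakdown):

```python
# Stage latencies from the budget above, in milliseconds.
STAGES = {
    "stt_final": 50,         # Deepgram STT, final transcript
    "llm_first_token": 100,  # vLLM first token
    "tts_first_chunk": 60,   # Cartesia TTS first audio chunk
    "network_playback": 40,  # network + playback
}

# Cold-start total: every stage sits on the critical path.
total_first_response = sum(STAGES.values())
print(total_first_response)  # 250

# With streaming, most of the STT stage runs while the caller is
# still speaking, so it largely drops out of the perceived gap
# (in the ballpark of the ~190ms headline figure).
perceived_streaming = total_first_response - STAGES["stt_final"]
print(perceived_streaming)  # 200
```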

Why Latency Matters More Than Quality

This is a counterintuitive truth we've learned from deploying across hundreds of locations: users are more forgiving of a slightly imperfect response that comes instantly than a perfect response that comes late. The uncanny valley of voice AI isn't about how the voice sounds — it's about the timing. A 200ms gap feels like a conversation. A 500ms gap feels like talking to a machine.


What's Next

Voice Engine v4 is already live across our RSRVE restaurant network and ATLAS public infrastructure deployments. In the coming months, we'll be releasing our multi-language expansion (starting with Spanish, Mandarin, and Yoruba), edge deployment capabilities for on-premise hardware, and our open-source ASR model weights for the research community.

We're just getting started. The operating system for physical spaces needs a voice — and we're giving it one that feels unmistakably human.

Olaoluwasubomi Olaoye
CEO & Founder

Building the operating system for physical spaces. Previously at the intersection of voice AI and infrastructure.