A customer on a live call experienced 1.4 seconds of dead air after finishing a sentence, long enough for her to say "hello?" before the agent responded. The call's trace was open in Honeycomb within 40 seconds, yet every span appeared green. End-to-end p95 latency was 980ms, comfortably under budget, and no single span exceeded 400ms. The dashboard insisted everything was fine, even as the customer waited in silence.
The disconnect highlights a critical flaw in how voice systems are monitored. Traditional Application Performance Monitoring (APM) tools excel at tracking computational work but overlook the gaps between system components. In voice agents, these unattributed delays often occur during transitions between stages, such as from voice activity detection (VAD) to automatic speech recognition (ASR). Without dedicated instrumentation, these handoffs remain invisible, allowing performance issues to persist despite healthy-looking metrics.
The hidden latency budget in voice systems
Voice agent workflows involve multiple stages, each with its own performance characteristics. A typical turn begins with VAD detecting when a user stops speaking, followed by streaming the audio to ASR (e.g., Whisper Large v3), then processing the transcript through an LLM (e.g., gpt-4o-realtime), and finally generating a response via text-to-speech (TTS, e.g., ElevenLabs). Network latency also plays a role on both ends of the pipeline.
The per-stage span latencies recorded around the problematic call suggest a latency budget that turns out to be misleading when summed:
- VAD / turn-detection: p50 60ms, p95 120ms, p99 180ms
- ASR (streaming): p50 180ms, p95 310ms, p99 540ms
- LLM time-to-first-token (TTFT): p50 220ms, p95 380ms, p99 720ms
- LLM full response: p50 140ms, p95 260ms, p99 430ms
- TTS first byte: p50 90ms, p95 190ms, p99 360ms
- Network (both legs): p50 40ms, p95 90ms, p99 150ms
Summing the p95 values yields 1,350ms, while the reported end-to-end p95 was 980ms. Percentiles are not additive: the stages rarely all hit their slow tail on the same turn, so a sum of per-stage p95s tends to overstate the end-to-end p95. The real issue lies elsewhere, though: the 1.4-second gap wasn't captured in any span. This dead air occurred in the transition between VAD/turn-detection and ASR initiation, a handoff that traditional tracing tools ignore.
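The arithmetic is easy to check directly. The snippet below is a minimal sketch that sums each percentile column from the list above; the dictionary layout and the names `STAGE_LATENCY_MS` and `naive_budget` are illustrative, not part of any tracing tool.

```python
# Per-stage latency percentiles in milliseconds, copied from the list above.
STAGE_LATENCY_MS = {
    "vad_turn_detection": {"p50": 60, "p95": 120, "p99": 180},
    "asr_streaming": {"p50": 180, "p95": 310, "p99": 540},
    "llm_ttft": {"p50": 220, "p95": 380, "p99": 720},
    "llm_full_response": {"p50": 140, "p95": 260, "p99": 430},
    "tts_first_byte": {"p50": 90, "p95": 190, "p99": 360},
    "network_both_legs": {"p50": 40, "p95": 90, "p99": 150},
}

REPORTED_E2E_P95_MS = 980  # end-to-end p95 reported by the dashboard


def naive_budget(percentile: str) -> int:
    """Sum one percentile column across all stages (a naive stacked budget)."""
    return sum(stage[percentile] for stage in STAGE_LATENCY_MS.values())


summed_p95 = naive_budget("p95")
print(summed_p95)                        # 1350
print(summed_p95 - REPORTED_E2E_P95_MS)  # 370
```

The stacked p95 overshoots the reported end-to-end p95 by 370ms, which is what non-additive percentiles would lead one to expect. Neither number says anything about time that falls outside every span.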
Why traditional tracing fails voice systems
In a standard tracing interface, voice turns appear as waterfalls of colored bars, each representing a span. The natural instinct is to optimize the longest bar, assuming it indicates the bottleneck. In this case, efforts like improving ASR speed or reducing LLM time-to-first-token latency shaved milliseconds off individual stages, yet the dead air persisted. The issue wasn’t in the visible spans but in the whitespace between them—the handoff from turn-detection to ASR.
A whiteboard sketch finally illustrated the problem: the spans started 1,400ms after the user stopped talking. The traced stages were accurate and efficient, but the critical gap between turn-end and ASR-start went unmeasured. In production, this handoff ran through a queue that passed audio to the ASR client, which lazily established its streaming connection on first use. Under concurrent load, connection setups contended for the pool, which was sized for steady-state traffic rather than for bursts such as six calls ending their turns within 200ms of each other. The result? A silent stall that lasted 1.4 seconds.
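One way to surface this kind of whitespace is to compare span boundaries against the moment the user stopped talking. The following is a minimal sketch, not a feature of any tracing product: the `Span` record, the `unattributed_gaps` helper, and the sample timings (other than the 1,400ms offset described above) are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    name: str
    start_ms: float  # offset from turn-end, in milliseconds
    end_ms: float


def unattributed_gaps(turn_end_ms: float, spans: list[Span]) -> list[tuple[str, float]]:
    """Return (label, duration_ms) for stretches of time no span covers.

    Coverage is measured from turn_end_ms (when the user stopped talking)
    through the end of the last span.
    """
    gaps = []
    covered_until = turn_end_ms
    previous = "turn_end"
    for span in sorted(spans, key=lambda s: s.start_ms):
        if span.start_ms > covered_until:
            gaps.append((f"{previous} -> {span.name}", span.start_ms - covered_until))
        if span.end_ms > covered_until:
            covered_until = span.end_ms
            previous = span.name
    return gaps


# Hypothetical turn: the first span starts 1,400ms after turn-end.
spans = [
    Span("asr", 1400, 1580),
    Span("llm_ttft", 1580, 1800),
    Span("tts_first_byte", 1810, 1900),
]
for label, ms in unattributed_gaps(0, spans):
    print(f"{label}: {ms:.0f}ms")
```

A report like this only works if turn-end is recorded as its own timestamp; if the trace's origin is the first span, the largest gap has nothing to be measured against.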
Instrumenting the missing handoff
The solution involves two steps: measuring the handoff and addressing the root cause. First, a dedicated span must capture the transition from VAD/turn-detection to ASR initiation. This ensures the gap stops being invisible in dashboards. Second, the connection pool must be adjusted to handle burst loads, preventing contention during high-concurrency events.
Here's how the instrumentation step looks in Python with the OpenTelemetry API:
    from opentelemetry import trace
    from opentelemetry.trace import SpanKind, Status, StatusCode

    tracer = trace.get_tracer("voice.turn")

    # asr_client is the application's ASR client, assumed to be defined elsewhere.

    async def handle_turn(audio_in, ctx):
        # Outer span covers the entire turn, anchored at turn-end
        with tracer.start_as_current_span(
            "voice.turn",
            kind=SpanKind.SERVER,
        ) as turn_span:
            turn_span.set_attribute("call.id", ctx.call_id)
            turn_span.set_attribute("turn.index", ctx.turn_index)

            # The missing span: VAD/turn-detection to ASR-start handoff
            with tracer.start_as_current_span("voice.handoff.vad_to_asr") as hs:
                hs.set_attribute("handoff.from", "turn_detection")
                hs.set_attribute("handoff.to", "asr")
                try:
                    asr_stream = await asr_client.open_stream(ctx)
                except Exception as exc:
                    hs.set_status(Status(StatusCode.ERROR, str(exc)))
                    hs.record_exception(exc)
                    raise
                # Mark when audio begins flowing into ASR
                hs.add_event("asr_stream_ready")

This instrumentation ensures the handoff duration is tracked, making invisible delays visible. Without this step, voice systems risk delivering subpar user experiences while dashboards report healthy performance metrics.
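The second step, sizing the pool for bursts, can take several forms. One option is to open connections eagerly at startup so that a burst of turn-ends never waits on connection setup. The sketch below illustrates that idea with plain `asyncio`; `WarmPool` and `fake_open_stream` are hypothetical names, the pool size of six simply mirrors the burst described earlier, and this is not presented as the change that was actually deployed.

```python
import asyncio
from typing import Awaitable, Callable, Generic, TypeVar

T = TypeVar("T")


class WarmPool(Generic[T]):
    """Opens every connection up front so acquire() never pays setup cost."""

    def __init__(self, factory: Callable[[], Awaitable[T]], size: int) -> None:
        self._factory = factory
        self._size = size
        self._idle = asyncio.Queue()

    async def warm(self) -> None:
        # Open all connections concurrently at startup, not on first use.
        conns = await asyncio.gather(*(self._factory() for _ in range(self._size)))
        for conn in conns:
            self._idle.put_nowait(conn)

    async def acquire(self) -> T:
        # Waits only when every connection is already in use.
        return await self._idle.get()

    def release(self, conn: T) -> None:
        self._idle.put_nowait(conn)


async def demo():
    opened = 0

    async def fake_open_stream() -> str:
        # Stand-in for a real streaming-connection handshake.
        nonlocal opened
        opened += 1
        conn_id = opened
        await asyncio.sleep(0.01)
        return f"conn-{conn_id}"

    pool = WarmPool(fake_open_stream, size=6)  # sized for the burst, not the average
    await pool.warm()
    opened_before_burst = opened
    # Six turns end at once; each acquire is served from the warm pool.
    conns = await asyncio.gather(*(pool.acquire() for _ in range(6)))
    return opened_before_burst, sorted(conns)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Pre-warming trades idle connections for predictable handoff time. Whether that trade is worthwhile depends on the ASR provider's connection limits and idle timeouts, which this sketch does not model.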
Voice AI systems demand more than traditional APM. They require instrumentation that captures the gaps between components, where real-world latency often hides. By addressing these blind spots, teams can ensure their voice agents feel responsive—not just on paper, but in practice.