Latency Optimization
Latency is the single most important factor in voice agent quality. Users expect near-instant responses - anything over 1 second feels unnatural.
Understanding Voice Agent Latency
End-to-end latency is the time from when the user stops speaking to when they hear the bot's response:
User finishes speaking
│
├── VAD End-of-Speech Detection ────── ~50ms
│
├── Final STT Transcript ─────────── ~100-200ms
│
├── LLM First Token ──────────────── ~150-300ms
│
├── TTS First Audio Chunk ─────────── ~50-150ms
│
├── Network/Encoding Overhead ─────── ~20-50ms
│
▼
Bot starts speaking
Total: 370-750ms (target: <500ms)
Edesy vs Competitors
| Platform | Typical E2E Latency | Our Advantage |
|---|---|---|
| Edesy | 400-500ms | Optimized pipeline |
| Pipecat | 600-800ms | - |
| Retell AI | 700-1000ms | - |
| Vapi | 800-1200ms | - |
| Bland AI | 900-1500ms | - |
Optimization Strategies
1. STT Streaming with Interim Results
Don't wait for the user to finish speaking - start processing interim transcripts:
// Traditional approach (slow)
func processTraditional(audio []byte) string {
    // Wait for complete transcript
    transcript := stt.TranscribeFull(audio) // 500ms+
    response := llm.Generate(transcript)    // 300ms+
    return response
}
// Optimized approach (fast)
func processOptimized(interimChan <-chan string, finalChan <-chan string) string {
    for {
        select {
        case interim := <-interimChan:
            // Pre-warm LLM context with the interim transcript
            llm.PrepareContext(interim)
        case final := <-finalChan:
            // LLM context is already warm, so generation starts sooner
            return llm.Generate(final)
        }
    }
}
Impact: saves 100-200ms
2. LLM Provider Selection
Choose the right LLM for your latency requirements:
| Provider | Model | Time to First Token | Best For |
|---|---|---|---|
| Google | Gemini 2.5 Flash-Lite | ~100ms | Fastest, simple tasks |
| Google | Gemini 2.0 Flash | ~150ms | Good balance |
| OpenAI | GPT-4o-mini | ~180ms | Quality + speed |
| Google | Gemini Live | ~50ms | Native audio (no STT/TTS) |
| OpenAI | GPT-4o | ~250ms | Complex reasoning |
Recommendation: Use gemini-2.5-flash-lite for voice agents; switch to gpt-4o only for complex reasoning tasks.
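Model choice can also be made per turn rather than per agent. A minimal sketch, assuming a simple complexity heuristic; pickModel and its inputs are illustrative and not part of any provider SDK, only the model names come from the table above:
// Default to the fastest model and escalate only when the turn genuinely
// needs tool use or multi-step reasoning (the heuristic is illustrative).
const (
    fastModel      = "gemini-2.5-flash-lite" // ~100ms time to first token
    reasoningModel = "gpt-4o"                // ~250ms, reserved for complex turns
)

func pickModel(needsTools bool, reasoningSteps int) string {
    if needsTools || reasoningSteps > 2 {
        return reasoningModel
    }
    return fastModel
}
Keeping the fast model as the default means the extra latency of the larger model is only paid on the turns that actually need it.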
3. TTS Streaming
Don't wait for complete audio - stream chunks as they're generated:
// Traditional (slow)
func synthesizeTraditional(text string) []byte {
    return tts.SynthesizeFull(text) // Wait for complete audio
}

// Streaming (fast)
func synthesizeStreaming(text string) <-chan []byte {
    return tts.StreamSynthesize(text) // Get chunks as ready
}

// In practice
for chunk := range tts.StreamSynthesize(response) {
    callProvider.SendAudio(chunk) // Send immediately
}
Impact: saves 100-300ms
4. Reduce VAD End-of-Speech Delay
Tune VAD parameters for faster turn detection:
// Aggressive settings (faster, might cut off)
cfg := VADConfig{
    Threshold:            0.7,  // Lower = more sensitive
    MinSilenceDurationMs: 150,  // Shorter = faster end detection
    VolumeThreshold:      0.02, // Filter background noise
}

// Conservative settings (safer, slower)
cfg := VADConfig{
    Threshold:            0.9,
    MinSilenceDurationMs: 300,
    VolumeThreshold:      0.0, // No volume gating
}
Impact: saves 50-150ms
5. Use Gemini Live for Native Audio
Bypass STT and TTS entirely with audio-to-audio models:
Traditional Pipeline:
Audio → STT (150ms) → LLM (200ms) → TTS (100ms) → Audio
Total: ~450ms
Gemini Live Pipeline:
Audio → Gemini Live (200ms) → Audio
Total: ~200ms
Impact: saves 200-300ms
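One way the switch could look in an agent's pipeline configuration, as a minimal sketch; PipelineMode, PipelineConfig, and the field names are assumptions for illustration, not Edesy's actual schema:
// Illustrative pipeline modes: cascaded STT → LLM → TTS versus native speech-to-speech.
type PipelineMode string

const (
    ModeCascaded       PipelineMode = "cascaded"
    ModeSpeechToSpeech PipelineMode = "speech_to_speech"
)

type PipelineConfig struct {
    Mode  PipelineMode
    Model string // audio-native model used when Mode is speech_to_speech
}

// Native audio removes the STT and TTS hops from the critical path.
var liveConfig = PipelineConfig{
    Mode:  ModeSpeechToSpeech,
    Model: "gemini-live", // placeholder; substitute the current Gemini Live model name
}
The usual trade-off is less per-stage control (voice selection, custom STT vocabulary), so cascaded mode remains the right default when those matter.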
6. Connection Pooling
Reuse connections instead of creating new ones per call:
// Connection pool for providers
type ProviderPool struct {
    sttPool *ConnectionPool
    ttsPool *ConnectionPool
    llmPool *ConnectionPool
}

// Pre-warm connections on startup
func (p *ProviderPool) WarmUp() {
    p.sttPool.PreConnect(10) // 10 warm STT connections
    p.ttsPool.PreConnect(10) // 10 warm TTS connections
    p.llmPool.PreConnect(10) // 10 warm LLM connections
}
Impact: saves 50-100ms on the first message
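ConnectionPool is not defined above; one minimal way to back it, assuming each provider client exposes a dial function, is a buffered channel of warm connections (a sketch, not the actual implementation):
// Minimal pool: a buffered channel of pre-established connections.
type ConnectionPool struct {
    conns chan net.Conn
    dial  func() (net.Conn, error)
}

func NewConnectionPool(size int, dial func() (net.Conn, error)) *ConnectionPool {
    return &ConnectionPool{conns: make(chan net.Conn, size), dial: dial}
}

// PreConnect fills the pool so early calls skip the connection handshake.
func (p *ConnectionPool) PreConnect(n int) {
    for i := 0; i < n; i++ {
        if conn, err := p.dial(); err == nil {
            p.conns <- conn
        }
    }
}

// Get hands out a warm connection when one is available and dials otherwise.
func (p *ConnectionPool) Get() (net.Conn, error) {
    select {
    case conn := <-p.conns:
        return conn, nil
    default:
        return p.dial()
    }
}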
7. Greeting Audio Caching
Pre-generate and cache the greeting audio:
// At agent creation time
func cacheGreeting(agent *Agent) {
    audio := tts.Synthesize(agent.GreetingMessage)
    cache.Set("greeting:"+agent.ID, audio, 24*time.Hour)
}

// At call start time
func playGreeting(agent *Agent, output chan []byte) {
    audio, found := cache.Get("greeting:" + agent.ID)
    if found {
        output <- audio // Instant playback
        return
    }
    // Fallback to real-time synthesis
    output <- tts.Synthesize(agent.GreetingMessage)
}
Impact: saves 100-200ms on the first response
8. Prompt Optimization
Shorter prompts = faster LLM processing:
❌ Long prompt (slow):
"You are a customer support representative for Acme Corporation,
a leading provider of industrial equipment and supplies. Your role
is to assist customers with their orders, answer questions about
products, and help resolve any issues they may have. Always be
polite, professional, and helpful. If you don't know the answer,
say so and offer to connect them with a human agent..."
(500+ tokens)
✅ Optimized prompt (fast):
"You are Acme Corp support. Help with orders and products.
Be concise. Transfer to human if needed."
(50 tokens)
Impact: saves 50-100ms
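If system prompts are user-editable, an approximate length check at agent-save time helps catch regressions. A rough sketch: the 4-characters-per-token ratio is a crude heuristic for English, and the 200-token threshold is an arbitrary example:
// Rough estimate: English text averages roughly 4 characters per token.
func estimateTokens(prompt string) int {
    return len(prompt) / 4
}

// Warn when a system prompt is long enough to add noticeable
// time-to-first-token latency (the threshold is illustrative).
func warnIfPromptTooLong(prompt string) {
    if n := estimateTokens(prompt); n > 200 {
        log.Printf("system prompt is ~%d tokens; consider trimming it for latency", n)
    }
}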
9. Function Call Optimization
Pre-fetch data when intent is detected:
// When the user says "What's my order status?"
// call is the agent's per-call state (illustrative type) that can carry prefetched data.
func handleOrderStatusIntent(call *CallContext, orderID string, messages []Message) string {
    // Start fetching the data immediately; don't wait for the LLM to request it
    go func() {
        data := fetchOrderStatus(orderID)
        call.SetPrefetchedData("order_status", data)
    }()
    // The LLM uses the prefetched data if it has arrived by generation time
    return llm.Generate(call, messages)
}
Impact: saves 200-500ms on function calls
10. Regional Optimization
Deploy close to your telephony provider:
| Your Region | Recommended Provider | Deploy To |
|---|---|---|
| India | Exotel | Mumbai (AWS ap-south-1) |
| US | Twilio | Virginia (AWS us-east-1) |
| Europe | Twilio | Frankfurt (AWS eu-central-1) |
Impact: saves 20-100ms of network latency
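A quick way to validate placement is to measure connect time from the deployment region to the telephony provider's media edge. A sketch using a plain TCP dial as a rough proxy for network round-trip time; the address passed in is a placeholder for your provider's endpoint:
// Measure TCP connect time to the provider's edge as a rough proxy for
// network round-trip time from this deployment region.
func measureProviderRTT(addr string) (time.Duration, error) {
    start := time.Now()
    conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
    if err != nil {
        return 0, err
    }
    defer conn.Close()
    return time.Since(start), nil
}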
Measuring Latency
Track these metrics in production:
type LatencyMetrics struct {
    VADEndOfSpeech time.Duration // Time to detect speech end
    STTFinal       time.Duration // Time to final transcript
    LLMFirstToken  time.Duration // Time to first LLM token
    LLMComplete    time.Duration // Time to complete response
    TTSFirstChunk  time.Duration // Time to first audio chunk
    E2ELatency     time.Duration // Total end-to-end
}

func (m *LatencyMetrics) Log() {
    log.Printf("Latency breakdown: VAD=%dms STT=%dms LLM=%dms TTS=%dms E2E=%dms",
        m.VADEndOfSpeech.Milliseconds(),
        m.STTFinal.Milliseconds(),
        m.LLMFirstToken.Milliseconds(),
        m.TTSFirstChunk.Milliseconds(),
        m.E2ELatency.Milliseconds(),
    )
}
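In practice the breakdown comes from timestamps captured at each stage boundary. A sketch, assuming those timestamps are recorded wherever the pipeline hands off between components:
// Build the breakdown from timestamps captured at stage boundaries.
func buildMetrics(speechEnd, vadDetected, sttFinal, llmFirstToken, ttsFirstChunk time.Time) LatencyMetrics {
    return LatencyMetrics{
        VADEndOfSpeech: vadDetected.Sub(speechEnd),
        STTFinal:       sttFinal.Sub(vadDetected),
        LLMFirstToken:  llmFirstToken.Sub(sttFinal),
        TTSFirstChunk:  ttsFirstChunk.Sub(llmFirstToken),
        E2ELatency:     ttsFirstChunk.Sub(speechEnd),
    }
}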
Latency Budget
For a target of 500ms E2E:
| Component | Budget | Optimization |
|---|---|---|
| VAD | 50ms | Lower threshold |
| STT | 120ms | Deepgram Nova-3 + streaming |
| LLM | 200ms | Gemini 2.5 Flash-Lite |
| TTS | 80ms | Cartesia + streaming |
| Network | 50ms | Regional deployment |
| Total | 500ms | |
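The budget is most useful when it is enforced. A sketch that compares each measured stage against its budget and logs overruns; the budget values match the table above, and the alerting mechanism is illustrative:
// Per-component budgets for the 500ms end-to-end target.
var latencyBudget = map[string]time.Duration{
    "vad": 50 * time.Millisecond,
    "stt": 120 * time.Millisecond,
    "llm": 200 * time.Millisecond,
    "tts": 80 * time.Millisecond,
    "e2e": 500 * time.Millisecond,
}

// Log any stage that exceeds its budget so regressions surface quickly.
func checkBudget(m LatencyMetrics) {
    actual := map[string]time.Duration{
        "vad": m.VADEndOfSpeech,
        "stt": m.STTFinal,
        "llm": m.LLMFirstToken,
        "tts": m.TTSFirstChunk,
        "e2e": m.E2ELatency,
    }
    for stage, limit := range latencyBudget {
        if actual[stage] > limit {
            log.Printf("latency budget exceeded: %s=%dms (budget %dms)",
                stage, actual[stage].Milliseconds(), limit.Milliseconds())
        }
    }
}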
Next Steps
- VAD Configuration - Fine-tune voice detection
- Provider Selection - Choose optimal providers
- Monitoring - Track latency in production