Latency Optimization
Latency is the single most important factor in voice agent quality. Users expect near-instant responses - anything over 1 second feels unnatural.
Understanding Voice Agent Latency
End-to-end latency is the time from when the user stops speaking to when they hear the bot's response:
User finishes speaking
│
├── VAD End-of-Speech Detection ────── ~50ms
│
├── Final STT Transcript ─────────── ~100-200ms
│
├── LLM First Token ──────────────── ~150-300ms
│
├── TTS First Audio Chunk ─────────── ~50-150ms
│
├── Network/Encoding Overhead ─────── ~20-50ms
│
▼
Bot starts speaking
Total: 370-750ms (target: <500ms)
Edesy vs Competitors
| Platform | Typical E2E Latency | Our Advantage |
|---|---|---|
| Edesy | 400-500ms | Optimized pipeline |
| Pipecat | 600-800ms | - |
| Retell AI | 700-1000ms | - |
| Vapi | 800-1200ms | - |
| Bland AI | 900-1500ms | - |
Optimization Strategies
1. STT Streaming with Interim Results
Don't wait for the user to finish speaking - start processing interim transcripts:
// Traditional approach (slow)
func processTraditional(audio []byte) string {
    // Wait for complete transcript
    transcript := stt.TranscribeFull(audio) // 500ms+
    response := llm.Generate(transcript)    // 300ms+
    return response
}
// Optimized approach (fast)
func processOptimized(interimChan <-chan string, finalChan <-chan string) string {
    for {
        select {
        case interim := <-interimChan:
            // Pre-warm LLM context with the interim transcript
            llm.PrepareContext(interim)
        case final := <-finalChan:
            // LLM context is already warm, so generation starts sooner
            return llm.Generate(final)
        }
    }
}
Impact: saves 100-200ms
2. LLM Provider Selection
Choose the right LLM for your latency requirements:
| Provider | Model | Time to First Token | Best For |
|---|---|---|---|
| Google | Gemini 2.5 Flash-Lite | ~100ms | Fastest, simple tasks |
| Google | Gemini 2.0 Flash | ~150ms | Good balance |
| OpenAI | GPT-4o-mini | ~180ms | Quality + speed |
| Google | Gemini Live | ~50ms | Native audio (no STT/TTS) |
| OpenAI | GPT-4o | ~250ms | Complex reasoning |
Recommendation: Use gemini-2.5-flash-lite for voice agents; switch to gpt-4o only for complex reasoning tasks.
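Model choice can also be made per turn rather than per agent. A minimal sketch, assuming a simple complexity heuristic; pickModel and its inputs are illustrative and not part of any provider SDK, only the model names come from the table above:
// Default to the fastest model and escalate only when the turn genuinely
// needs tool use or multi-step reasoning (the heuristic is illustrative).
const (
    fastModel      = "gemini-2.5-flash-lite" // ~100ms time to first token
    reasoningModel = "gpt-4o"                // ~250ms, reserved for complex turns
)

func pickModel(needsTools bool, reasoningSteps int) string {
    if needsTools || reasoningSteps > 2 {
        return reasoningModel
    }
    return fastModel
}
Keeping the fast model as the default means the extra latency of the larger model is only paid on the turns that actually need it.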
3. TTS Streaming
Don't wait for complete audio - stream chunks as they're generated:
// Traditional (slow)
func synthesizeTraditional(text string) []byte {
    return tts.SynthesizeFull(text) // Wait for complete audio
}

// Streaming (fast)
func synthesizeStreaming(text string) <-chan []byte {
    return tts.StreamSynthesize(text) // Get chunks as ready
}

// In practice
for chunk := range tts.StreamSynthesize(response) {
    callProvider.SendAudio(chunk) // Send immediately
}
Impact: saves 100-300ms
4. Reduce VAD End-of-Speech Delay
Tune VAD parameters for faster turn detection:
// Aggressive settings (faster, might cut off)
cfg := VADConfig{
    Threshold:            0.7,  // Lower = more sensitive
    MinSilenceDurationMs: 150,  // Shorter = faster end detection
    VolumeThreshold:      0.02, // Filter background noise
}

// Conservative settings (safer, slower)
cfg := VADConfig{
    Threshold:            0.9,
    MinSilenceDurationMs: 300,
    VolumeThreshold:      0.0, // No volume gating
}
Impact: saves 50-150ms
5. Use Gemini Live for Native Audio
Bypass STT and TTS entirely with audio-to-audio models:
Traditional Pipeline:
Audio → STT (150ms) → LLM (200ms) → TTS (100ms) → Audio
Total: ~450ms
Gemini Live Pipeline:
Audio → Gemini Live (200ms) → Audio
Total: ~200ms
Impact: saves 200-300ms
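One way the switch could look in an agent's pipeline configuration, as a minimal sketch; PipelineMode, PipelineConfig, and the field names are assumptions for illustration, not Edesy's actual schema:
// Illustrative pipeline modes: cascaded STT → LLM → TTS versus native speech-to-speech.
type PipelineMode string

const (
    ModeCascaded       PipelineMode = "cascaded"
    ModeSpeechToSpeech PipelineMode = "speech_to_speech"
)

type PipelineConfig struct {
    Mode  PipelineMode
    Model string // audio-native model used when Mode is speech_to_speech
}

// Native audio removes the STT and TTS hops from the critical path.
var liveConfig = PipelineConfig{
    Mode:  ModeSpeechToSpeech,
    Model: "gemini-live", // placeholder; substitute the current Gemini Live model name
}
The usual trade-off is less per-stage control (voice selection, custom STT vocabulary), so cascaded mode remains the right default when those matter.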
6. Connection Pooling
Reuse connections instead of creating new ones per call:
// Connection pool for providers
type ProviderPool struct {
    sttPool *ConnectionPool
    ttsPool *ConnectionPool
    llmPool *ConnectionPool
}

// Pre-warm connections on startup
func (p *ProviderPool) WarmUp() {
    p.sttPool.PreConnect(10) // 10 warm STT connections
    p.ttsPool.PreConnect(10) // 10 warm TTS connections
    p.llmPool.PreConnect(10) // 10 warm LLM connections
}
Impact: saves 50-100ms on the first message
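ConnectionPool is not defined above; one minimal way to back it, assuming each provider client exposes a dial function, is a buffered channel of warm connections (a sketch, not the actual implementation):
// Minimal pool: a buffered channel of pre-established connections.
type ConnectionPool struct {
    conns chan net.Conn
    dial  func() (net.Conn, error)
}

func NewConnectionPool(size int, dial func() (net.Conn, error)) *ConnectionPool {
    return &ConnectionPool{conns: make(chan net.Conn, size), dial: dial}
}

// PreConnect fills the pool so early calls skip the connection handshake.
func (p *ConnectionPool) PreConnect(n int) {
    for i := 0; i < n; i++ {
        if conn, err := p.dial(); err == nil {
            p.conns <- conn
        }
    }
}

// Get hands out a warm connection when one is available and dials otherwise.
func (p *ConnectionPool) Get() (net.Conn, error) {
    select {
    case conn := <-p.conns:
        return conn, nil
    default:
        return p.dial()
    }
}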
7. Greeting Audio Caching
Pre-generate and cache the greeting audio:
// At agent creation time
func cacheGreeting(agent *Agent) {
    audio := tts.Synthesize(agent.GreetingMessage)
    cache.Set("greeting:"+agent.ID, audio, 24*time.Hour)
}

// At call start time
func playGreeting(agent *Agent, output chan []byte) {
    audio, found := cache.Get("greeting:" + agent.ID)
    if found {
        output <- audio // Instant playback
        return
    }
    // Fallback to real-time synthesis
    output <- tts.Synthesize(agent.GreetingMessage)
}
Impact: saves 100-200ms on the first response
8. Prompt Optimization
Shorter prompts = faster LLM processing:
❌ Long prompt (slow):
"You are a customer support representative for Acme Corporation,
a leading provider of industrial equipment and supplies. Your role
is to assist customers with their orders, answer questions about
products, and help resolve any issues they may have. Always be
polite, professional, and helpful. If you don't know the answer,
say so and offer to connect them with a human agent..."
(500+ tokens)
✅ Optimized prompt (fast):
"You are Acme Corp support. Help with orders and products.
Be concise. Transfer to human if needed."
(50 tokens)
Impact: saves 50-100ms
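If system prompts are user-editable, an approximate length check at agent-save time helps catch regressions. A rough sketch: the 4-characters-per-token ratio is a crude heuristic for English, and the 200-token threshold is an arbitrary example:
// Rough estimate: English text averages roughly 4 characters per token.
func estimateTokens(prompt string) int {
    return len(prompt) / 4
}

// Warn when a system prompt is long enough to add noticeable
// time-to-first-token latency (the threshold is illustrative).
func warnIfPromptTooLong(prompt string) {
    if n := estimateTokens(prompt); n > 200 {
        log.Printf("system prompt is ~%d tokens; consider trimming it for latency", n)
    }
}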
9. Function Call Optimization
Pre-fetch data when intent is detected:
// When the user says "What's my order status?"
// call is the agent's per-call state (illustrative type) that can carry prefetched data.
func handleOrderStatusIntent(call *CallContext, orderID string, messages []Message) string {
    // Start fetching the data immediately; don't wait for the LLM to request it
    go func() {
        data := fetchOrderStatus(orderID)
        call.SetPrefetchedData("order_status", data)
    }()
    // The LLM uses the prefetched data if it has arrived by generation time
    return llm.Generate(call, messages)
}
Impact: saves 200-500ms on function calls
10. Regional Optimization
Deploy close to your telephony provider:
| Your Region | Recommended Provider | Deploy To |
|---|---|---|
| India | Exotel | Mumbai (AWS ap-south-1) |
| US | Twilio | Virginia (AWS us-east-1) |
| Europe | Twilio | Frankfurt (AWS eu-central-1) |
Impact: saves 20-100ms of network latency
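A quick way to validate placement is to measure connect time from the deployment region to the telephony provider's media edge. A sketch using a plain TCP dial as a rough proxy for network round-trip time; the address passed in is a placeholder for your provider's endpoint:
// Measure TCP connect time to the provider's edge as a rough proxy for
// network round-trip time from this deployment region.
func measureProviderRTT(addr string) (time.Duration, error) {
    start := time.Now()
    conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
    if err != nil {
        return 0, err
    }
    defer conn.Close()
    return time.Since(start), nil
}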
Measuring Latency
Track these metrics in production:
type LatencyMetrics struct {
    VADEndOfSpeech time.Duration // Time to detect speech end
    STTFinal       time.Duration // Time to final transcript
    LLMFirstToken  time.Duration // Time to first LLM token
    LLMComplete    time.Duration // Time to complete response
    TTSFirstChunk  time.Duration // Time to first audio chunk
    E2ELatency     time.Duration // Total end-to-end
}

func (m *LatencyMetrics) Log() {
    log.Printf("Latency breakdown: VAD=%dms STT=%dms LLM=%dms TTS=%dms E2E=%dms",
        m.VADEndOfSpeech.Milliseconds(),
        m.STTFinal.Milliseconds(),
        m.LLMFirstToken.Milliseconds(),
        m.TTSFirstChunk.Milliseconds(),
        m.E2ELatency.Milliseconds(),
    )
}
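In practice the breakdown comes from timestamps captured at each stage boundary. A sketch, assuming those timestamps are recorded wherever the pipeline hands off between components:
// Build the breakdown from timestamps captured at stage boundaries.
func buildMetrics(speechEnd, vadDetected, sttFinal, llmFirstToken, ttsFirstChunk time.Time) LatencyMetrics {
    return LatencyMetrics{
        VADEndOfSpeech: vadDetected.Sub(speechEnd),
        STTFinal:       sttFinal.Sub(vadDetected),
        LLMFirstToken:  llmFirstToken.Sub(sttFinal),
        TTSFirstChunk:  ttsFirstChunk.Sub(llmFirstToken),
        E2ELatency:     ttsFirstChunk.Sub(speechEnd),
    }
}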
Latency Budget
For a target of 500ms E2E:
| Component | Budget | Optimization |
|---|---|---|
| VAD | 50ms | Lower threshold |
| STT | 120ms | Deepgram Nova-3 + streaming |
| LLM | 200ms | Gemini 2.5 Flash-Lite |
| TTS | 80ms | Cartesia + streaming |
| Network | 50ms | Regional deployment |
| Total | 500ms | |
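The budget is most useful when it is enforced. A sketch that compares each measured stage against its budget and logs overruns; the budget values match the table above, and the alerting mechanism is illustrative:
// Per-component budgets for the 500ms end-to-end target.
var latencyBudget = map[string]time.Duration{
    "vad": 50 * time.Millisecond,
    "stt": 120 * time.Millisecond,
    "llm": 200 * time.Millisecond,
    "tts": 80 * time.Millisecond,
    "e2e": 500 * time.Millisecond,
}

// Log any stage that exceeds its budget so regressions surface quickly.
func checkBudget(m LatencyMetrics) {
    actual := map[string]time.Duration{
        "vad": m.VADEndOfSpeech,
        "stt": m.STTFinal,
        "llm": m.LLMFirstToken,
        "tts": m.TTSFirstChunk,
        "e2e": m.E2ELatency,
    }
    for stage, limit := range latencyBudget {
        if actual[stage] > limit {
            log.Printf("latency budget exceeded: %s=%dms (budget %dms)",
                stage, actual[stage].Milliseconds(), limit.Milliseconds())
        }
    }
}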
Next Steps
- VAD Configuration - Fine-tune voice detection
- Provider Selection - Choose optimal providers
- Monitoring - Track latency in production