TTS Providers Overview

Text-to-Speech (TTS) converts the LLM's response into natural-sounding audio. The provider you choose determines how quickly the agent starts speaking, how natural it sounds, and which languages it can handle.

Supported Providers

Provider	Quality	Latency	Languages	Best For
Cartesia	⭐⭐⭐⭐⭐	⚡ Fastest	50+	Low-latency voice agents
ElevenLabs	⭐⭐⭐⭐⭐	🚀 Fast	30+	Premium voice quality
Google	⭐⭐⭐⭐	🚀 Fast	50+	Multilingual
Azure	⭐⭐⭐⭐⭐	🚀 Fast	100+	Enterprise + Indic
OpenAI	⭐⭐⭐⭐	🚀 Fast	57+	Simple integration
Deepgram	⭐⭐⭐⭐	⚡ Fastest	30+	Aura voices

Quick Comparison

Time to First Audio Chunk (lower is better):
──────────────────────────────────────────────────────────────────

Cartesia Sonic    ████████░░░░░░░░░░░░░░░░░░░░░░░░  50ms
Deepgram Aura     ██████████░░░░░░░░░░░░░░░░░░░░░░  60ms
Google WaveNet    ████████████████░░░░░░░░░░░░░░░░  100ms
Azure Neural      ███████████████████░░░░░░░░░░░░░  120ms
ElevenLabs        ████████████████████████░░░░░░░░  150ms
OpenAI TTS        █████████████████████████████░░░  180ms

                  0ms              100ms            200ms

Choosing the Right Provider

For Low Latency (Recommended)

Cartesia Sonic

Fastest time-to-first-audio (~50ms)
High-quality neural voices
Excellent streaming support
Optimized for real-time

{
  "ttsProvider": "cartesia",
  "ttsVoice": "95856005-0332-41b0-935f-352e296aa0df"
}

For Premium Quality

ElevenLabs

Most natural-sounding voices
Emotion and style control
Voice cloning available
Best for premium experiences

{
  "ttsProvider": "elevenlabs",
  "ttsVoice": "21m00Tcm4TlvDq8ikWAM"
}

For Indic Languages

Azure Neural

Best quality for Hindi, Tamil, Telugu, Bengali
Excellent Assamese support
Regional accent options

{
  "ttsProvider": "azure",
  "ttsVoice": "hi-IN-SwaraNeural"
}

For Enterprise

Azure Neural or Google WaveNet

Enterprise SLAs
Data residency options
Custom voice training
SOC 2 reports and HIPAA-eligible services
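
The recommendations in this section can be condensed into a small lookup. The sketch below is illustrative only: the `Priority` type, its constants, and the `recommend` and `configJSON` functions are hypothetical names, not part of any platform API. Only the `ttsProvider` and `ttsVoice` keys and the voice IDs are taken from the examples above. The Enterprise case is left out because this page gives no single config for it.

```go
package main

import "encoding/json"

// TTSConfig mirrors the two JSON keys used in the config examples.
type TTSConfig struct {
    Provider string `json:"ttsProvider"`
    Voice    string `json:"ttsVoice"`
}

// Priority is a hypothetical label for what matters most to the agent.
type Priority string

const (
    LowLatency     Priority = "low-latency"
    PremiumQuality Priority = "premium-quality"
    IndicLanguages Priority = "indic"
)

// recommend returns the example config for a priority.
// Unknown priorities fall back to the low-latency recommendation.
func recommend(p Priority) TTSConfig {
    switch p {
    case PremiumQuality:
        return TTSConfig{Provider: "elevenlabs", Voice: "21m00Tcm4TlvDq8ikWAM"}
    case IndicLanguages:
        return TTSConfig{Provider: "azure", Voice: "hi-IN-SwaraNeural"}
    default:
        return TTSConfig{Provider: "cartesia", Voice: "95856005-0332-41b0-935f-352e296aa0df"}
    }
}

// configJSON renders a config in the same shape as the examples.
func configJSON(c TTSConfig) (string, error) {
    b, err := json.Marshal(c)
    return string(b), err
}
```

Calling `configJSON(recommend(IndicLanguages))` would produce the Azure config shown under "For Indic Languages", without the whitespace.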

Streaming Architecture

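Streaming TTS is critical for low latency: playback starts as soon as the first audio chunk is synthesized, instead of waiting for the whole response to be generated.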

Non-Streaming (Slow):
─────────────────────────────────────────────────────────────
LLM: "Your order has been shipped..."
      │
      └── Wait for complete text ──────────────────────────┐
                                                           │
TTS: ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ [Generate all] ──┤
                                                           │
Audio: ────────────────────────────── [Play all at once] ──┘
Total: 500ms+ delay

Streaming (Fast):
─────────────────────────────────────────────────────────────
LLM:   "Your" → "order" → "has" → "been" → "shipped..."
         │        │        │        │         │
TTS:    [Gen]    [Gen]    [Gen]    [Gen]     [Gen]
         │        │        │        │         │
Audio:  [Play]   [Play]   [Play]   [Play]    [Play]
         └─── 50ms ──┘

Implementation

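import "strings"

// Stream TTS as LLM tokens arrive.
// Assumptions: tts is the configured TTS client and StreamSynthesize returns
// a channel of audio chunks; endsWithPunctuation reports whether the text
// ends with sentence-final punctuation (".", "!" or "?").
func streamTTS(llmOutput <-chan string, audioOutput chan<- []byte) {
    var buffer strings.Builder

    // speak synthesizes the buffered text and forwards audio chunks in order.
    speak := func() {
        text := strings.TrimSpace(buffer.String())
        buffer.Reset()
        if text == "" {
            return // only whitespace buffered; skip the TTS call
        }
        for chunk := range tts.StreamSynthesize(text) {
            audioOutput <- chunk
        }
    }

    for token := range llmOutput {
        buffer.WriteString(token)

        // Flush at sentence boundaries so the provider receives whole
        // phrases and can apply natural intonation
        if endsWithPunctuation(buffer.String()) {
            speak()
        }
    }

    // Flush any text left over once the LLM stream closes
    speak()
}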
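
To make the flush rule concrete, here is a minimal sketch of how a token stream is grouped into the text units that reach the TTS provider. `endsWithPunctuation` is one possible definition of the helper used above, and `segment` is a hypothetical function added only for illustration; neither is confirmed by this page.

```go
package main

import "strings"

// endsWithPunctuation reports whether s ends with ".", "!" or "?",
// ignoring trailing whitespace. (Assumed definition.)
func endsWithPunctuation(s string) bool {
    s = strings.TrimRight(s, " \t\r\n")
    return strings.HasSuffix(s, ".") ||
        strings.HasSuffix(s, "!") ||
        strings.HasSuffix(s, "?")
}

// segment groups streamed tokens into the chunks of text that would each
// trigger one TTS request: one per sentence, plus any unterminated tail.
func segment(tokens []string) []string {
    var out []string
    var buf strings.Builder
    for _, tok := range tokens {
        buf.WriteString(tok)
        if endsWithPunctuation(buf.String()) {
            out = append(out, strings.TrimSpace(buf.String()))
            buf.Reset()
        }
    }
    if rest := strings.TrimSpace(buf.String()); rest != "" {
        out = append(out, rest)
    }
    return out
}
```

With the tokens `"Your"`, `" order"`, `" has"`, `" shipped."`, `" It"`, `" arrives"`, `" Monday"`, this yields two requests: `"Your order has shipped."` and `"It arrives Monday"`. A rule this simple also flushes early on abbreviations such as "Dr." or on a decimal split across tokens, so a production splitter would likely need extra checks.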

Audio Specifications

Parameter	Requirement	Notes
Sample Rate	8000 Hz	Standard telephony rate
Channels	Mono	Single channel
Bit Depth	16-bit	Linear PCM before encoding; μ-law uses 8 bits per sample
Output	μ-law or PCM	Provider dependent
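
These parameters translate directly into bandwidth and packet sizes. The sketch below works through the arithmetic; the function names are hypothetical, and the 20 ms frame length used in the note is an assumption (a common packetization interval), not a requirement stated on this page.

```go
package main

// bitrateKbps returns the raw audio bitrate in kilobits per second.
func bitrateKbps(sampleRateHz, bitsPerSample, channels int) int {
    return sampleRateHz * bitsPerSample * channels / 1000
}

// bytesPerFrame returns the payload size of one audio frame of frameMs
// milliseconds.
func bytesPerFrame(sampleRateHz, bitsPerSample, channels, frameMs int) int {
    return sampleRateHz * bitsPerSample / 8 * channels * frameMs / 1000
}
```

For 8000 Hz mono, μ-law at 8 bits per sample comes to 64 kbps and 160 bytes per 20 ms frame, while 16-bit linear PCM comes to 128 kbps and 320 bytes per frame.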

Audio Conversion for Telephony

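// TTS providers often return 16-bit PCM at a higher rate (for example
// 24 kHz); telephony needs 8 kHz μ-law.
// downsample and pcmToMulaw are helpers assumed to be defined elsewhere.
func convertForTelephony(input []byte, inputRate int) []byte {
    // Downsample from inputRate to 8 kHz
    resampled := downsample(input, inputRate, 8000)

    // Compress each 16-bit sample to one 8-bit μ-law byte
    return pcmToMulaw(resampled)
}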
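
The two helpers are not defined on this page. The following is a minimal sketch of what they might look like, assuming 16-bit little-endian mono PCM input and an input rate that is an integer multiple of 8000 Hz. The names `linearToMulaw`, `decimate`, and `toTelephony` are hypothetical. The encoder follows the widely used G.711 μ-law bias-and-segment method; the decimator averages each group of samples as a crude low-pass filter, where a production pipeline would more likely use a proper resampling filter.

```go
package main

import "encoding/binary"

const (
    mulawBias = 0x84  // 132, added before the segment search
    mulawClip = 32635 // largest magnitude that still fits after the bias
)

// linearToMulaw encodes one 16-bit linear PCM sample as an 8-bit μ-law byte.
func linearToMulaw(sample int16) byte {
    s := int(sample)
    sign := 0
    if s < 0 {
        s = -s
        sign = 0x80
    }
    if s > mulawClip {
        s = mulawClip
    }
    s += mulawBias

    // Find the segment (exponent): position of the highest set bit.
    exponent := 7
    for mask := 0x4000; exponent > 0 && s&mask == 0; mask >>= 1 {
        exponent--
    }
    mantissa := (s >> (exponent + 3)) & 0x0F

    // μ-law bytes are transmitted bit-inverted.
    return byte(^(sign | exponent<<4 | mantissa))
}

// decimate lowers the sample rate by an integer factor, averaging each
// group of samples. Trailing samples that do not fill a group are dropped.
func decimate(samples []int16, factor int) []int16 {
    if factor <= 1 {
        return samples
    }
    out := make([]int16, 0, len(samples)/factor)
    for i := 0; i+factor <= len(samples); i += factor {
        sum := 0
        for _, s := range samples[i : i+factor] {
            sum += int(s)
        }
        out = append(out, int16(sum/factor))
    }
    return out
}

// toTelephony converts 16-bit little-endian mono PCM to 8 kHz μ-law.
// inputRate is assumed to be a multiple of 8000 (e.g. 16000 or 24000).
func toTelephony(pcm []byte, inputRate int) []byte {
    samples := make([]int16, len(pcm)/2)
    for i := range samples {
        samples[i] = int16(binary.LittleEndian.Uint16(pcm[2*i:]))
    }
    low := decimate(samples, inputRate/8000)
    out := make([]byte, len(low))
    for i, s := range low {
        out[i] = linearToMulaw(s)
    }
    return out
}
```

At 24 kHz input, every three PCM samples (six bytes) become one μ-law byte, so the output is one sixth the size of the input.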

Voice Selection Guide

By Use Case

Use Case	Recommended Voice Type
Customer Support	Warm, friendly, moderate pace
Sales	Energetic, confident
Healthcare	Calm, clear, reassuring
Banking	Professional, trustworthy
Entertainment	Dynamic, expressive

Voice Characteristics

type VoiceProfile struct {
    Gender     string   // male, female, neutral
    Age        string   // young, adult, senior
    Tone       string   // warm, professional, casual
    Pace       string   // slow, moderate, fast
    Pitch      string   // low, medium, high
    Languages  []string // Supported languages
}

// Example profiles
var SupportVoice = VoiceProfile{
    Gender: "female",
    Age:    "adult",
    Tone:   "warm",
    Pace:   "moderate",
    Pitch:  "medium",
}
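
One way a profile like this could be used is as a filter over a catalog of available voices. The sketch below is hypothetical: the `matches` function and the idea of treating empty fields as "don't care" are assumptions for illustration, and the struct is repeated only so the example is self-contained.

```go
package main

// VoiceProfile repeats the struct above so this example stands alone.
type VoiceProfile struct {
    Gender    string
    Age       string
    Tone      string
    Pace      string
    Pitch     string
    Languages []string
}

// matches reports whether candidate satisfies every non-empty field of
// want and, when lang is non-empty, lists lang among its languages.
func matches(candidate, want VoiceProfile, lang string) bool {
    fields := [][2]string{
        {candidate.Gender, want.Gender},
        {candidate.Age, want.Age},
        {candidate.Tone, want.Tone},
        {candidate.Pace, want.Pace},
        {candidate.Pitch, want.Pitch},
    }
    for _, f := range fields {
        if f[1] != "" && f[0] != f[1] {
            return false
        }
    }
    if lang == "" {
        return true
    }
    for _, l := range candidate.Languages {
        if l == lang {
            return true
        }
    }
    return false
}
```

For a customer support agent, `want` might set only `Tone: "warm"` and `Pace: "moderate"`, leaving the other fields empty so that any gender, age, or pitch is accepted.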

Cost Comparison

Provider	Cost per 1K chars	Monthly (10M chars)
Cartesia	$0.015	$150
Deepgram	$0.015	$150
Google WaveNet	$0.016	$160
Azure Neural	$0.016	$160
OpenAI TTS	$0.015	$150
ElevenLabs	$0.18	$1,800
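
The monthly column is simply the character volume divided by 1,000 and multiplied by the per-1K rate. A minimal sketch of that calculation follows, using the rates from the table; the `costPer1K` map and `monthlyCost` function are hypothetical names, and actual billing may differ by plan or tier.

```go
package main

// costPer1K holds the USD price per 1,000 characters from the table above.
var costPer1K = map[string]float64{
    "cartesia":   0.015,
    "deepgram":   0.015,
    "google":     0.016,
    "azure":      0.016,
    "openai":     0.015,
    "elevenlabs": 0.18,
}

// monthlyCost estimates the monthly spend in USD for a character volume.
// The second result is false when the provider is not in the table.
func monthlyCost(provider string, chars int) (float64, bool) {
    rate, ok := costPer1K[provider]
    if !ok {
        return 0, false
    }
    return float64(chars) / 1000 * rate, true
}
```

For example, 10M characters on ElevenLabs is 10,000 × $0.18 = $1,800, while the same volume on Cartesia is 10,000 × $0.015 = $150.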

SSML Support

Speech Synthesis Markup Language (SSML) gives fine-grained control over pauses, pronunciation, emphasis, and prosody:

<!-- Pause -->
<speak>
  Your order number is <break time="500ms"/> 1 2 3 4 5.
</speak>

<!-- Pronunciation -->
<speak>
  <phoneme alphabet="ipa" ph="ˈɛdəsi">Edesy</phoneme>
  helps you build voice agents.
</speak>

<!-- Emphasis -->
<speak>
  Your order is <emphasis level="strong">confirmed</emphasis>.
</speak>

<!-- Prosody (speed, pitch, volume) -->
<speak>
  <prosody rate="slow" pitch="+5%">
    Please speak clearly after the beep.
  </prosody>
</speak>

Provider SSML Support

Feature	Cartesia	ElevenLabs	Google	Azure
Pauses	✅	✅	✅	✅
Pronunciation	❌	❌	✅	✅
Emphasis	❌	✅	✅	✅
Prosody	❌	✅	✅	✅
Say-as (dates, numbers)	❌	❌	✅	✅
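
Because support varies, it can help to check a provider's capabilities before emitting a tag. The sketch below encodes the table above as a lookup; the `Feature` type, the `ssmlSupport` map, and `supports` are hypothetical names, and the values simply mirror the table rather than any provider's official documentation.

```go
package main

// Feature identifies one row of the SSML support table.
type Feature int

const (
    Pauses Feature = iota
    Pronunciation
    Emphasis
    Prosody
    SayAs
)

// ssmlSupport mirrors the support table; missing entries mean unsupported.
var ssmlSupport = map[string]map[Feature]bool{
    "cartesia":   {Pauses: true},
    "elevenlabs": {Pauses: true, Emphasis: true, Prosody: true},
    "google":     {Pauses: true, Pronunciation: true, Emphasis: true, Prosody: true, SayAs: true},
    "azure":      {Pauses: true, Pronunciation: true, Emphasis: true, Prosody: true, SayAs: true},
}

// supports reports whether the provider handles the feature.
// Unknown providers are treated as supporting nothing.
func supports(provider string, f Feature) bool {
    return ssmlSupport[provider][f]
}
```

A caller could, for instance, fall back to plain text with punctuation-based pauses when `supports(provider, Prosody)` is false.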

Caching Strategy

Cache synthesized audio for frequently used phrases so they play back without a TTS round trip:

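import (
    "crypto/sha256"
    "encoding/hex"
    "sync"
)

type TTSCache struct {
    mu    sync.RWMutex
    cache map[string][]byte
}

// NewTTSCache returns a cache with its map initialized; writing to the
// map of a zero-value TTSCache would panic.
func NewTTSCache() *TTSCache {
    return &TTSCache{cache: make(map[string][]byte)}
}

// GetOrGenerate returns cached audio for (voice, text) and synthesizes it
// on a miss. tts is assumed to be the configured TTS client.
func (c *TTSCache) GetOrGenerate(text string, voice string) []byte {
    // Hash the text so long phrases still produce short, fixed-length keys
    sum := sha256.Sum256([]byte(text))
    key := voice + ":" + hex.EncodeToString(sum[:])

    c.mu.RLock()
    audio, ok := c.cache[key]
    c.mu.RUnlock()
    if ok {
        return audio
    }

    // Generate outside the lock so a slow synthesis does not block readers.
    // Two concurrent misses may both synthesize; the last write wins.
    audio = tts.Synthesize(text, voice)

    c.mu.Lock()
    c.cache[key] = audio
    c.mu.Unlock()

    return audio
}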

// Pre-cache common phrases
func preCacheGreetings(agent *Agent) {
    phrases := []string{
        agent.GreetingMessage,
        "One moment please.",
        "I'm looking that up for you.",
        "Is there anything else I can help with?",
        "Thank you for calling. Goodbye!",
    }

    for _, phrase := range phrases {
        cache.GetOrGenerate(phrase, agent.TTSVoice)
    }
}

Error Handling

func (t *TTS) synthesizeWithFallback(text string) ([]byte, error) {
    // Try primary provider
    audio, err := t.primary.Synthesize(text)
    if err == nil {
        return audio, nil
    }

    log.Printf("Primary TTS failed: %v, trying fallback", err)

    // Try fallback provider
    audio, err = t.fallback.Synthesize(text)
    if err != nil {
        return nil, fmt.Errorf("all TTS providers failed: %w", err)
    }

    return audio, nil
}

Next Steps

Cartesia Configuration - Fastest TTS
ElevenLabs Configuration - Premium quality
Azure Configuration - Enterprise + Indic
Voice Selection Guide - Choose the right voice