TTS Providers Overview
Text-to-Speech (TTS) converts the LLM's response into natural-sounding audio. The provider you choose determines how natural the voice sounds, how quickly audio starts playing, and which languages you can serve.
Supported Providers
| Provider | Quality | Latency | Languages | Best For |
|---|---|---|---|---|
| Cartesia | ⭐⭐⭐⭐⭐ | ⚡ Fastest | 50+ | Low-latency voice agents |
| ElevenLabs | ⭐⭐⭐⭐⭐ | 🚀 Fast | 30+ | Premium voice quality |
| Google | ⭐⭐⭐⭐ | 🚀 Fast | 50+ | Multilingual |
| Azure | ⭐⭐⭐⭐⭐ | 🚀 Fast | 100+ | Enterprise + Indic |
| OpenAI | ⭐⭐⭐⭐ | 🚀 Fast | 57+ | Simple integration |
| Deepgram | ⭐⭐⭐⭐ | ⚡ Fastest | 30+ | Aura voices |
Quick Comparison
Time to First Audio Chunk (lower is better):
──────────────────────────────────────────────────────────────────
Cartesia Sonic ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 50ms
Deepgram Aura ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 60ms
Google WaveNet ██████████░░░░░░░░░░░░░░░░░░░░░░ 100ms
Azure Neural ████████████░░░░░░░░░░░░░░░░░░░░ 120ms
ElevenLabs ████████████████░░░░░░░░░░░░░░░░ 150ms
OpenAI TTS ██████████████████░░░░░░░░░░░░░░ 180ms
0ms 100ms 200ms
Choosing the Right Provider
For Low Latency (Recommended)
Cartesia Sonic
- Fastest time-to-first-audio (~50ms)
- High-quality neural voices
- Excellent streaming support
- Optimized for real-time
{
    "ttsProvider": "cartesia",
    "ttsVoice": "95856005-0332-41b0-935f-352e296aa0df"
}
For Premium Quality
ElevenLabs
- Most natural-sounding voices
- Emotion and style control
- Voice cloning available
- Best for premium experiences
{
    "ttsProvider": "elevenlabs",
    "ttsVoice": "21m00Tcm4TlvDq8ikWAM"
}
For Indic Languages
Azure Neural
- Best quality for Hindi, Tamil, Telugu, Bengali
- Excellent Assamese support
- Regional accent options
{
    "ttsProvider": "azure",
    "ttsVoice": "hi-IN-SwaraNeural"
}
For Enterprise
Azure Neural or Google WaveNet
- Enterprise SLAs
- Data residency options
- Custom voice training
- SOC 2, HIPAA compliant
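The JSON fragments in this section share two fields, `ttsProvider` and `ttsVoice`. The following is a minimal sketch of how a Go service might load and sanity-check them. The `AgentTTSConfig` type, the `parseTTSConfig` function, and the allow-list are illustrative names rather than a documented API, and the list contains only the three provider identifiers that appear above.

```go
package main

import (
    "encoding/json"
    "errors"
    "fmt"
)

// AgentTTSConfig mirrors the two JSON fields used in the examples above.
// The type name is hypothetical.
type AgentTTSConfig struct {
    Provider string `json:"ttsProvider"`
    Voice    string `json:"ttsVoice"`
}

// knownProviders holds only the identifiers shown in this section.
// Identifiers for the other providers are not given here, so they are omitted.
var knownProviders = map[string]bool{
    "cartesia":   true,
    "elevenlabs": true,
    "azure":      true,
}

// parseTTSConfig decodes raw JSON and rejects unknown providers or empty voices.
func parseTTSConfig(raw []byte) (AgentTTSConfig, error) {
    var cfg AgentTTSConfig
    if err := json.Unmarshal(raw, &cfg); err != nil {
        return cfg, fmt.Errorf("invalid TTS config: %w", err)
    }
    if !knownProviders[cfg.Provider] {
        return cfg, fmt.Errorf("unknown ttsProvider %q", cfg.Provider)
    }
    if cfg.Voice == "" {
        return cfg, errors.New("ttsVoice must not be empty")
    }
    return cfg, nil
}

func main() {
    raw := []byte(`{"ttsProvider": "azure", "ttsVoice": "hi-IN-SwaraNeural"}`)
    cfg, err := parseTTSConfig(raw)
    fmt.Println(cfg.Provider, cfg.Voice, err)
}
```

Validating the provider at load time surfaces a typo in the configuration before the first call is placed, rather than as a synthesis failure mid-conversation.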
Streaming Architecture
TTS streaming is critical for low latency:
Non-Streaming (Slow):
─────────────────────────────────────────────────────────────
LLM: "Your order has been shipped..."
│
└── Wait for complete text ──────────────────────────┐
│
TTS: ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ [Generate all] ──┤
│
Audio: ────────────────────────────── [Play all at once] ──┘
Total: 500ms+ delay
Streaming (Fast):
─────────────────────────────────────────────────────────────
LLM: "Your" → "order" → "has" → "been" → "shipped..."
│ │ │ │ │
TTS: [Gen] [Gen] [Gen] [Gen] [Gen]
│ │ │ │ │
Audio: [Play] [Play] [Play] [Play] [Play]
└─── 50ms ──┘
Implementation
// Stream TTS as LLM tokens arrive.
// Assumes the "strings" import, a package-level tts client whose StreamSynthesize
// returns a channel of audio chunks, and an endsWithPunctuation helper.
func streamTTS(llmOutput <-chan string, audioOutput chan<- []byte) {
    var buffer strings.Builder
    for token := range llmOutput {
        buffer.WriteString(token)
        // Flush the buffer at sentence boundaries so synthesis starts early
        if endsWithPunctuation(buffer.String()) {
            text := buffer.String()
            buffer.Reset()
            // Forward audio chunks as soon as the provider emits them
            for chunk := range tts.StreamSynthesize(text) {
                audioOutput <- chunk
            }
        }
    }
    // Flush any remaining text once the LLM stream closes
    if buffer.Len() > 0 {
        for chunk := range tts.StreamSynthesize(buffer.String()) {
            audioOutput <- chunk
        }
    }
}
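The `streamTTS` function above calls `endsWithPunctuation`, which is not defined in this section. One possible sketch follows. The set of sentence-ending characters (including the Devanagari danda for Hindi) is an assumption and should be adjusted for the languages you serve.

```go
package main

import (
    "fmt"
    "strings"
    "unicode/utf8"
)

// sentenceEnders lists the characters treated as sentence boundaries.
// This set is an assumption; extend it for the languages you serve.
const sentenceEnders = ".!?।"

// endsWithPunctuation reports whether s, ignoring trailing whitespace,
// ends with one of the sentenceEnders characters.
func endsWithPunctuation(s string) bool {
    trimmed := strings.TrimRight(s, " \t\r\n")
    if trimmed == "" {
        return false
    }
    last, _ := utf8.DecodeLastRuneInString(trimmed)
    return strings.ContainsRune(sentenceEnders, last)
}

func main() {
    fmt.Println(endsWithPunctuation("Your order has been shipped. "))
    fmt.Println(endsWithPunctuation("Your order"))
}
```

A check this simple also fires on abbreviations and decimals split across tokens (for example "Rs." or "3."), so a production version may want a minimum buffer length or a lookahead before flushing.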
Audio Specifications
| Parameter | Requirement | Notes |
|---|---|---|
| Sample Rate | 8000 Hz | Telephony (μ-law) |
| Channels | Mono | Single channel |
| Bit Depth | 16-bit | Linear PCM (μ-law payloads are 8-bit) |
| Output | μ-law or PCM | Provider dependent |
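To make these numbers concrete, the sketch below computes the payload size of a single audio frame. The 20 ms frame duration is an assumption (a common packetization interval in telephony), not a value taken from the table, and `frameBytes` is an illustrative helper name.

```go
package main

import "fmt"

// frameBytes returns the payload size in bytes of one mono audio frame.
func frameBytes(sampleRateHz, bytesPerSample, frameMs int) int {
    return sampleRateHz * bytesPerSample * frameMs / 1000
}

func main() {
    fmt.Println(frameBytes(8000, 1, 20)) // μ-law: 1 byte per sample
    fmt.Println(frameBytes(8000, 2, 20)) // 16-bit linear PCM: 2 bytes per sample
}
```

At 8000 Hz, a 20 ms frame holds 160 samples, so the μ-law payload is 160 bytes and the 16-bit PCM payload is 320 bytes.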
Audio Conversion for Telephony
// Example: the TTS provider outputs 24 kHz PCM, but telephony needs 8 kHz μ-law
func convertForTelephony(input []byte, inputRate int) []byte {
    // Downsample from inputRate to 8 kHz
    resampled := downsample(input, inputRate, 8000)
    // Compand 16-bit linear PCM to 8-bit μ-law
    return pcmToMulaw(resampled)
}
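The `downsample` and `pcmToMulaw` helpers used above are not shown. The sketch below is one way to fill them in, under these assumptions: input is 16-bit little-endian mono PCM, the input rate is an integer multiple of the output rate, and a simple averaging filter is acceptable. A production pipeline would normally use a proper low-pass resampler. The μ-law step follows the standard G.711 encoding with a bias of 132 and clipping at 32635.

```go
package main

import (
    "encoding/binary"
    "fmt"
)

// downsample averages each group of inputRate/outputRate samples.
// Assumes 16-bit little-endian mono PCM and an integer rate ratio.
func downsample(input []byte, inputRate, outputRate int) []byte {
    factor := inputRate / outputRate
    if factor <= 1 {
        return input
    }
    frames := len(input) / 2 / factor
    out := make([]byte, frames*2)
    for i := 0; i < frames; i++ {
        sum := 0
        for j := 0; j < factor; j++ {
            off := (i*factor + j) * 2
            sum += int(int16(binary.LittleEndian.Uint16(input[off:])))
        }
        binary.LittleEndian.PutUint16(out[i*2:], uint16(int16(sum/factor)))
    }
    return out
}

// linearToMulaw encodes one 16-bit sample as an 8-bit G.711 μ-law byte.
func linearToMulaw(sample int16) byte {
    const bias, clip = 0x84, 32635
    s, sign := int(sample), 0
    if s < 0 {
        s, sign = -s, 0x80
    }
    if s > clip {
        s = clip
    }
    s += bias
    // Find the segment (exponent) from the highest set bit.
    exponent := 7
    for mask := 0x4000; exponent > 0 && s&mask == 0; mask >>= 1 {
        exponent--
    }
    mantissa := (s >> (exponent + 3)) & 0x0F
    return byte(^(sign | exponent<<4 | mantissa))
}

// pcmToMulaw converts 16-bit little-endian PCM to μ-law, one byte per sample.
func pcmToMulaw(pcm []byte) []byte {
    out := make([]byte, len(pcm)/2)
    for i := range out {
        out[i] = linearToMulaw(int16(binary.LittleEndian.Uint16(pcm[i*2:])))
    }
    return out
}

func main() {
    // Six samples of 24 kHz silence become two 8 kHz μ-law bytes.
    pcm24k := make([]byte, 12)
    fmt.Printf("% x\n", pcmToMulaw(downsample(pcm24k, 24000, 8000)))
}
```

The output is half the size of the downsampled PCM because each 16-bit sample is companded to a single byte.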
Voice Selection Guide
By Use Case
| Use Case | Recommended Voice Type |
|---|---|
| Customer Support | Warm, friendly, moderate pace |
| Sales | Energetic, confident |
| Healthcare | Calm, clear, reassuring |
| Banking | Professional, trustworthy |
| Entertainment | Dynamic, expressive |
Voice Characteristics
type VoiceProfile struct {
    Gender    string   // male, female, neutral
    Age       string   // young, adult, senior
    Tone      string   // warm, professional, casual
    Pace      string   // slow, moderate, fast
    Pitch     string   // low, medium, high
    Languages []string // Supported languages
}

// Example profile
var SupportVoice = VoiceProfile{
    Gender: "female",
    Age:    "adult",
    Tone:   "warm",
    Pace:   "moderate",
    Pitch:  "medium",
}
Cost Comparison
| Provider | Cost per 1K chars | Monthly (10M chars) |
|---|---|---|
| Cartesia | $0.015 | $150 |
| Deepgram | $0.015 | $150 |
| Google WaveNet | $0.016 | $160 |
| Azure Neural | $0.016 | $160 |
| OpenAI TTS | $0.015 | $150 |
| ElevenLabs | $0.18 | $1,800 |
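The monthly column is the per-1K rate multiplied by the number of thousands of characters. The sketch below reproduces that arithmetic so you can plug in your own volume. The rates are copied from the table above; the lowercase provider keys and the `monthlyCostUSD` name are assumptions made for illustration.

```go
package main

import "fmt"

// usdPer1KChars copies the rates from the table above.
// The map keys are illustrative identifiers, not official provider IDs.
var usdPer1KChars = map[string]float64{
    "cartesia":   0.015,
    "deepgram":   0.015,
    "google":     0.016,
    "azure":      0.016,
    "openai":     0.015,
    "elevenlabs": 0.18,
}

// monthlyCostUSD estimates the monthly bill for a given character volume.
func monthlyCostUSD(provider string, charsPerMonth int) (float64, error) {
    rate, ok := usdPer1KChars[provider]
    if !ok {
        return 0, fmt.Errorf("no rate for provider %q", provider)
    }
    return float64(charsPerMonth) / 1000 * rate, nil
}

func main() {
    for _, p := range []string{"cartesia", "azure", "elevenlabs"} {
        cost, _ := monthlyCostUSD(p, 10_000_000)
        fmt.Printf("%-10s $%.2f\n", p, cost)
    }
}
```

For 10M characters this gives 10,000 × $0.015 = $150 for Cartesia and 10,000 × $0.18 = $1,800 for ElevenLabs, matching the table.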
SSML Support
Speech Synthesis Markup Language (SSML) gives fine-grained control over pauses, pronunciation, emphasis, and prosody:
<!-- Pause -->
<speak>
Your order number is <break time="500ms"/> 1 2 3 4 5.
</speak>
<!-- Pronunciation -->
<speak>
<phoneme alphabet="ipa" ph="ˈɛdəsi">Edesy</phoneme>
helps you build voice agents.
</speak>
<!-- Emphasis -->
<speak>
Your order is <emphasis level="strong">confirmed</emphasis>.
</speak>
<!-- Prosody (speed, pitch, volume) -->
<speak>
<prosody rate="slow" pitch="+5%">
Please speak clearly after the beep.
</prosody>
</speak>
Provider SSML Support
| Feature | Cartesia | ElevenLabs | Azure | Google |
|---|---|---|---|---|
| Pauses | ✅ | ✅ | ✅ | ✅ |
| Pronunciation | ❌ | ❌ | ✅ | ✅ |
| Emphasis | ❌ | ✅ | ✅ | ✅ |
| Prosody | ❌ | ✅ | ✅ | ✅ |
| Say-as (dates, numbers) | ❌ | ❌ | ✅ | ✅ |
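Because support varies, an agent that switches providers may need to degrade gracefully when a tag is unavailable. The sketch below encodes the Cartesia, ElevenLabs, and Azure columns of the table as a lookup and emits an `<emphasis>` tag only where it is supported. The map keys, feature names, and function names are assumptions for illustration; other providers can be added the same way.

```go
package main

import (
    "bytes"
    "encoding/xml"
    "fmt"
)

// ssmlSupport transcribes three columns of the table above.
// Provider and feature keys are illustrative identifiers.
var ssmlSupport = map[string]map[string]bool{
    "cartesia":   {"pauses": true},
    "elevenlabs": {"pauses": true, "emphasis": true, "prosody": true},
    "azure": {
        "pauses": true, "pronunciation": true, "emphasis": true,
        "prosody": true, "say-as": true,
    },
}

// supportsSSML reports whether the provider accepts the given SSML feature.
// Unknown providers and features report false.
func supportsSSML(provider, feature string) bool {
    return ssmlSupport[provider][feature]
}

// emphasize wraps text in a strong <emphasis> tag when the provider supports it.
// Otherwise it returns the text unchanged so it can be sent as plain text.
func emphasize(provider, text string) string {
    if !supportsSSML(provider, "emphasis") {
        return text
    }
    var buf bytes.Buffer
    buf.WriteString(`<emphasis level="strong">`)
    _ = xml.EscapeText(&buf, []byte(text)) // writes to bytes.Buffer do not fail
    buf.WriteString(`</emphasis>`)
    return buf.String()
}

func main() {
    fmt.Println(emphasize("azure", "confirmed"))
    fmt.Println(emphasize("cartesia", "confirmed"))
}
```

Escaping the text before wrapping it keeps characters such as `&` and `<` from producing malformed SSML.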
Caching Strategy
Cache frequently used phrases:
type TTSCache struct {
    cache map[string][]byte // must be initialized with make before use
    mu    sync.RWMutex
}

func (c *TTSCache) GetOrGenerate(text string, voice string) []byte {
    key := fmt.Sprintf("%s:%s", voice, hash(text))
    // Fast path: look up under the read lock
    c.mu.RLock()
    audio, ok := c.cache[key]
    c.mu.RUnlock()
    if ok {
        return audio
    }
    // Generate outside the lock so slow synthesis does not block readers.
    // Two concurrent misses for the same key may both synthesize; the last write wins.
    audio = tts.Synthesize(text, voice)
    c.mu.Lock()
    c.cache[key] = audio
    c.mu.Unlock()
    return audio
}

// Pre-cache common phrases
func preCacheGreetings(agent *Agent) {
    phrases := []string{
        agent.GreetingMessage,
        "One moment please.",
        "I'm looking that up for you.",
        "Is there anything else I can help with?",
        "Thank you for calling. Goodbye!",
    }
    for _, phrase := range phrases {
        cache.GetOrGenerate(phrase, agent.TTSVoice)
    }
}
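The cache above relies on a `hash` helper and an initialized map, neither of which is shown. The following self-contained sketch fills both gaps. It is an illustration under assumptions: `NewTTSCache` and `hashText` are hypothetical names, SHA-256 is one reasonable choice of hash, and the synthesizer is passed in as a function so the example runs without a real TTS client.

```go
package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "sync"
)

// TTSCache stores synthesized audio keyed by voice and text hash.
type TTSCache struct {
    mu    sync.RWMutex
    cache map[string][]byte
}

// NewTTSCache returns a cache with its map initialized.
// Writing to a nil map would panic, so always construct the cache this way.
func NewTTSCache() *TTSCache {
    return &TTSCache{cache: make(map[string][]byte)}
}

// hashText returns the hex-encoded SHA-256 digest of text.
func hashText(text string) string {
    sum := sha256.Sum256([]byte(text))
    return hex.EncodeToString(sum[:])
}

// GetOrGenerate returns cached audio or calls synth once and stores the result.
// synth stands in for a real TTS client call.
func (c *TTSCache) GetOrGenerate(text, voice string, synth func(text, voice string) []byte) []byte {
    key := voice + ":" + hashText(text)
    c.mu.RLock()
    audio, ok := c.cache[key]
    c.mu.RUnlock()
    if ok {
        return audio
    }
    audio = synth(text, voice)
    c.mu.Lock()
    c.cache[key] = audio
    c.mu.Unlock()
    return audio
}

func main() {
    calls := 0
    fake := func(text, voice string) []byte {
        calls++
        return []byte(voice + "|" + text)
    }
    c := NewTTSCache()
    c.GetOrGenerate("One moment please.", "voice-a", fake)
    c.GetOrGenerate("One moment please.", "voice-a", fake)
    fmt.Println("synth calls:", calls)
}
```

Hashing the text keeps keys at a fixed length regardless of phrase size, and including the voice in the key prevents one voice's audio from being served for another.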
Error Handling
func (t *TTS) synthesizeWithFallback(text string) ([]byte, error) {
    // Try primary provider
    audio, err := t.primary.Synthesize(text)
    if err == nil {
        return audio, nil
    }
    log.Printf("Primary TTS failed: %v, trying fallback", err)
    // Try fallback provider
    audio, err = t.fallback.Synthesize(text)
    if err != nil {
        return nil, fmt.Errorf("all TTS providers failed: %w", err)
    }
    return audio, nil
}
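The same pattern extends to any number of providers. The sketch below is a hedged generalization: the `Synthesizer` interface, `synthesizeInOrder`, and the `stubTTS` test double are hypothetical names, and `errors.Join` requires Go 1.20 or later. Unlike the two-provider version, the returned error carries every provider's failure, not only the last one.

```go
package main

import (
    "errors"
    "fmt"
)

// Synthesizer is a hypothetical interface matching the Synthesize calls above.
type Synthesizer interface {
    Synthesize(text string) ([]byte, error)
}

// errUnavailable is a placeholder error used by the stubs below.
var errUnavailable = errors.New("provider unavailable")

// synthesizeInOrder tries each provider in turn and returns the first success.
// If every provider fails, the returned error wraps all individual errors.
func synthesizeInOrder(text string, providers ...Synthesizer) ([]byte, error) {
    if len(providers) == 0 {
        return nil, errors.New("no TTS providers configured")
    }
    var errs []error
    for _, p := range providers {
        audio, err := p.Synthesize(text)
        if err == nil {
            return audio, nil
        }
        errs = append(errs, err)
    }
    return nil, fmt.Errorf("all TTS providers failed: %w", errors.Join(errs...))
}

// stubTTS is a test double that returns fixed audio or a fixed error.
type stubTTS struct {
    audio []byte
    err   error
}

func (s stubTTS) Synthesize(string) ([]byte, error) { return s.audio, s.err }

func main() {
    primary := stubTTS{err: errUnavailable}
    fallback := stubTTS{audio: []byte("audio-bytes")}
    audio, err := synthesizeInOrder("Hello", primary, fallback)
    fmt.Println(string(audio), err)
}
```

Ordering the providers by latency keeps the common case fast while still giving the caller audio when the preferred provider is down.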
Next Steps
- Cartesia Configuration - Fastest TTS
- ElevenLabs Configuration - Premium quality
- Azure Configuration - Enterprise + Indic
- Voice Selection Guide - Choose the right voice