ElevenLabs TTS
ElevenLabs offers highly natural-sounding AI voices with fine-grained emotion control and voice cloning, which makes it a good fit for premium voice experiences where quality matters more than latency or cost.
Why ElevenLabs?
| Feature | ElevenLabs | Cartesia |
|---|---|---|
| Voice Quality | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Emotion Control | ✅ Advanced | Limited |
| Voice Cloning | ✅ Yes | ❌ No |
| Time to First Chunk | ~150ms | ~50ms |
| Cost | $0.18/1K chars | $0.015/1K chars |
Best for: Premium experiences, brand voices, emotion-sensitive applications.
Configuration
Basic Setup
{
"agent": {
"name": "Premium Support",
"ttsProvider": "elevenlabs",
"ttsVoice": "21m00Tcm4TlvDq8ikWAM"
}
}
Environment Variables
ELEVENLABS_API_KEY=your_elevenlabs_api_key
Advanced Configuration
{
"ttsProvider": "elevenlabs",
"ttsVoice": "21m00Tcm4TlvDq8ikWAM",
"ttsConfig": {
"model_id": "eleven_turbo_v2_5",
"stability": 0.5,
"similarity_boost": 0.75,
"style": 0.0,
"use_speaker_boost": true,
"output_format": "pcm_16000"
}
}
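To catch typos in this block before they reach the API, the configuration can be decoded and range-checked at startup. The following is a minimal sketch, not part of any SDK: the `TTSConfig` type, `Validate`, and `ParseTTSConfig` are hypothetical names, and the 0.0-1.0 bounds are taken from the Voice Settings section of this page.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TTSConfig mirrors the "ttsConfig" object shown above.
// The type and its helpers are illustrative assumptions, not an official API.
type TTSConfig struct {
	ModelID         string  `json:"model_id"`
	Stability       float64 `json:"stability"`
	SimilarityBoost float64 `json:"similarity_boost"`
	Style           float64 `json:"style"`
	UseSpeakerBoost bool    `json:"use_speaker_boost"`
	OutputFormat    string  `json:"output_format"`
}

// Validate rejects an empty model ID and numeric settings outside 0.0-1.0.
func (c TTSConfig) Validate() error {
	if c.ModelID == "" {
		return fmt.Errorf("model_id is required")
	}
	ranged := map[string]float64{
		"stability":        c.Stability,
		"similarity_boost": c.SimilarityBoost,
		"style":            c.Style,
	}
	for name, v := range ranged {
		if v < 0 || v > 1 {
			return fmt.Errorf("%s must be between 0.0 and 1.0, got %v", name, v)
		}
	}
	return nil
}

// ParseTTSConfig decodes the agent JSON and validates its "ttsConfig" object.
func ParseTTSConfig(raw []byte) (TTSConfig, error) {
	var wrapper struct {
		TTSConfig TTSConfig `json:"ttsConfig"`
	}
	if err := json.Unmarshal(raw, &wrapper); err != nil {
		return TTSConfig{}, err
	}
	return wrapper.TTSConfig, wrapper.TTSConfig.Validate()
}

func main() {
	raw := []byte(`{
		"ttsProvider": "elevenlabs",
		"ttsConfig": {"model_id": "eleven_turbo_v2_5", "stability": 0.5,
		              "similarity_boost": 0.75, "output_format": "pcm_16000"}
	}`)
	cfg, err := ParseTTSConfig(raw)
	fmt.Println(cfg.ModelID, cfg.OutputFormat, err)
}
```

Failing fast here is a design choice: a bad value surfaces when the agent loads rather than in the middle of a call.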
Available Voices
Pre-made Voices
| Voice ID | Name | Gender | Style |
|---|---|---|---|
| 21m00Tcm4TlvDq8ikWAM | Rachel | Female | Warm, conversational |
| AZnzlk1XvdvUeBnXmlld | Domi | Female | Confident, assertive |
| EXAVITQu4vr4xnSDxMaL | Bella | Female | Soft, gentle |
| ErXwobaYiN019PkySvjV | Antoni | Male | Professional |
| MF3mGyEYCl7XYWbV9V6O | Elli | Female | Young, friendly |
| TxGEqnHWrfWFTfGW9XjX | Josh | Male | Deep, authoritative |
| VR6AewLTigWG4xSOukaG | Arnold | Male | Energetic |
| pNInz6obpgDQGcFmaJgB | Adam | Male | Clear, neutral |
| yoZ06aMxZJJ28mfd3POQ | Sam | Male | Casual, friendly |
Multilingual Models
| Model ID | Name | Languages |
|---|---|---|
| eleven_multilingual_v2 | Multilingual | 29 languages |
| eleven_turbo_v2_5 | Turbo | English optimized |
Implementation
WebSocket Streaming
type ElevenLabsTTS struct {
apiKey string
voiceID string
modelID string
conn *websocket.Conn
}
func (e *ElevenLabsTTS) Connect(ctx context.Context) error {
wsURL := fmt.Sprintf(
"wss://api.elevenlabs.io/v1/text-to-speech/%s/stream-input?model_id=%s",
e.voiceID,
e.modelID,
)
headers := http.Header{}
headers.Set("xi-api-key", e.apiKey)
conn, _, err := websocket.DefaultDialer.DialContext(ctx, wsURL, headers)
if err != nil {
return err
}
e.conn = conn
// Send initial config
config := map[string]any{
"text": " ",
"voice_settings": map[string]any{
"stability": 0.5,
"similarity_boost": 0.75,
},
"generation_config": map[string]any{
"chunk_length_schedule": []int{120, 160, 250, 290},
},
"xi_api_key": e.apiKey,
}
return conn.WriteJSON(config)
}
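`Connect` builds the URL with `fmt.Sprintf`, which works for well-formed IDs but does no escaping or validation. As one possible alternative, the sketch below assembles the same endpoint with `net/url`; `buildStreamURL` is a hypothetical helper, and the host and path are copied from the code above.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// buildStreamURL returns the stream-input WebSocket URL for a voice.
// It is an illustrative helper; only the host and path come from Connect.
func buildStreamURL(voiceID, modelID string) (string, error) {
	if voiceID == "" || strings.Contains(voiceID, "/") {
		return "", errors.New("voice ID must be non-empty and must not contain '/'")
	}
	query := url.Values{}
	if modelID != "" {
		query.Set("model_id", modelID) // query values are escaped by Encode
	}
	u := url.URL{
		Scheme:   "wss",
		Host:     "api.elevenlabs.io",
		Path:     "/v1/text-to-speech/" + voiceID + "/stream-input",
		RawQuery: query.Encode(),
	}
	return u.String(), nil
}

func main() {
	u, err := buildStreamURL("21m00Tcm4TlvDq8ikWAM", "eleven_turbo_v2_5")
	fmt.Println(u, err)
}
```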
Streaming Text
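func (e *ElevenLabsTTS) StreamText(text string) <-chan []byte {
    audioChan := make(chan []byte)
    go func() {
        defer close(audioChan)
        // Send the text and ask the server to start generating right away
        msg := map[string]any{
            "text":                   text,
            "try_trigger_generation": true,
        }
        if err := e.conn.WriteJSON(msg); err != nil {
            return
        }
        // Signal end of input with an empty string
        if err := e.conn.WriteJSON(map[string]any{"text": ""}); err != nil {
            return
        }
        // Receive audio chunks until the final message or a read error
        for {
            _, data, err := e.conn.ReadMessage()
            if err != nil {
                return
            }
            var response ElevenLabsResponse
            if err := json.Unmarshal(data, &response); err != nil {
                continue // skip messages that are not valid JSON
            }
            if response.Audio != "" {
                // Audio arrives base64-encoded; drop chunks that fail to decode
                if audio, err := base64.StdEncoding.DecodeString(response.Audio); err == nil {
                    audioChan <- audio
                }
            }
            if response.IsFinal {
                return
            }
        }
    }()
    return audioChan
}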
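The receive loop in `StreamText` relies on an `ElevenLabsResponse` type that is not defined on this page. The sketch below shows one way the per-message decoding could look in isolation, which also makes it easy to unit test without a live connection. The JSON field names `audio` and `isFinal` are assumptions inferred from the `Audio` and `IsFinal` fields used above, and `decodeChunk` is a hypothetical helper.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// ElevenLabsResponse is an assumed shape for one server message.
type ElevenLabsResponse struct {
	Audio   string `json:"audio"`   // base64-encoded audio, may be empty
	IsFinal bool   `json:"isFinal"` // true on the last message of a generation
}

// decodeChunk parses one raw message. It returns the decoded audio bytes
// (nil when the message carries no audio) and whether the stream is finished.
func decodeChunk(raw []byte) ([]byte, bool, error) {
	var resp ElevenLabsResponse
	if err := json.Unmarshal(raw, &resp); err != nil {
		return nil, false, fmt.Errorf("decode message: %w", err)
	}
	if resp.Audio == "" {
		return nil, resp.IsFinal, nil
	}
	audio, err := base64.StdEncoding.DecodeString(resp.Audio)
	if err != nil {
		return nil, false, fmt.Errorf("decode audio: %w", err)
	}
	return audio, resp.IsFinal, nil
}

func main() {
	// "AAEC" is the base64 encoding of the three bytes 0, 1, 2.
	audio, final, err := decodeChunk([]byte(`{"audio":"AAEC","isFinal":false}`))
	fmt.Println(len(audio), final, err)
}
```

Returning the error instead of discarding it lets the caller decide whether a malformed message should end the stream or simply be skipped.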
Voice Settings
Fine-tune voice characteristics:
Stability
Controls consistency vs expressiveness:
type VoiceSettings struct {
Stability float64 // 0.0-1.0
SimilarityBoost float64 // 0.0-1.0
Style float64 // 0.0-1.0
SpeakerBoost bool
}
// More stable (consistent)
stable := VoiceSettings{
Stability: 0.8,
SimilarityBoost: 0.5,
}
// More expressive (variable)
expressive := VoiceSettings{
Stability: 0.3,
SimilarityBoost: 0.8,
Style: 0.5,
}
| Setting | Low (0.0-0.3) | Medium (0.4-0.6) | High (0.7-1.0) |
|---|---|---|---|
| Stability | Expressive, varied | Balanced | Consistent, monotone |
| Similarity | Different from original | Balanced | Close to original |
| Style | Neutral | Some emotion | Strong emotion |
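Since every numeric setting in the table lives in the 0.0-1.0 range, it can help to clamp values before sending them. The following sketch assumes the `VoiceSettings` struct from this section, adds JSON tags that match the keys used in the Advanced Configuration block, and introduces two hypothetical methods, `Normalized` and `JSON`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VoiceSettings matches the struct above, with JSON tags added so it can be
// serialized under the keys used in the configuration examples.
type VoiceSettings struct {
	Stability       float64 `json:"stability"`
	SimilarityBoost float64 `json:"similarity_boost"`
	Style           float64 `json:"style"`
	SpeakerBoost    bool    `json:"use_speaker_boost"`
}

// clamp01 limits v to the closed interval [0, 1].
func clamp01(v float64) float64 {
	if v < 0 {
		return 0
	}
	if v > 1 {
		return 1
	}
	return v
}

// Normalized returns a copy with every numeric field clamped to 0.0-1.0.
func (s VoiceSettings) Normalized() VoiceSettings {
	s.Stability = clamp01(s.Stability)
	s.SimilarityBoost = clamp01(s.SimilarityBoost)
	s.Style = clamp01(s.Style)
	return s
}

// JSON serializes the normalized settings.
func (s VoiceSettings) JSON() (string, error) {
	b, err := json.Marshal(s.Normalized())
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	settings := VoiceSettings{Stability: 1.4, SimilarityBoost: 0.75, SpeakerBoost: true}
	out, err := settings.JSON()
	fmt.Println(out, err) // stability is clamped from 1.4 to 1
}
```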
Emotion Control
Add emotional nuance:
// Dynamic emotion based on context
func getVoiceSettings(sentiment string) VoiceSettings {
switch sentiment {
case "empathetic":
return VoiceSettings{
Stability: 0.4,
SimilarityBoost: 0.7,
Style: 0.3,
}
case "professional":
return VoiceSettings{
Stability: 0.7,
SimilarityBoost: 0.5,
Style: 0.1,
}
case "enthusiastic":
return VoiceSettings{
Stability: 0.3,
SimilarityBoost: 0.8,
Style: 0.6,
}
default:
return VoiceSettings{
Stability: 0.5,
SimilarityBoost: 0.75,
Style: 0.0,
}
}
}
Voice Cloning
Create custom brand voices:
Instant Voice Cloning
func cloneVoice(audioSamples [][]byte, name string, description string) (*Voice, error) {
client := NewElevenLabsClient(apiKey)
// Upload samples (minimum 1 minute of clean audio)
voice, err := client.CloneVoice(CloneParams{
Name: name,
Description: description,
Samples: audioSamples,
Labels: map[string]string{
"accent": "american",
"gender": "female",
"age": "adult",
"use_case": "customer_support",
},
})
return voice, err
}
Professional Voice Cloning
For the highest quality (requires ElevenLabs approval):
- Record 30+ minutes of clean audio
- Submit for professional cloning
- Receive custom voice model
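Both cloning paths depend on having enough audio: at least 1 minute for instant cloning and 30+ minutes for professional cloning. A pre-upload check along the following lines could reject recordings that are too short. This is only a sketch: it assumes the samples are raw 16-bit mono PCM at a known sample rate, and every function name in it is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Minimum recording lengths quoted in this section.
const (
	minInstantCloneAudio      = 1 * time.Minute
	minProfessionalCloneAudio = 30 * time.Minute
)

// pcmDuration returns the playing time of raw 16-bit mono PCM audio.
// It returns 0 for a non-positive sample rate.
func pcmDuration(numBytes, sampleRate int) time.Duration {
	if sampleRate <= 0 {
		return 0
	}
	const bytesPerSample = 2
	samples := numBytes / bytesPerSample
	return time.Duration(samples) * time.Second / time.Duration(sampleRate)
}

// checkSamples returns an error when the combined length of all samples
// is shorter than minimum.
func checkSamples(samples [][]byte, sampleRate int, minimum time.Duration) error {
	var total time.Duration
	for _, s := range samples {
		total += pcmDuration(len(s), sampleRate)
	}
	if total < minimum {
		return fmt.Errorf("need at least %v of audio, got %v", minimum, total)
	}
	return nil
}

func main() {
	// 45 seconds of silence at 16 kHz: 45 s * 16000 samples/s * 2 bytes.
	sample := make([]byte, 45*16000*2)
	fmt.Println(checkSamples([][]byte{sample}, 16000, minInstantCloneAudio))
	fmt.Println(checkSamples([][]byte{sample, sample}, 16000, minInstantCloneAudio))
}
```

Compressed inputs such as MP3 would need to be decoded or probed for duration first; this check only covers raw PCM.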
Model Comparison
| Model | Latency | Quality | Languages | Cost |
|---|---|---|---|---|
| eleven_turbo_v2_5 | ⚡ Fastest | ⭐⭐⭐⭐ | English | $0.15/1K |
| eleven_multilingual_v2 | 🚀 Fast | ⭐⭐⭐⭐⭐ | 29 | $0.18/1K |
| eleven_monolingual_v1 | 🚀 Fast | ⭐⭐⭐⭐ | English | $0.15/1K |
{
"ttsConfig": {
"model_id": "eleven_turbo_v2_5"
}
}
Audio Output Formats
{
"ttsConfig": {
"output_format": "pcm_16000"
}
}
| Format | Sample Rate | Best For |
|---|---|---|
| pcm_16000 | 16 kHz | Voice agents |
| pcm_22050 | 22.05 kHz | Higher quality |
| pcm_24000 | 24 kHz | Highest quality |
| mp3_44100_128 | 44.1 kHz | Playback/storage |
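When wiring PCM output into a telephony or WebRTC pipeline, the sample rate determines how many bytes make up each audio frame. The sketch below derives the rate from a `pcm_<rate>` format name and computes the frame size. It assumes 16-bit mono samples, and both helper functions are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pcmSampleRate extracts the sample rate in Hz from a "pcm_<rate>" format name.
func pcmSampleRate(format string) (int, error) {
	if !strings.HasPrefix(format, "pcm_") {
		return 0, fmt.Errorf("%q is not a PCM format", format)
	}
	rate, err := strconv.Atoi(strings.TrimPrefix(format, "pcm_"))
	if err != nil || rate <= 0 {
		return 0, fmt.Errorf("invalid sample rate in %q", format)
	}
	return rate, nil
}

// frameBytes returns the size of one frame of frameMillis milliseconds,
// assuming 16-bit (2-byte) mono samples.
func frameBytes(sampleRate, frameMillis int) int {
	const bytesPerSample = 2
	return sampleRate * frameMillis / 1000 * bytesPerSample
}

func main() {
	for _, format := range []string{"pcm_16000", "pcm_22050", "pcm_24000"} {
		rate, err := pcmSampleRate(format)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s: %d bytes per 20 ms frame\n", format, frameBytes(rate, 20))
	}
}
```

At 16 kHz a 20 ms frame is 320 samples, or 640 bytes, which is part of why `pcm_16000` is a convenient default for voice agents.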
Cost Optimization
Pricing
| Plan | Price per 1K chars | Monthly Chars |
|---|---|---|
| Free | $0 | 10K |
| Starter | $0.30 | 30K |
| Creator | $0.24 | 100K |
| Pro | $0.18 | 500K |
| Scale | $0.12 | 2M+ |
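To turn the table into a budget, multiply the expected character volume by the per-1K price. The sketch below hard-codes the rows above (check current pricing before relying on them) and picks the smallest plan whose allowance covers a given volume. The `Plan` type, both functions, and the traffic numbers in `main` are illustrative assumptions.

```go
package main

import "fmt"

// Plan is one row of the pricing table above.
type Plan struct {
	Name         string
	PricePer1K   float64 // USD per 1,000 characters
	MonthlyChars int     // characters included per month
}

// plans is ordered from smallest to largest allowance.
var plans = []Plan{
	{"Free", 0, 10_000},
	{"Starter", 0.30, 30_000},
	{"Creator", 0.24, 100_000},
	{"Pro", 0.18, 500_000},
	{"Scale", 0.12, 2_000_000},
}

// estimateCost returns the cost in USD of synthesizing chars characters.
func estimateCost(chars int, pricePer1K float64) float64 {
	return float64(chars) / 1000 * pricePer1K
}

// smallestPlan returns the first plan whose allowance covers chars,
// falling back to the largest plan for higher volumes.
func smallestPlan(chars int) Plan {
	for _, p := range plans {
		if chars <= p.MonthlyChars {
			return p
		}
	}
	return plans[len(plans)-1]
}

func main() {
	// Assumed traffic: 1,000 calls per month at about 400 characters each.
	monthly := 1000 * 400
	plan := smallestPlan(monthly)
	fmt.Printf("%d chars/month: %s plan, about $%.2f\n",
		monthly, plan.Name, estimateCost(monthly, plan.PricePer1K))
}
```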
Optimization Strategies
// 1. Cache common phrases
type TTSCache struct {
cache map[string][]byte
}
func (c *TTSCache) GetOrGenerate(text, voice string) []byte {
key := fmt.Sprintf("%s:%s", voice, hash(text))
if audio, ok := c.cache[key]; ok {
return audio // Cache hit: no API call, no characters billed
}
audio := tts.Synthesize(text)
c.cache[key] = audio
return audio
}
// 2. Use Turbo for English
func selectModel(language string) string {
if strings.HasPrefix(language, "en") {
return "eleven_turbo_v2_5" // Cheaper
}
return "eleven_multilingual_v2"
}
// 3. Shorten responses
func optimizeText(text string) string {
// Remove filler words
text = strings.ReplaceAll(text, "actually, ", "")
text = strings.ReplaceAll(text, "basically, ", "")
return text
}
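The `GetOrGenerate` snippet above leaves `hash` and `tts.Synthesize` undefined and is not safe for concurrent use. A self-contained variant might look like the following sketch, which keys entries by voice plus a SHA-256 of the text, guards the map with a mutex, and takes the synthesis call as an injected function. `SynthesizeFunc` and `NewTTSCache` are hypothetical names standing in for a real TTS client.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// SynthesizeFunc stands in for a real TTS call (an assumption of this sketch).
type SynthesizeFunc func(text, voice string) ([]byte, error)

// TTSCache stores synthesized audio in memory, keyed by voice and text hash.
type TTSCache struct {
	mu         sync.Mutex
	entries    map[string][]byte
	synthesize SynthesizeFunc
}

// NewTTSCache returns an empty cache that calls fn on a miss.
func NewTTSCache(fn SynthesizeFunc) *TTSCache {
	return &TTSCache{entries: make(map[string][]byte), synthesize: fn}
}

// cacheKey combines the voice ID with a SHA-256 digest of the text.
func cacheKey(text, voice string) string {
	sum := sha256.Sum256([]byte(text))
	return voice + ":" + hex.EncodeToString(sum[:])
}

// GetOrGenerate returns cached audio or synthesizes and stores it.
// The lock is held during synthesis, which serializes misses but
// guarantees each phrase is generated at most once.
func (c *TTSCache) GetOrGenerate(text, voice string) ([]byte, error) {
	key := cacheKey(text, voice)
	c.mu.Lock()
	defer c.mu.Unlock()
	if audio, ok := c.entries[key]; ok {
		return audio, nil
	}
	audio, err := c.synthesize(text, voice)
	if err != nil {
		return nil, err // failures are not cached
	}
	c.entries[key] = audio
	return audio, nil
}

func main() {
	calls := 0
	cache := NewTTSCache(func(text, voice string) ([]byte, error) {
		calls++
		return []byte("audio:" + text), nil
	})
	cache.GetOrGenerate("One moment please.", "21m00Tcm4TlvDq8ikWAM")
	cache.GetOrGenerate("One moment please.", "21m00Tcm4TlvDq8ikWAM")
	fmt.Println("synthesize calls:", calls)
}
```

An unbounded map is fine for a fixed set of stock phrases; caching arbitrary responses would call for an eviction policy, which this sketch leaves out.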
Error Handling
func (e *ElevenLabsTTS) handleError(err error) {
switch {
case strings.Contains(err.Error(), "quota_exceeded"):
log.Error("Character quota exceeded")
e.switchToFallback()
case strings.Contains(err.Error(), "voice_not_found"):
log.Error("Voice ID invalid")
e.useDefaultVoice()
case strings.Contains(err.Error(), "rate_limit"):
log.Warn("Rate limited")
time.Sleep(time.Second)
e.retry()
}
}
func (e *ElevenLabsTTS) switchToFallback() {
// Use Cartesia as fallback
e.fallback = NewCartesiaTTS(e.cartesiaKey)
}
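The rate-limit branch above sleeps for a fixed second and retries once. A slightly more defensive pattern is to retry a bounded number of times with exponential backoff, and to return other errors immediately so they can be routed to the fallback or default-voice paths. The sketch below assumes errors can be classified by the same substrings used in `handleError`; `classify`, `retryRateLimited`, and `apiError` are hypothetical helpers.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

type errorKind int

const (
	kindUnknown errorKind = iota
	kindQuota
	kindVoice
	kindRateLimit
)

// classify maps a non-nil error to a kind using the substrings from handleError.
func classify(err error) errorKind {
	msg := err.Error()
	switch {
	case strings.Contains(msg, "quota_exceeded"):
		return kindQuota
	case strings.Contains(msg, "voice_not_found"):
		return kindVoice
	case strings.Contains(msg, "rate_limit"):
		return kindRateLimit
	}
	return kindUnknown
}

// apiError builds a stand-in for an error returned by the TTS API.
func apiError(code string) error { return errors.New(code) }

// retryRateLimited calls fn up to attempts times, doubling delay after each
// rate-limit error. Any other result (success or error) is returned at once.
func retryRateLimited(attempts int, delay time.Duration, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		err = fn()
		if err == nil || classify(err) != kindRateLimit {
			return err
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("still rate limited after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retryRateLimited(4, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return apiError("rate_limit")
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```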
Best Practices
1. Pre-cache Greetings
func preCacheCommonPhrases(agent *Agent) {
phrases := []string{
agent.GreetingMessage,
"One moment please.",
"I understand.",
"Is there anything else?",
"Thank you, goodbye.",
}
for _, phrase := range phrases {
cache.Generate(phrase, agent.TTSVoice)
}
}
2. Adjust Settings by Context
// The parameter is named convCtx so it does not shadow the standard context package.
func (e *ElevenLabsTTS) synthesizeWithContext(text string, convCtx *ConversationContext) {
    settings := e.defaultSettings
    // More empathetic for complaints
    if convCtx.Sentiment == "negative" {
        settings.Stability = 0.4
        settings.Style = 0.3
    }
    // More professional for business; this check runs last, so it takes
    // precedence when a billing conversation is also negative
    if convCtx.Topic == "billing" {
        settings.Stability = 0.7
        settings.Style = 0.1
    }
    e.synthesize(text, settings)
}
3. Handle Long Text
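func (e *ElevenLabsTTS) synthesizeLongText(ctx context.Context, text string) error {
    // Split at sentence boundaries
    sentences := splitIntoSentences(text)
    for _, sentence := range sentences {
        // StreamText ends each request with an empty-text message, after which
        // the server closes the socket, so open a fresh connection per sentence
        if err := e.Connect(ctx); err != nil {
            return err
        }
        for chunk := range e.StreamText(sentence) {
            e.output <- chunk
        }
    }
    return nil
}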
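`splitIntoSentences` is used above but not defined on this page. One simple way to implement it is to split after `.`, `!`, or `?` when the next character is whitespace or the end of the text, as in the sketch below. This is a naive heuristic rather than a full sentence segmenter: it keeps decimals such as "2.5" intact but will still split after abbreviations like "Dr.".

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitIntoSentences splits text after '.', '!' or '?' when the terminator
// is followed by whitespace or ends the text. Empty pieces are dropped.
func splitIntoSentences(text string) []string {
	var sentences []string
	runes := []rune(text)
	start := 0
	for i, r := range runes {
		if r != '.' && r != '!' && r != '?' {
			continue
		}
		atEnd := i == len(runes)-1
		if atEnd || unicode.IsSpace(runes[i+1]) {
			if s := strings.TrimSpace(string(runes[start : i+1])); s != "" {
				sentences = append(sentences, s)
			}
			start = i + 1
		}
	}
	// Keep any trailing text that has no terminator.
	if s := strings.TrimSpace(string(runes[start:])); s != "" {
		sentences = append(sentences, s)
	}
	return sentences
}

func main() {
	text := "Your order shipped today. Version 2.5 is included! Anything else"
	for i, s := range splitIntoSentences(text) {
		fmt.Printf("%d: %s\n", i+1, s)
	}
}
```

Streaming sentence by sentence lets playback begin as soon as the first sentence is ready instead of waiting for the whole response.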
Next Steps
- Cartesia - Lower latency alternative
- Azure TTS - Enterprise option
- Voice Selection Guide - Choosing voices