# STT Providers Overview
Speech-to-Text (STT) converts the caller's audio into text that the LLM can understand. Choosing the right STT provider significantly impacts accuracy, latency, and cost.
## Supported Providers
| Provider | Model | Latency | Languages | Best For |
|---|---|---|---|---|
| Deepgram | Nova-3 | ⚡ Fastest | 35+ | Production voice agents |
| Google | Chirp 2 | 🚀 Fast | 125+ | Indic languages |
| Azure | Neural | 🚀 Fast | 100+ | Enterprise |
| ElevenLabs | Scribe | 🚀 Fast | 99+ | Regional languages |
| AssemblyAI | Universal-2 | 🚀 Fast | 50+ | Accuracy-focused |
| OpenAI | Whisper | 🐢 Moderate | 100+ | Multilingual |
## Quick Comparison
**Latency (Time to First Partial, lower is better):**

```
Deepgram Nova-3   ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░   80ms
Google Chirp      ████████░░░░░░░░░░░░░░░░░░░░░░░░  120ms
Azure Neural      ████████░░░░░░░░░░░░░░░░░░░░░░░░  130ms
ElevenLabs        ██████████░░░░░░░░░░░░░░░░░░░░░░  150ms
AssemblyAI        ████████████░░░░░░░░░░░░░░░░░░░░  180ms
OpenAI Whisper    ████████████████████████████████  300ms
                  0ms             150ms          300ms
```
## Choosing the Right Provider

### For Low Latency (Recommended)

**Deepgram Nova-3**
- Fastest time-to-first-partial (~80ms)
- Excellent for real-time voice agents
- Best endpointing (detects speech completion)
- Smart formatting included
```json
{
  "sttProvider": "deepgram",
  "sttModel": "nova-3"
}
```
### For Indic Languages

**Google Chirp 2**
- Best accuracy for Hindi, Tamil, Telugu, Bengali
- 125+ languages supported
- Chirp 2 is optimized for telephony
```json
{
  "sttProvider": "google",
  "sttModel": "chirp_2"
}
```
### For Regional Languages

**ElevenLabs Scribe**
- Excellent for Assamese, Odia, Punjabi
- 99 languages with good regional coverage
- Competitive pricing
```json
{
  "sttProvider": "elevenlabs"
}
```
### For Enterprise

**Azure Neural**
- Enterprise SLAs and compliance
- Custom speech models available
- Global deployment options
```json
{
  "sttProvider": "azure"
}
```
## Streaming vs Batch
All our STT integrations use streaming for real-time voice agents:
```
Batch STT (not suitable for voice):
────────────────────────────────────
User speaks for 5 seconds → Wait → Get full transcript

Streaming STT (what we use):
────────────────────────────────────
User speaks → Interim results every 100ms → Final transcript
                     │
                     └── LLM can start preparing response
```
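In code, a streaming session boils down to pushing small audio frames while concurrently reading transcript events. The sketch below is illustrative only; the `STTStream` interface and its method names are not from a specific SDK, and `TranscriptEvent` is the event type shown in the next section:

```go
// Illustrative streaming STT contract; names are not from a specific SDK.
type STTStream interface {
	// SendAudio pushes one 20 ms telephony frame (160 bytes of 8 kHz μ-law).
	SendAudio(frame []byte) error
	// Results delivers interim and final TranscriptEvents as they arrive.
	Results() <-chan TranscriptEvent
	// Close flushes buffered audio and ends the stream.
	Close() error
}
```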
## Interim Results
Interim results allow faster response preparation:
```go
// STT emits interim results as the user speaks.
type TranscriptEvent struct {
	Text      string
	IsFinal   bool
	Stability float32 // 0.0-1.0, higher = more stable
}

// Example stream for "What is my order status?"
// t=100ms: {Text: "What", IsFinal: false, Stability: 0.8}
// t=200ms: {Text: "What is", IsFinal: false, Stability: 0.85}
// t=300ms: {Text: "What is my", IsFinal: false, Stability: 0.9}
// t=500ms: {Text: "What is my order", IsFinal: false, Stability: 0.92}
// t=700ms: {Text: "What is my order status", IsFinal: true, Stability: 1.0}
```
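One way to exploit this is to start preparing the response on a high-stability interim and commit only on the final transcript. The sketch below is illustrative; `prewarm` and the event channel wiring are placeholders, not part of any provider SDK:

```go
// Sketch: start preparing the LLM turn on high-stability interims,
// commit only when the final transcript arrives.
func consumeTranscripts(events <-chan TranscriptEvent, prewarm func(text string)) string {
	for ev := range events {
		if ev.IsFinal {
			return ev.Text // final transcript goes to the LLM
		}
		if ev.Stability >= 0.9 {
			prewarm(ev.Text) // e.g. begin retrieval / prompt assembly early
		}
	}
	return "" // stream closed without a final result
}
```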
## Endpointing
Endpointing detects when the user has finished speaking:
| Provider | Endpointing | Configurable |
|---|---|---|
| Deepgram | ⭐⭐⭐⭐⭐ Smart | Yes |
| Google | ⭐⭐⭐⭐ Good | Yes |
| Azure | ⭐⭐⭐⭐ Good | Yes |
| ElevenLabs | ⭐⭐⭐ Basic | Limited |
| AssemblyAI | ⭐⭐⭐⭐ Good | Yes |
### Deepgram Endpointing Configuration
```json
{
  "sttProvider": "deepgram",
  "sttConfig": {
    "endpointing": 300,
    "utterance_end_ms": 1000,
    "interim_results": true
  }
}
```
| Parameter | Default | Description |
|---|---|---|
| `endpointing` | 300 | Silence duration (ms) that triggers `is_final` |
| `utterance_end_ms` | 1000 | Maximum wait (ms) for speech completion |
| `interim_results` | true | Enable real-time partial transcripts |
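In practice these two signals drive turn-taking: `is_final` marks a completed segment after the configured silence, and the utterance-end ceiling keeps the agent from waiting indefinitely. A rough sketch of that logic, building on the `TranscriptEvent` type above and assuming the standard `time` package is imported (illustrative only, not a provider API):

```go
// Sketch: treat the first final transcript as the end of the caller's turn,
// bounded by a hard ceiling so the agent never waits forever.
func waitForTurnEnd(events <-chan TranscriptEvent, ceiling time.Duration) (string, bool) {
	deadline := time.After(ceiling) // cf. utterance_end_ms
	for {
		select {
		case ev, ok := <-events:
			if !ok {
				return "", false // stream closed
			}
			if ev.IsFinal {
				return ev.Text, true // endpointing declared the segment complete
			}
		case <-deadline:
			return "", false // no final transcript within the ceiling
		}
	}
}
```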
## Language Support Matrix
| Language | Deepgram | Google | Azure | ElevenLabs | AssemblyAI |
|---|---|---|---|---|---|
| English (US) | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Hindi | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| Spanish | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| French | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| German | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Tamil | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
| Telugu | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
| Bengali | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
| Assamese | ❌ | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ | ❌ |
| Japanese | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
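If you route calls by caller language, the matrix can be collapsed into a simple default mapping. The sketch below is a simplification of the table, not an exhaustive rule set; the returned strings match the `sttProvider` values used in the config examples on this page:

```go
// Sketch: choose a default STT provider from the caller's ISO language code.
func defaultSTTProvider(lang string) string {
	switch lang {
	case "hi", "ta", "te", "bn": // Hindi, Tamil, Telugu, Bengali
		return "google" // Chirp 2 scores highest for Indic languages
	case "as", "or", "pa": // Assamese, Odia, Punjabi
		return "elevenlabs" // best regional coverage in the matrix
	default:
		return "deepgram" // lowest latency for broadly supported languages
	}
}
```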
## Cost Comparison
| Provider | Cost per Minute | Monthly (1M minutes) |
|---|---|---|
| Deepgram Nova-3 | $0.0043 | $4,300 |
| Google Chirp | $0.016 | $16,000 |
| Azure Neural | $0.016 | $16,000 |
| ElevenLabs | $0.007 | $7,000 |
| AssemblyAI | $0.0055 | $5,500 |
| OpenAI Whisper | $0.006 | $6,000 |
## Audio Requirements
| Parameter | Requirement | Notes |
|---|---|---|
| Sample Rate | 8000 Hz | Telephony standard (μ-law) |
| Channels | Mono | Single channel |
| Bit Depth | 16-bit | Linear PCM |
| Encoding | Linear16 or μ-law | Provider dependent |
### Audio Conversion
```go
// Telephony sends μ-law, STT may need Linear16
func mulawToLinear16(mulaw []byte) []int16 {
	linear := make([]int16, len(mulaw))
	for i, sample := range mulaw {
		linear[i] = mulawToLinearSample(sample)
	}
	return linear
}
```
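`mulawToLinearSample` is not shown above; one way to implement it is the standard G.711 μ-law expansion. This is a sketch, not code taken from this codebase:

```go
// mulawToLinearSample expands one G.711 μ-law byte into a 16-bit PCM sample.
func mulawToLinearSample(mu byte) int16 {
	mu = ^mu // μ-law bytes are transmitted bit-inverted
	sign := mu & 0x80
	exponent := (mu >> 4) & 0x07
	mantissa := mu & 0x0F
	// Reconstruct the magnitude: restore the implicit bias, then shift by segment.
	magnitude := (int16(mantissa)<<3 + 0x84) << exponent
	if sign != 0 {
		return 0x84 - magnitude
	}
	return magnitude - 0x84
}
```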
## Error Handling
Handle STT failures gracefully:
```go
func handleSTTError(err error) {
	switch {
	case errors.Is(err, ErrAudioTooQuiet):
		// Ask the user to speak louder
		tts.Speak("I'm having trouble hearing you. Could you speak a bit louder?")
	case errors.Is(err, ErrConnectionLost):
		// Reconnect automatically
		stt.Reconnect()
	case errors.Is(err, ErrRateLimited):
		// Use the fallback provider
		stt.SwitchProvider("fallback")
	}
}
```
## Next Steps
- Deepgram Configuration - Fastest STT for voice
- Google Chirp - Best for Indic languages
- Azure Speech - Enterprise deployment
- ElevenLabs Scribe - Regional languages