# STT Providers Overview
Speech-to-Text (STT) is a critical component of your voice agent. The right choice depends on your language requirements, latency needs, and budget.
## Supported Providers
| Provider | Languages | Latency | Cost/min | Best For |
|---|---|---|---|---|
| Deepgram | 30+ | ~150ms | ~₹0.35 | Low latency, English |
| Google Chirp | 100+ | ~200ms | ~₹1.34 | Multi-language |
| Azure Speech | 100+ | ~180ms | ~₹0.84 | Enterprise, Indic |
| ElevenLabs Scribe | 30+ | ~250ms | ~₹0.56 | Indic languages |
| AssemblyAI | 10+ | ~200ms | ~₹0.63 | Accuracy |
| OpenAI Whisper | 50+ | ~300ms | ~₹0.50 | Quality over speed |
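To make the trade-offs in the table concrete, here is a small sketch that encodes the table and filters providers by a latency and cost budget. The figures simply mirror the (approximate) table above, and the `shortlist` helper is illustrative, not part of any SDK:

```python
# Illustrative: pick an STT provider from the comparison table above.
# Figures mirror the table (approximate latency in ms, cost in INR/min).
PROVIDERS = {
    "deepgram":   {"latency_ms": 150, "cost_inr": 0.35, "languages": 30},
    "google":     {"latency_ms": 200, "cost_inr": 1.34, "languages": 100},
    "azure":      {"latency_ms": 180, "cost_inr": 0.84, "languages": 100},
    "elevenlabs": {"latency_ms": 250, "cost_inr": 0.56, "languages": 30},
    "assemblyai": {"latency_ms": 200, "cost_inr": 0.63, "languages": 10},
    "whisper":    {"latency_ms": 300, "cost_inr": 0.50, "languages": 50},
}

def shortlist(max_latency_ms: int, max_cost_inr: float) -> list[str]:
    """Return providers meeting both constraints, fastest first."""
    ok = [(p, v) for p, v in PROVIDERS.items()
          if v["latency_ms"] <= max_latency_ms and v["cost_inr"] <= max_cost_inr]
    return [p for p, v in sorted(ok, key=lambda kv: kv[1]["latency_ms"])]

print(shortlist(200, 0.70))  # ['deepgram', 'assemblyai']
```

Tightening either budget shrinks the shortlist; an empty result means no single provider meets both constraints.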
## Choosing a Provider

### For Lowest Latency

**Recommended:** Deepgram Nova-3
- Industry-leading streaming latency (~150ms)
- Excellent English accuracy
- Interim results for faster response
### For Indian Languages

**Recommended:** ElevenLabs Scribe or Google Chirp
- Hindi, Tamil, Telugu, Assamese support
- Reasonable accuracy (roughly 10-25% word error rate)
- Good cost-performance ratio
### For Enterprise

**Recommended:** Azure Speech
- SOC 2, HIPAA compliance
- Excellent Indic language support
- Custom model training available
## Configuration

### Agent-Level Configuration
Each agent can have its own STT provider:
```json
{
  "name": "Hindi Support Agent",
  "language": "hi",
  "sttProvider": "elevenlabs",
  "sttModel": "scribe",
  "sttConfig": {
    "endpointing": 300,
    "utterance_end_ms": 1500
  }
}
```
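Because each agent can override the STT settings, a resolution step along these lines is typical. The field names follow the agent JSON above; the merge logic and the org-wide defaults shown here are assumptions for illustration:

```python
# Illustrative sketch: resolve an agent's STT settings against org defaults.
# Field names follow the agent JSON above; the defaults are hypothetical.
DEFAULTS = {"sttProvider": "deepgram", "sttModel": "nova-3",
            "sttConfig": {"endpointing": 300, "utterance_end_ms": 1500}}

def resolve_stt(agent: dict) -> dict:
    """Agent-level values override the defaults, including nested sttConfig."""
    merged = {**DEFAULTS, **{k: v for k, v in agent.items() if k in DEFAULTS}}
    merged["sttConfig"] = {**DEFAULTS["sttConfig"], **agent.get("sttConfig", {})}
    return merged

agent = {"name": "Hindi Support Agent", "language": "hi",
         "sttProvider": "elevenlabs", "sttModel": "scribe",
         "sttConfig": {"endpointing": 300, "utterance_end_ms": 1500}}
print(resolve_stt(agent)["sttProvider"])  # elevenlabs
```

An agent that sets no STT fields simply inherits the defaults.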
### Provider-Specific Settings

#### Deepgram
```json
{
  "sttProvider": "deepgram",
  "sttConfig": {
    "model": "nova-3",
    "endpointing": 200,
    "utterance_end_ms": 1000,
    "interim_results": true,
    "smart_format": true
  }
}
```
#### Google Chirp
```json
{
  "sttProvider": "google",
  "sttConfig": {
    "model": "chirp",
    "enable_automatic_punctuation": true,
    "enable_spoken_punctuation": false
  }
}
```
## Latency Comparison

```text
User finishes speaking
        │
        ▼
┌───────────────────────────────────────────────┐
│ Deepgram Nova-3    ████████░░░░░░░░  150ms    │
│ Azure Speech       █████████░░░░░░░  180ms    │
│ Google Chirp       ██████████░░░░░░  200ms    │
│ AssemblyAI         ██████████░░░░░░  200ms    │
│ ElevenLabs Scribe  ████████████░░░░  250ms    │
│ OpenAI Whisper     ███████████████░  300ms    │
└───────────────────────────────────────────────┘
                                       → Time
```
## Streaming vs Batch

### Streaming STT (Recommended)

- Real-time transcription as the user speaks
- Interim results enable faster LLM responses
- Lower perceived latency
```text
User: "What is my order—"
STT:  [interim] "What is my"
STT:  [interim] "What is my order"
STT:  [final]   "What is my order status?"
```
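A minimal consumer for such a stream might look like the sketch below: interim results update the display cheaply, and only final results are passed on to the LLM. The event shape (`type`/`text` keys) is an assumption about the transcript feed, not a documented API:

```python
# Illustrative consumer of a streaming STT feed: interim results update the
# display; only final results are acted on. Event shape is assumed.
def consume(events):
    display, finals = "", []
    for ev in events:
        if ev["type"] == "interim":
            display = ev["text"]       # cheap UI update, may still be revised
        elif ev["type"] == "final":
            finals.append(ev["text"])  # stable text, safe to send to the LLM
    return finals

events = [
    {"type": "interim", "text": "What is my"},
    {"type": "interim", "text": "What is my order"},
    {"type": "final",   "text": "What is my order status?"},
]
print(consume(events))  # ['What is my order status?']
```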
### Batch STT
- Full audio processed at once
- Higher accuracy potential
- Higher latency
## Endpointing Configuration

Endpointing determines when the user has finished speaking:

| Setting | Description | Recommended |
|---|---|---|
| `endpointing` | Silence before end-of-turn (ms) | 200-400ms |
| `utterance_end_ms` | Max silence within an utterance (ms) | 1000-1500ms |
| `vad_threshold` | Voice activity detection threshold | 0.7-0.9 |
**Aggressive (faster response):**

```json
{
  "endpointing": 200,
  "utterance_end_ms": 800
}
```
**Conservative (more complete sentences):**

```json
{
  "endpointing": 400,
  "utterance_end_ms": 1500
}
```
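Because the `endpointing` silence window elapses after the user stops speaking, it adds directly to perceived response time on top of the provider's own streaming latency. A quick sketch, assuming Deepgram's ~150ms figure from the table above:

```python
# The endpointing window is waited out AFTER the user stops speaking, so it
# adds directly to perceived latency. 150 ms provider latency is an assumption.
def perceived_latency_ms(provider_latency_ms: int, endpointing_ms: int) -> int:
    return provider_latency_ms + endpointing_ms

aggressive = perceived_latency_ms(150, 200)    # 350 ms before the LLM sees text
conservative = perceived_latency_ms(150, 400)  # 550 ms
print(conservative - aggressive)  # 200
```

In other words, the conservative preset trades roughly 200ms of extra wait for fewer premature cut-offs.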
## Cost Optimization
- **Use an appropriate model:** Nova-2 is cheaper than Nova-3
- **Optimize audio:** compress silence, use VAD
- **Cache common phrases:** skip STT for known patterns
- **Batch non-real-time work:** use the batch API for recordings
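As a rough budgeting aid, the per-minute rates from the provider table can be turned into a monthly estimate. The rates are approximate and the helper is illustrative:

```python
# Illustrative back-of-the-envelope STT cost estimate; rates mirror the
# provider table above and are approximate INR per minute.
RATE_INR_PER_MIN = {"deepgram": 0.35, "google": 1.34, "whisper": 0.50}

def monthly_cost_inr(provider: str, calls_per_day: int, avg_call_min: float,
                     days: int = 30) -> float:
    """Total transcription minutes per month times the per-minute rate."""
    return round(RATE_INR_PER_MIN[provider] * calls_per_day * avg_call_min * days, 2)

# 500 calls/day at 3 minutes each on Deepgram:
print(monthly_cost_inr("deepgram", 500, 3.0))  # 15750.0
```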
## Next Steps
- **Deepgram Configuration** - detailed Deepgram setup
- **TTS Providers** - Text-to-Speech options
- **Latency Optimization** - further reduce response time