# Deepgram STT

Deepgram is our recommended STT provider for English-language voice agents due to its industry-leading latency and accuracy.
## Why Deepgram?

- Lowest Latency: ~150ms streaming latency
- High Accuracy: Nova-3 achieves <5% WER on clean audio
- Interim Results: real-time partial transcriptions
- Cost Effective: ~$0.0043/minute on pay-as-you-go
## Configuration

### Environment Variables

```bash
DEEPGRAM_API_KEY=your_deepgram_api_key
```
### Agent Configuration

```json
{
  "sttProvider": "deepgram",
  "sttConfig": {
    "model": "nova-3",
    "language": "en",
    "endpointing": 200,
    "utterance_end_ms": 1000,
    "interim_results": true,
    "smart_format": true,
    "punctuate": true,
    "diarize": false
  }
}
```
## Model Options

| Model | Accuracy | Latency | Cost | Best For |
|---|---|---|---|---|
| `nova-3` | Highest | ~150ms | $$$ | Production |
| `nova-2` | High | ~150ms | $$ | Cost-sensitive |
| `nova` | Good | ~150ms | $ | Basic use |
| `enhanced` | Good | ~200ms | $ | General |
| `base` | Basic | ~150ms | $ | Development |
## Endpointing Settings

Fine-tune when Deepgram determines the user has finished speaking:

```go
type DeepgramConfig struct {
	Endpointing    int  `json:"endpointing"`      // Silence before final (ms)
	UtteranceEndMs int  `json:"utterance_end_ms"` // Max mid-utterance silence (ms)
	InterimResults bool `json:"interim_results"`  // Enable partial results
}
```
### Recommended Settings

For conversational agents (faster response):

```json
{
  "endpointing": 200,
  "utterance_end_ms": 1000,
  "interim_results": true
}
```

For dictation/longer utterances:

```json
{
  "endpointing": 500,
  "utterance_end_ms": 2000,
  "interim_results": true
}
```
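The two presets above can be captured as small constructors so call sites don't repeat magic numbers. This is a sketch; the type and function names are ours:

```go
package main

import "fmt"

// EndpointingConfig holds the silence-detection knobs from the examples above.
type EndpointingConfig struct {
	Endpointing    int  `json:"endpointing"`
	UtteranceEndMs int  `json:"utterance_end_ms"`
	InterimResults bool `json:"interim_results"`
}

// ConversationalPreset favors fast turn-taking: finalize after 200ms of silence.
func ConversationalPreset() EndpointingConfig {
	return EndpointingConfig{Endpointing: 200, UtteranceEndMs: 1000, InterimResults: true}
}

// DictationPreset tolerates longer pauses mid-utterance.
func DictationPreset() EndpointingConfig {
	return EndpointingConfig{Endpointing: 500, UtteranceEndMs: 2000, InterimResults: true}
}

func main() {
	fmt.Println(ConversationalPreset().Endpointing, DictationPreset().UtteranceEndMs)
}
```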
## Streaming Implementation

Our implementation uses WebSocket streaming for lowest latency:

```go
// Simplified streaming flow
func (d *DeepgramSTT) StreamTranscribe(ctx context.Context, audioChan <-chan []byte) <-chan string {
	resultChan := make(chan string)
	go func() {
		defer close(resultChan)

		// Connect to the Deepgram WebSocket endpoint.
		conn, err := d.connect(ctx)
		if err != nil {
			return
		}
		defer conn.Close()

		// Send audio chunks as binary frames.
		go func() {
			for audio := range audioChan {
				if err := conn.WriteMessage(websocket.BinaryMessage, audio); err != nil {
					return
				}
			}
		}()

		// Receive transcriptions until the connection closes.
		for {
			_, msg, err := conn.ReadMessage()
			if err != nil {
				return
			}
			var result DeepgramResult
			if err := json.Unmarshal(msg, &result); err != nil {
				continue
			}
			if result.IsFinal {
				resultChan <- result.Transcript
			}
		}
	}()
	return resultChan
}
```
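A caller simply ranges over the returned channel until it closes. The sketch below demonstrates that consumption pattern with the channel fed from a slice to simulate the Deepgram stream (no network involved; `consumeTranscripts` is an illustrative name):

```go
package main

import "fmt"

// consumeTranscripts drains a channel of final transcripts, invoking the
// handler for each one, until the producer closes the channel.
func consumeTranscripts(results <-chan string, handle func(string)) {
	for transcript := range results {
		handle(transcript)
	}
}

func main() {
	results := make(chan string)
	go func() {
		defer close(results)
		// Simulated final transcripts standing in for a live stream.
		for _, t := range []string{"hello there", "what is my order status"} {
			results <- t
		}
	}()
	consumeTranscripts(results, func(t string) { fmt.Println("final:", t) })
}
```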
## Interim Results

Interim results allow the LLM to start generating responses before the user finishes speaking:

```text
Timeline: ──────────────────────────────────────────────▶
User speaking: "What is my order status for twelve thirty four"

Interim 1: "What is"
Interim 2: "What is my order"
Interim 3: "What is my order status for"
Final:     "What is my order status for 1234"

LLM starts: ────────▶ [Can start early!]
```
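Routing interim and final messages to different paths can be sketched as below, assuming the simplified message shape from the streaming example (a transcript plus an `is_final` flag; the real Deepgram response nests the transcript deeper, and `route` is an illustrative name):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DeepgramResult matches the simplified shape used in the streaming sketch.
type DeepgramResult struct {
	Transcript string `json:"transcript"`
	IsFinal    bool   `json:"is_final"`
}

// route sends interim results to a speculative path (e.g. warming up the
// LLM) and final results to the authoritative path.
func route(msg []byte, onInterim, onFinal func(string)) error {
	var r DeepgramResult
	if err := json.Unmarshal(msg, &r); err != nil {
		return err
	}
	if r.IsFinal {
		onFinal(r.Transcript)
	} else {
		onInterim(r.Transcript)
	}
	return nil
}

func main() {
	msgs := [][]byte{
		[]byte(`{"transcript":"What is","is_final":false}`),
		[]byte(`{"transcript":"What is my order status for 1234","is_final":true}`),
	}
	for _, m := range msgs {
		route(m,
			func(t string) { fmt.Println("interim:", t) },
			func(t string) { fmt.Println("final:", t) })
	}
}
```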
## Smart Formatting

Enable smart formatting for better transcription quality:

```json
{
  "smart_format": true,
  "punctuate": true,
  "numerals": true
}
```

Before smart format: "one two three four"
After smart format: "1234"
## Language Support

| Language | Code | Model Support |
|---|---|---|
| English | `en`, `en-US`, `en-GB` | All models |
| Spanish | `es` | Nova-2, Nova-3 |
| French | `fr` | Nova-2, Nova-3 |
| German | `de` | Nova-2, Nova-3 |
| Hindi | `hi` | Limited |
## Troubleshooting

### High Word Error Rate

- Check audio quality (sample rate, encoding)
- Enable noise reduction on the client side
- Use the `nova-3` model for best accuracy

### Delayed Transcription

- Reduce the `endpointing` value
- Enable `interim_results`
- Check network latency to Deepgram

### Missing Words

- Check for audio clipping
- Verify VAD isn't cutting off speech
- Increase `utterance_end_ms`
## Pricing
| Plan | Cost | Included |
|---|---|---|
| Pay-as-you-go | $0.0043/min | - |
| Growth | $0.0036/min | Support |
| Enterprise | Custom | SLA, support |
Prices as of 2024. Check Deepgram Pricing for current rates.
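For budgeting, monthly cost is just usage times the per-minute rate. A small sketch using the 2024 rates from the table above (the function name is ours; re-check rates against Deepgram's current pricing):

```go
package main

import "fmt"

// estimateMonthlyCost multiplies monthly usage by the per-minute rate.
func estimateMonthlyCost(minutes, ratePerMin float64) float64 {
	return minutes * ratePerMin
}

func main() {
	const payAsYouGo = 0.0043 // $/min, pay-as-you-go (2024)
	const growth = 0.0036     // $/min, growth plan (2024)
	minutes := 10000.0        // example monthly volume
	fmt.Printf("pay-as-you-go: $%.2f, growth: $%.2f\n",
		estimateMonthlyCost(minutes, payAsYouGo),
		estimateMonthlyCost(minutes, growth))
}
```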
## Next Steps
- Google Chirp - Multi-language alternative
- TTS Providers - Text-to-Speech options
- Latency Optimization - Further improvements