Deepgram STT
Deepgram Nova-3 is our recommended STT provider for voice agents because of its low streaming latency, accuracy, and per-minute cost (see the comparison below).
Why Deepgram?
| Feature | Deepgram Nova-3 | Competitors |
|---|---|---|
| Time to First Partial | ~80ms | 120-300ms |
| Word Error Rate | 8.4% | 10-15% |
| Endpointing | Smart | Basic |
| Streaming | Full support | Varies |
| Cost | $0.0043/min | $0.006-0.016/min |
Configuration
Basic Setup
{
  "agent": {
    "name": "Customer Support",
    "sttProvider": "deepgram",
    "sttModel": "nova-3"
  }
}
Environment Variables
DEEPGRAM_API_KEY=your_deepgram_api_key
Advanced Configuration
{
  "sttProvider": "deepgram",
  "sttModel": "nova-3",
  "sttConfig": {
    "language": "en-US",
    "punctuate": true,
    "profanity_filter": false,
    "diarize": false,
    "smart_format": true,
    "filler_words": false,
    "endpointing": 300,
    "utterance_end_ms": 1000,
    "interim_results": true,
    "vad_events": true
  }
}
Configuration Options
Core Settings
| Parameter | Type | Default | Description |
|---|---|---|---|
| language | string | en-US | Language code (BCP-47) |
| model | string | nova-3 | Model version |
| tier | string | nova | Processing tier |

| Parameter | Type | Default | Description |
|---|---|---|---|
| punctuate | bool | true | Add punctuation |
| smart_format | bool | true | Format numbers, dates, etc. |
| numerals | bool | false | Convert words to digits |
| profanity_filter | bool | false | Censor profanity |
| filler_words | bool | false | Include "um", "uh" |
Endpointing (Critical for Voice)
| Parameter | Type | Default | Description |
|---|---|---|---|
| endpointing | int | 300 | Silence (ms) to trigger is_final |
| utterance_end_ms | int | 1000 | Max wait (ms) for speech completion |
| interim_results | bool | true | Stream partial results |
| vad_events | bool | true | Emit VAD start/stop events |
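The defaults in the tables above can be applied in code before any per-agent overrides. The following is a minimal sketch, assuming the `sttConfig` object is decoded with Go's `encoding/json`; the `STTConfig` struct and both function names are hypothetical and not part of any SDK.

```go
package main

import "encoding/json"

// STTConfig mirrors a subset of the sttConfig keys documented above.
// The struct itself is illustrative; only the JSON keys come from this page.
type STTConfig struct {
	Language       string `json:"language"`
	Punctuate      bool   `json:"punctuate"`
	SmartFormat    bool   `json:"smart_format"`
	Endpointing    int    `json:"endpointing"`      // milliseconds
	UtteranceEndMs int    `json:"utterance_end_ms"` // milliseconds
	InterimResults bool   `json:"interim_results"`
	VADEvents      bool   `json:"vad_events"`
}

// DefaultSTTConfig returns the default values listed in the tables.
func DefaultSTTConfig() STTConfig {
	return STTConfig{
		Language:       "en-US",
		Punctuate:      true,
		SmartFormat:    true,
		Endpointing:    300,
		UtteranceEndMs: 1000,
		InterimResults: true,
		VADEvents:      true,
	}
}

// LoadSTTConfig overlays a JSON sttConfig object on the defaults.
// json.Unmarshal only writes fields present in the input, so keys
// missing from raw keep their default values.
func LoadSTTConfig(raw []byte) (STTConfig, error) {
	cfg := DefaultSTTConfig()
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return STTConfig{}, err
	}
	return cfg, nil
}
```

Calling `LoadSTTConfig([]byte(`{"endpointing": 500}`))` would change only the endpointing threshold and leave the other fields at their defaults.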
Endpointing Tuning
Endpointing determines when the user has finished speaking:
endpointing = 300ms (default)
─────────────────────────────────────────────────────────────
User: "What is my order status" [300ms silence] → is_final
Good for: Normal conversation pace

endpointing = 150ms (aggressive)
─────────────────────────────────────────────────────────────
User: "What is my order" [150ms] → is_final (too early!)
User: " status" ← This gets cut off
Risk: Cutting off slow speakers

endpointing = 500ms (conservative)
─────────────────────────────────────────────────────────────
User: "What is my order status" [500ms silence] → is_final
Trade-off: Higher latency, but won't cut off
Good for: Elderly users, complex queries
Per-Agent Endpointing
Configure based on use case:
// Fast-paced customer service
{
  "sttConfig": {
    "endpointing": 250,
    "utterance_end_ms": 800
  }
}

// Elderly or accessibility-focused
{
  "sttConfig": {
    "endpointing": 500,
    "utterance_end_ms": 1500
  }
}

// Dictation or complex input
{
  "sttConfig": {
    "endpointing": 700,
    "utterance_end_ms": 2000
  }
}
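If agents are created programmatically, the three presets above could be kept in a lookup table. This is a sketch only: the profile keys (`fast`, `accessible`, `dictation`), the `EndpointingProfile` type, and `ProfileFor` are hypothetical names chosen for illustration, while the millisecond values are the ones shown above.

```go
package main

// EndpointingProfile groups the two silence thresholds tuned per agent.
type EndpointingProfile struct {
	Endpointing    int // silence in ms before a segment is finalized
	UtteranceEndMs int // wait in ms before the utterance is considered complete
}

// endpointingPresets maps a use case to the values from the examples above.
var endpointingPresets = map[string]EndpointingProfile{
	"fast":       {Endpointing: 250, UtteranceEndMs: 800},
	"accessible": {Endpointing: 500, UtteranceEndMs: 1500},
	"dictation":  {Endpointing: 700, UtteranceEndMs: 2000},
}

// ProfileFor returns the preset for a use case, falling back to the
// documented defaults (300 ms / 1000 ms) for unknown keys.
func ProfileFor(useCase string) EndpointingProfile {
	if p, ok := endpointingPresets[useCase]; ok {
		return p
	}
	return EndpointingProfile{Endpointing: 300, UtteranceEndMs: 1000}
}
```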
Implementation
WebSocket Connection
import (
	"context"
	"fmt"
	"net/http"
	"net/url"
	"strconv"

	"github.com/gorilla/websocket" // assumed client; its Dialer matches the calls below
)

// DeepgramSTT holds one streaming connection to Deepgram.
type DeepgramSTT struct {
	conn      *websocket.Conn
	apiKey    string
	config    DeepgramConfig
	eventChan chan TranscriptEvent
}

// Connect opens the streaming WebSocket and starts the receive loop.
func (d *DeepgramSTT) Connect(ctx context.Context) error {
	// Build WebSocket URL with parameters
	params := url.Values{}
	params.Set("model", d.config.Model)
	params.Set("language", d.config.Language)
	params.Set("punctuate", strconv.FormatBool(d.config.Punctuate))
	params.Set("endpointing", strconv.Itoa(d.config.Endpointing))
	params.Set("interim_results", "true")
	params.Set("vad_events", "true")
	params.Set("encoding", "linear16") // 16-bit PCM
	params.Set("sample_rate", "8000")  // must match the audio actually sent
	params.Set("channels", "1")
	wsURL := fmt.Sprintf("wss://api.deepgram.com/v1/listen?%s", params.Encode())

	// Authenticate with the API key from DEEPGRAM_API_KEY
	headers := http.Header{}
	headers.Set("Authorization", "Token "+d.apiKey)

	conn, _, err := websocket.DefaultDialer.DialContext(ctx, wsURL, headers)
	if err != nil {
		return fmt.Errorf("deepgram connect: %w", err)
	}
	d.conn = conn

	// Read transcripts in the background for the life of the connection
	go d.receiveLoop()
	return nil
}
Sending Audio
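// SendAudio writes one chunk of raw audio as a binary WebSocket frame.
func (d *DeepgramSTT) SendAudio(audio []byte) error {
	return d.conn.WriteMessage(websocket.BinaryMessage, audio)
}

// In the audio processing pipeline
func processAudio(audioChunk []byte) {
	// Convert μ-law to Linear16 if needed
	linear := mulawToLinear16(audioChunk)
	// Send to Deepgram; report write failures instead of dropping them silently
	if err := stt.SendAudio(linear); err != nil {
		log.Error(fmt.Sprintf("Deepgram: send audio failed: %v", err))
	}
}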
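The `mulawToLinear16` helper called above is not defined on this page. A minimal sketch of one possible implementation follows, using the standard G.711 μ-law expansion and emitting little-endian 16-bit samples; the `decodeMulaw` helper name is an assumption.

```go
package main

// decodeMulaw expands one G.711 μ-law byte into a signed 16-bit sample.
func decodeMulaw(u byte) int16 {
	u = ^u // μ-law bytes are stored bit-inverted
	sign := u & 0x80
	exponent := (u >> 4) & 0x07
	mantissa := u & 0x0F
	// Rebuild the magnitude: add the bias (0x84), scale by the segment, remove the bias.
	sample := ((int(mantissa) << 3) + 0x84) << exponent
	sample -= 0x84
	if sign != 0 {
		return int16(-sample)
	}
	return int16(sample)
}

// mulawToLinear16 converts a μ-law buffer to 16-bit little-endian PCM.
// The output is twice as long as the input (two bytes per sample).
func mulawToLinear16(in []byte) []byte {
	out := make([]byte, 2*len(in))
	for i, b := range in {
		s := uint16(decodeMulaw(b))
		out[2*i] = byte(s)        // low byte first
		out[2*i+1] = byte(s >> 8) // then high byte
	}
	return out
}
```

The conversion doubles the payload size, so an 8 kHz μ-law stream at 8,000 bytes per second becomes 16,000 bytes per second of Linear16.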
Receiving Transcripts
// receiveLoop reads messages until the connection fails or closes.
func (d *DeepgramSTT) receiveLoop() {
	for {
		_, msg, err := d.conn.ReadMessage()
		if err != nil {
			// Route close codes and network failures through handleError
			d.handleError(err)
			return
		}
		var response DeepgramResponse
		if err := json.Unmarshal(msg, &response); err != nil {
			// Skip malformed frames rather than acting on a zero-value response
			continue
		}
		// Handle different message types
		switch response.Type {
		case "Results":
			d.handleResults(response)
		case "SpeechStarted":
			d.eventChan <- TranscriptEvent{Type: EventSpeechStart}
		case "UtteranceEnd":
			d.eventChan <- TranscriptEvent{Type: EventUtteranceEnd}
		}
	}
}

// handleResults forwards the top alternative of a Results message.
func (d *DeepgramSTT) handleResults(resp DeepgramResponse) {
	if len(resp.Channel.Alternatives) == 0 {
		return
	}
	alt := resp.Channel.Alternatives[0]
	d.eventChan <- TranscriptEvent{
		Text:       alt.Transcript,
		IsFinal:    resp.IsFinal,
		Confidence: alt.Confidence,
		Words:      alt.Words,
	}
}
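The `DeepgramResponse` type used above is not shown on this page. The sketch below is one way it might be declared so that `json.Unmarshal` fills the fields the handlers read. The JSON keys reflect Deepgram's streaming response as commonly documented and should be checked against the current API reference; `parseResponse` and the Go type names are assumptions.

```go
package main

import "encoding/json"

// Word is one recognized word with timing, in seconds from stream start.
type Word struct {
	Word       string  `json:"word"`
	Start      float64 `json:"start"`
	End        float64 `json:"end"`
	Confidence float64 `json:"confidence"`
}

// Alternative is one candidate transcript for a segment.
type Alternative struct {
	Transcript string  `json:"transcript"`
	Confidence float64 `json:"confidence"`
	Words      []Word  `json:"words"`
}

// DeepgramResponse covers only the fields used by the handlers above.
type DeepgramResponse struct {
	Type        string `json:"type"`         // "Results", "SpeechStarted", "UtteranceEnd"
	IsFinal     bool   `json:"is_final"`     // transcript for this segment will not change
	SpeechFinal bool   `json:"speech_final"` // endpointing detected the end of speech
	Channel     struct {
		Alternatives []Alternative `json:"alternatives"`
	} `json:"channel"`
}

// parseResponse decodes one text frame received from the WebSocket.
func parseResponse(msg []byte) (DeepgramResponse, error) {
	var resp DeepgramResponse
	err := json.Unmarshal(msg, &resp)
	return resp, err
}
```

Unknown keys in the message are ignored by `encoding/json`, so the struct can stay small and grow only as more fields are needed.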
Model Comparison
| Model | Speed | Accuracy | Cost | Use Case |
|---|---|---|---|---|
| nova-3 | ⚡⚡⚡ | ⭐⭐⭐⭐⭐ | $0.0043/min | Production (recommended) |
| nova-2 | ⚡⚡⚡ | ⭐⭐⭐⭐ | $0.0043/min | Legacy support |
| enhanced | ⚡⚡ | ⭐⭐⭐⭐ | $0.0145/min | Phone audio |
| base | ⚡⚡⚡ | ⭐⭐⭐ | $0.0125/min | Cost-sensitive |
Language Support
Tier 1 (Excellent)
| Language | Code | Accuracy |
|---|---|---|
| English (US) | en-US | ⭐⭐⭐⭐⭐ |
| English (UK) | en-GB | ⭐⭐⭐⭐⭐ |
| English (AU) | en-AU | ⭐⭐⭐⭐⭐ |
| Spanish | es | ⭐⭐⭐⭐⭐ |
| French | fr | ⭐⭐⭐⭐⭐ |
| German | de | ⭐⭐⭐⭐⭐ |
| Portuguese | pt | ⭐⭐⭐⭐⭐ |
Tier 2 (Good)
| Language | Code | Accuracy |
|---|---|---|
| Hindi | hi | ⭐⭐⭐ |
| Japanese | ja | ⭐⭐⭐⭐ |
| Korean | ko | ⭐⭐⭐⭐ |
| Chinese | zh | ⭐⭐⭐⭐ |
| Dutch | nl | ⭐⭐⭐⭐ |
| Italian | it | ⭐⭐⭐⭐ |
Custom Vocabulary
Add domain-specific terms for better accuracy:
{
  "sttConfig": {
    "keywords": [
      "Edesy:2",
      "voice agent:2",
      "STT:1.5",
      "TTS:1.5"
    ]
  }
}
The number after the colon is a boost factor (0.0-3.0). Higher values make Deepgram more likely to recognize that term.
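Before the keyword list reaches the connection URL, each entry can be checked against the `term:boost` format and the 0.0-3.0 range described above. The sketch below assumes each entry is passed as a repeated `keywords` query parameter; that mapping and the `keywordParams` function are assumptions for illustration, not a documented part of this platform.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

// keywordParams validates "term:boost" entries and returns them as
// repeated "keywords" query parameters.
func keywordParams(keywords []string) (url.Values, error) {
	params := url.Values{}
	for _, kw := range keywords {
		// Split on the last colon so terms may themselves contain colons.
		i := strings.LastIndex(kw, ":")
		if i <= 0 {
			return nil, fmt.Errorf("keyword %q: expected term:boost", kw)
		}
		boost, err := strconv.ParseFloat(kw[i+1:], 64)
		if err != nil {
			return nil, fmt.Errorf("keyword %q: boost is not a number", kw)
		}
		if boost < 0 || boost > 3 {
			return nil, fmt.Errorf("keyword %q: boost %.1f outside 0.0-3.0", kw, boost)
		}
		params.Add("keywords", kw)
	}
	return params, nil
}
```

Rejecting bad entries at configuration time keeps a typo such as `"STT:15"` from silently skewing recognition at runtime.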
Error Handling
func (d *DeepgramSTT) handleError(err error) {
	var wsErr *websocket.CloseError
	if errors.As(err, &wsErr) {
		switch wsErr.Code {
		case 1008: // Policy Violation
			log.Error("Deepgram: Invalid API key or quota exceeded")
			// Switch to fallback provider
		case 1011: // Internal Error
			log.Error("Deepgram: Server error, reconnecting...")
			d.reconnect()
		}
	}
}
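The close-code handling above can be separated from its side effects so the decision is easy to test. This is a sketch under stated assumptions: the `closeAction` type and `classifyClose` are hypothetical, the 1008 and 1011 mappings follow the handler above, and the treatment of 1000, 1001, 1006, and unknown codes is an illustrative choice rather than documented behavior.

```go
package main

// closeAction is what the caller should do after the socket closes.
type closeAction int

const (
	actionIgnore    closeAction = iota // expected shutdown, nothing to do
	actionReconnect                    // transient failure, dial again
	actionFallback                     // credentials or quota problem, switch provider
)

// classifyClose maps an RFC 6455 close code to an action.
func classifyClose(code int) closeAction {
	switch code {
	case 1000, 1001: // normal closure, going away
		return actionIgnore
	case 1008: // policy violation: invalid API key or quota exceeded
		return actionFallback
	case 1006, 1011: // abnormal closure, internal server error
		return actionReconnect
	default:
		// Assumption: unknown codes are treated as transient.
		return actionReconnect
	}
}
```

A handler built on this would call `classifyClose(wsErr.Code)` and then log, reconnect, or switch providers based on the result.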
Best Practices
1. Use Interim Results for UX
// Show "thinking" indicator during speech
for event := range stt.Events() {
	if !event.IsFinal && len(event.Text) > 0 {
		ui.ShowTypingIndicator()
	}
}
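Interim results drive the UI, but downstream logic usually wants one complete utterance. The sketch below joins finalized segments and emits them when an utterance-end event arrives. The types are redeclared locally so the block stands alone, and it assumes transcript events carry the zero-value event type (named `EventTranscript` here), which this page does not confirm.

```go
package main

import "strings"

// EventType distinguishes transcript events from VAD events.
type EventType int

const (
	EventTranscript EventType = iota // assumed zero value for transcript events
	EventSpeechStart
	EventUtteranceEnd
)

// TranscriptEvent is a trimmed copy of the event used elsewhere on this page.
type TranscriptEvent struct {
	Type    EventType
	Text    string
	IsFinal bool
}

// collectUtterances ignores interim results, buffers finalized segments,
// and sends one joined string per utterance. It closes out when events closes.
func collectUtterances(events <-chan TranscriptEvent, out chan<- string) {
	defer close(out)
	var parts []string
	flush := func() {
		if len(parts) > 0 {
			out <- strings.Join(parts, " ")
			parts = parts[:0]
		}
	}
	for ev := range events {
		switch {
		case ev.Type == EventUtteranceEnd:
			flush()
		case ev.Type == EventTranscript && ev.IsFinal && ev.Text != "":
			parts = append(parts, ev.Text)
		}
	}
	flush() // emit anything still buffered when the stream ends
}
```

In a running agent this would typically be started with `go collectUtterances(stt.Events(), utterances)` and the `utterances` channel fed to the next stage of the pipeline.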
2. Handle Network Issues
// Automatic reconnection with backoff
func (d *DeepgramSTT) reconnect() {
	backoff := 100 * time.Millisecond
	maxBackoff := 5 * time.Second
	for {
		err := d.Connect(context.Background())
		if err == nil {
			return
		}
		time.Sleep(backoff)
		backoff = min(backoff*2, maxBackoff)
	}
}
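The loop above retries forever and cannot be interrupted when a call ends. A variant that bounds the number of attempts and stops when a context is cancelled might look like the following sketch; `nextBackoff` and `retryWithBackoff` are hypothetical helpers, and the attempt limit is an added assumption rather than something this page specifies.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// nextBackoff doubles the delay and caps it at limit.
func nextBackoff(cur, limit time.Duration) time.Duration {
	if cur*2 > limit {
		return limit
	}
	return cur * 2
}

// retryWithBackoff calls connect until it succeeds, attempts run out,
// or ctx is cancelled. It returns the last connect error on exhaustion.
func retryWithBackoff(ctx context.Context, initial, limit time.Duration,
	attempts int, connect func(context.Context) error) error {

	backoff := initial
	var lastErr error
	for i := 0; i < attempts; i++ {
		if lastErr = connect(ctx); lastErr == nil {
			return nil
		}
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // caller hung up or shut down
		case <-time.After(backoff):
		}
		backoff = nextBackoff(backoff, limit)
	}
	if lastErr == nil {
		lastErr = errors.New("retry: attempts must be at least 1")
	}
	return lastErr
}
```

With the values used above, `retryWithBackoff(ctx, 100*time.Millisecond, 5*time.Second, 8, d.Connect)` would wait 100 ms, 200 ms, 400 ms, and so on, never more than 5 s between attempts.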
3. Track Key Metrics
// Record latency, throughput, and confidence for every transcript
metrics.RecordHistogram("stt.deepgram.latency_ms", latency.Milliseconds())
metrics.RecordCounter("stt.deepgram.transcripts_total", 1)
metrics.RecordHistogram("stt.deepgram.confidence", confidence)
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| High latency | Wrong endpoint region | Use nearest regional endpoint |
| Poor accuracy | Wrong language code | Verify BCP-47 language code |
| No interim results | Parameter not set | Add interim_results=true |
| Cut-off speech | Endpointing too aggressive | Increase endpointing value |
| Missing words | Audio too quiet | Check audio levels, add volume normalization |
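For the "Audio too quiet" case, one way to check levels is to measure the RMS amplitude of each Linear16 chunk and derive a capped gain. The sketch below assumes 16-bit little-endian PCM; `rmsLevel`, `gainFor`, and any target level passed to them are illustrative choices, not values recommended by Deepgram.

```go
package main

import (
	"encoding/binary"
	"math"
)

// rmsLevel returns the RMS amplitude of 16-bit little-endian PCM,
// normalized so that full scale is 1.0. Empty input returns 0.
func rmsLevel(pcm []byte) float64 {
	n := len(pcm) / 2
	if n == 0 {
		return 0
	}
	var sum float64
	for i := 0; i < n; i++ {
		s := int16(binary.LittleEndian.Uint16(pcm[2*i:]))
		v := float64(s) / 32768
		sum += v * v
	}
	return math.Sqrt(sum / float64(n))
}

// gainFor returns the multiplier that would bring rms to target,
// capped at maxGain so near-silence is not amplified into noise.
func gainFor(rms, target, maxGain float64) float64 {
	if rms <= 0 {
		return 1 // silence: leave the signal untouched
	}
	g := target / rms
	if g > maxGain {
		return maxGain
	}
	return g
}
```

Logging `rmsLevel` per chunk is often enough to confirm whether missing words correlate with low input levels before any normalization is added.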