Deepgram STT

Deepgram Nova-3 is our recommended STT provider for voice agents because of its low streaming latency and its accuracy; the table below compares it with other providers.

Why Deepgram?

Feature	Deepgram Nova-3	Competitors
Time to First Partial	~80ms	120-300ms
Word Error Rate	8.4%	10-15%
Endpointing	Smart	Basic
Streaming	Full support	Varies
Cost	$0.0043/min	$0.006-0.016/min

Configuration

Basic Setup

{
  "agent": {
    "name": "Customer Support",
    "sttProvider": "deepgram",
    "sttModel": "nova-3"
  }
}

Environment Variables

DEEPGRAM_API_KEY=your_deepgram_api_key

Advanced Configuration

{
  "sttProvider": "deepgram",
  "sttModel": "nova-3",
  "sttConfig": {
    "language": "en-US",
    "punctuate": true,
    "profanity_filter": false,
    "diarize": false,
    "smart_format": true,
    "filler_words": false,
    "endpointing": 300,
    "utterance_end_ms": 1000,
    "interim_results": true,
    "vad_events": true
  }
}

Configuration Options

Core Settings

Parameter	Type	Default	Description
`language`	string	`en-US`	Language code (BCP-47)
`model`	string	`nova-3`	Model version
`tier`	string	`nova`	Processing tier

Formatting

Parameter	Type	Default	Description
`punctuate`	bool	true	Add punctuation
`smart_format`	bool	true	Format numbers, dates, etc.
`numerals`	bool	false	Convert words to digits
`profanity_filter`	bool	false	Censor profanity
`filler_words`	bool	false	Include "um", "uh"

Endpointing (Critical for Voice)

Parameter	Type	Default	Description
`endpointing`	int	300	Silence (ms) to trigger is_final
`utterance_end_ms`	int	1000	Gap (ms) after the last transcribed word before an `UtteranceEnd` event is sent; requires `interim_results`
`interim_results`	bool	true	Stream partial results
`vad_events`	bool	true	Emit VAD start/stop events
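
The options in the tables above could be mirrored in a Go struct that is loaded from the `sttConfig` JSON. The following is a minimal sketch, not the platform's actual type: `Model`, `Language`, `Punctuate` and `Endpointing` match the fields used by `Connect` later on this page, while the remaining field names and the helpers `DefaultDeepgramConfig` and `ParseSTTConfig` are assumptions made for illustration.

```go
package main

import "encoding/json"

// DeepgramConfig mirrors the sttConfig keys documented above.
// Field names beyond Model, Language, Punctuate and Endpointing are assumptions.
type DeepgramConfig struct {
	Model           string `json:"model"`
	Language        string `json:"language"`
	Punctuate       bool   `json:"punctuate"`
	SmartFormat     bool   `json:"smart_format"`
	Numerals        bool   `json:"numerals"`
	ProfanityFilter bool   `json:"profanity_filter"`
	FillerWords     bool   `json:"filler_words"`
	Endpointing     int    `json:"endpointing"`      // ms
	UtteranceEndMs  int    `json:"utterance_end_ms"` // ms
	InterimResults  bool   `json:"interim_results"`
	VADEvents       bool   `json:"vad_events"`
}

// DefaultDeepgramConfig returns the defaults listed in the tables above.
func DefaultDeepgramConfig() DeepgramConfig {
	return DeepgramConfig{
		Model:          "nova-3",
		Language:       "en-US",
		Punctuate:      true,
		SmartFormat:    true,
		Endpointing:    300,
		UtteranceEndMs: 1000,
		InterimResults: true,
		VADEvents:      true,
	}
}

// ParseSTTConfig overlays a JSON sttConfig object on the defaults,
// so keys missing from the JSON keep their default values.
func ParseSTTConfig(raw []byte) (DeepgramConfig, error) {
	cfg := DefaultDeepgramConfig()
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return DeepgramConfig{}, err
	}
	return cfg, nil
}
```

Starting from the defaults and unmarshalling on top of them means an agent only has to specify the options it wants to change.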

Endpointing Tuning

Endpointing determines when the user has finished speaking:

endpointing = 300ms (default)
─────────────────────────────────────────────────────────────
User: "What is my order status" [300ms silence] → is_final
Good for: Normal conversation pace

endpointing = 150ms (aggressive)
─────────────────────────────────────────────────────────────
User: "What is my order" [150ms] → is_final (too early!)
User: " status" ← This gets cut off
Risk: Cutting off slow speakers

endpointing = 500ms (conservative)
─────────────────────────────────────────────────────────────
User: "What is my order status" [500ms silence] → is_final
Trade-off: Higher latency, but won't cut off
Good for: Elderly users, complex queries
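
The trade-off above can be reproduced offline with a small simulation. This sketch is not how Deepgram detects endpoints internally; it only groups words by the silence between them so that the effect of a threshold is easy to see. The `timedWord` type and `splitOnSilence` function are hypothetical names introduced for this example.

```go
package main

import "strings"

// timedWord is a hypothetical word timing used only for this illustration.
type timedWord struct {
	Text    string
	StartMs int
	EndMs   int
}

// splitOnSilence groups words into utterances, closing an utterance whenever
// the gap before the next word is at least endpointingMs.
func splitOnSilence(words []timedWord, endpointingMs int) []string {
	var utterances []string
	var current []string
	for i, w := range words {
		current = append(current, w.Text)
		last := i == len(words)-1
		if last || words[i+1].StartMs-w.EndMs >= endpointingMs {
			utterances = append(utterances, strings.Join(current, " "))
			current = nil
		}
	}
	return utterances
}
```

For a speaker who pauses 200 ms before the last word of "What is my order status", a 150 ms threshold splits the sentence in two, while 300 ms keeps it whole.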

Per-Agent Endpointing

Configure based on use case:

// Fast-paced customer service
{
  "sttConfig": {
    "endpointing": 250,
    "utterance_end_ms": 800
  }
}

// Elderly or accessibility-focused
{
  "sttConfig": {
    "endpointing": 500,
    "utterance_end_ms": 1500
  }
}

// Dictation or complex input
{
  "sttConfig": {
    "endpointing": 700,
    "utterance_end_ms": 2000
  }
}
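
If agents are created programmatically, the three presets above could live in a lookup table. This is a sketch under assumptions: the use-case keys (`customer_service`, `accessibility`, `dictation`) and the names `endpointingProfile` and `profileFor` are hypothetical; only the millisecond values come from the examples above.

```go
package main

// endpointingProfile holds the two endpointing values from the presets above.
type endpointingProfile struct {
	EndpointingMs  int
	UtteranceEndMs int
}

// endpointingProfiles maps a hypothetical use-case key to its preset.
var endpointingProfiles = map[string]endpointingProfile{
	"customer_service": {EndpointingMs: 250, UtteranceEndMs: 800},
	"accessibility":    {EndpointingMs: 500, UtteranceEndMs: 1500},
	"dictation":        {EndpointingMs: 700, UtteranceEndMs: 2000},
}

// profileFor returns the preset for a use case, falling back to the
// documented defaults (300 ms / 1000 ms) for unknown keys.
func profileFor(useCase string) endpointingProfile {
	if p, ok := endpointingProfiles[useCase]; ok {
		return p
	}
	return endpointingProfile{EndpointingMs: 300, UtteranceEndMs: 1000}
}
```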

Implementation

WebSocket Connection

type DeepgramSTT struct {
    conn      *websocket.Conn
    apiKey    string
    config    DeepgramConfig
    eventChan chan TranscriptEvent
}

func (d *DeepgramSTT) Connect(ctx context.Context) error {
    // Build WebSocket URL with parameters
    params := url.Values{}
    params.Set("model", d.config.Model)
    params.Set("language", d.config.Language)
    params.Set("punctuate", strconv.FormatBool(d.config.Punctuate))
    params.Set("endpointing", strconv.Itoa(d.config.Endpointing))
    params.Set("interim_results", "true")
    params.Set("vad_events", "true")
    params.Set("encoding", "linear16")
    params.Set("sample_rate", "8000")
    params.Set("channels", "1")

    wsURL := fmt.Sprintf("wss://api.deepgram.com/v1/listen?%s", params.Encode())

    headers := http.Header{}
    headers.Set("Authorization", "Token "+d.apiKey)

    conn, _, err := websocket.DefaultDialer.DialContext(ctx, wsURL, headers)
    if err != nil {
        return fmt.Errorf("deepgram connect: %w", err)
    }

    d.conn = conn
    go d.receiveLoop()

    return nil
}
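
`Connect` above hardcodes several parameters and does not forward `utterance_end_ms` or `smart_format`. One way to make the URL construction testable is to pull it into a pure function. The sketch below assumes a hypothetical `listenParams` struct and `buildListenURL` helper; the query parameter names are the ones already used on this page.

```go
package main

import (
	"net/url"
	"strconv"
)

// listenParams is a hypothetical subset of the config used for this sketch.
type listenParams struct {
	Model          string
	Language       string
	Punctuate      bool
	SmartFormat    bool
	InterimResults bool
	VADEvents      bool
	Endpointing    int // ms
	UtteranceEndMs int // ms, omitted from the URL when zero
	SampleRate     int // Hz
}

// buildListenURL assembles the streaming URL from the config instead of
// hardcoding values. url.Values.Encode sorts keys, so the output is stable.
func buildListenURL(p listenParams) string {
	q := url.Values{}
	q.Set("model", p.Model)
	q.Set("language", p.Language)
	q.Set("punctuate", strconv.FormatBool(p.Punctuate))
	q.Set("smart_format", strconv.FormatBool(p.SmartFormat))
	q.Set("interim_results", strconv.FormatBool(p.InterimResults))
	q.Set("vad_events", strconv.FormatBool(p.VADEvents))
	q.Set("endpointing", strconv.Itoa(p.Endpointing))
	if p.UtteranceEndMs > 0 {
		q.Set("utterance_end_ms", strconv.Itoa(p.UtteranceEndMs))
	}
	q.Set("encoding", "linear16")
	q.Set("sample_rate", strconv.Itoa(p.SampleRate))
	q.Set("channels", "1")
	return "wss://api.deepgram.com/v1/listen?" + q.Encode()
}
```

Because the function has no side effects, a unit test can compare its output against a known URL without opening a connection.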

Sending Audio

func (d *DeepgramSTT) SendAudio(audio []byte) error {
    return d.conn.WriteMessage(websocket.BinaryMessage, audio)
}

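// In the audio processing pipeline
func processAudio(audioChunk []byte) error {
    // Convert μ-law to Linear16 if needed
    linear := mulawToLinear16(audioChunk)

    // Send to Deepgram and surface write failures to the caller
    if err := stt.SendAudio(linear); err != nil {
        return fmt.Errorf("deepgram send audio: %w", err)
    }
    return nil
}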
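
`mulawToLinear16` is referenced above but not shown. A minimal sketch of one possible implementation follows, using the standard G.711 μ-law expansion and writing little-endian signed 16-bit samples to match the `linear16` encoding requested in `Connect`. The helper name `mulawDecode` is an assumption introduced here.

```go
package main

import "encoding/binary"

// mulawDecode expands one G.711 μ-law byte to a signed 16-bit PCM sample.
func mulawDecode(u byte) int16 {
	u = ^u // μ-law bytes are stored bit-inverted
	sign := u & 0x80
	exponent := (u >> 4) & 0x07
	mantissa := u & 0x0F

	// Rebuild the magnitude: add the bias (0x84), shift by the segment, remove the bias.
	sample := ((int(mantissa) << 3) + 0x84) << exponent
	sample -= 0x84

	if sign != 0 {
		return int16(-sample)
	}
	return int16(sample)
}

// mulawToLinear16 converts a μ-law buffer to little-endian 16-bit PCM.
// The output is twice as long as the input.
func mulawToLinear16(mulaw []byte) []byte {
	out := make([]byte, 2*len(mulaw))
	for i, b := range mulaw {
		binary.LittleEndian.PutUint16(out[2*i:], uint16(mulawDecode(b)))
	}
	return out
}
```

For a high-volume pipeline, a precomputed 256-entry lookup table would avoid repeating the arithmetic for every byte.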

Receiving Transcripts

func (d *DeepgramSTT) receiveLoop() {
    for {
        _, msg, err := d.conn.ReadMessage()
        if err != nil {
            // Connection closed or failed; let handleError decide what to do
            d.handleError(err)
            return
        }

        var response DeepgramResponse
        if err := json.Unmarshal(msg, &response); err != nil {
            // Skip malformed messages instead of acting on a zero-value response
            continue
        }

        // Handle different message types
        switch response.Type {
        case "Results":
            d.handleResults(response)
        case "SpeechStarted":
            d.eventChan <- TranscriptEvent{Type: EventSpeechStart}
        case "UtteranceEnd":
            d.eventChan <- TranscriptEvent{Type: EventUtteranceEnd}
        }
    }
}

func (d *DeepgramSTT) handleResults(resp DeepgramResponse) {
    if len(resp.Channel.Alternatives) == 0 {
        return
    }

    alt := resp.Channel.Alternatives[0]

    d.eventChan <- TranscriptEvent{
        Text:       alt.Transcript,
        IsFinal:    resp.IsFinal,
        Confidence: alt.Confidence,
        Words:      alt.Words,
    }
}
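
The handlers above rely on a `DeepgramResponse` type that is not shown. A plausible shape, inferred from the fields the handlers read (`Type`, `IsFinal`, `Channel.Alternatives`, `Transcript`, `Confidence`, `Words`), is sketched below. The JSON tags, the `Word` fields, the `SpeechFinal` field and the `parseResponse` helper are assumptions and should be checked against the messages actually received.

```go
package main

import "encoding/json"

// Word is one recognized word with its timing (seconds) and confidence.
type Word struct {
	Word       string  `json:"word"`
	Start      float64 `json:"start"`
	End        float64 `json:"end"`
	Confidence float64 `json:"confidence"`
}

// Alternative is one transcription hypothesis.
type Alternative struct {
	Transcript string  `json:"transcript"`
	Confidence float64 `json:"confidence"`
	Words      []Word  `json:"words"`
}

// DeepgramResponse covers the fields used by receiveLoop and handleResults.
// The layout is an assumption for illustration.
type DeepgramResponse struct {
	Type        string `json:"type"`
	IsFinal     bool   `json:"is_final"`
	SpeechFinal bool   `json:"speech_final"`
	Channel     struct {
		Alternatives []Alternative `json:"alternatives"`
	} `json:"channel"`
}

// parseResponse decodes one WebSocket text message.
func parseResponse(msg []byte) (DeepgramResponse, error) {
	var r DeepgramResponse
	err := json.Unmarshal(msg, &r)
	return r, err
}
```

Messages such as `SpeechStarted` carry no `channel` object, so `Alternatives` stays empty for them and the length check in `handleResults` remains necessary.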

Model Comparison

Model	Speed	Accuracy	Cost	Use Case
nova-3	⚡⚡⚡	⭐⭐⭐⭐⭐	$0.0043/min	Production (recommended)
nova-2	⚡⚡⚡	⭐⭐⭐⭐	$0.0043/min	Legacy support
enhanced	⚡⚡	⭐⭐⭐⭐	$0.0145/min	Phone audio
base	⚡⚡⚡	⭐⭐⭐	$0.0125/min	Cost-sensitive

Language Support

Tier 1 (Excellent)

Language	Code	Accuracy
English (US)	en-US	⭐⭐⭐⭐⭐
English (UK)	en-GB	⭐⭐⭐⭐⭐
English (AU)	en-AU	⭐⭐⭐⭐⭐
Spanish	es	⭐⭐⭐⭐⭐
French	fr	⭐⭐⭐⭐⭐
German	de	⭐⭐⭐⭐⭐
Portuguese	pt	⭐⭐⭐⭐⭐

Tier 2 (Good)

Language	Code	Accuracy
Hindi	hi	⭐⭐⭐
Japanese	ja	⭐⭐⭐⭐
Korean	ko	⭐⭐⭐⭐
Chinese	zh	⭐⭐⭐⭐
Dutch	nl	⭐⭐⭐⭐
Italian	it	⭐⭐⭐⭐

Custom Vocabulary

Add domain-specific terms for better accuracy:

{
  "sttConfig": {
    "keywords": [
      "Edesy:2",
      "voice agent:2",
      "STT:1.5",
      "TTS:1.5"
    ]
  }
}

The number after the colon is a boost factor (0.0-3.0). Higher values make Deepgram more likely to recognize that term.
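
Validating keyword entries before they reach the provider catches typos early. The sketch below parses the `term:boost` format and enforces the 0.0-3.0 range stated above; the `keyword` type and `parseKeyword` function are hypothetical helpers, not part of any SDK.

```go
package main

import (
	"errors"
	"strconv"
	"strings"
)

// keyword is a parsed "term:boost" entry.
type keyword struct {
	Term  string
	Boost float64
}

// parseKeyword splits on the last colon so that terms may contain spaces,
// then checks the boost against the documented 0.0-3.0 range.
func parseKeyword(s string) (keyword, error) {
	i := strings.LastIndex(s, ":")
	if i <= 0 {
		return keyword{}, errors.New("keyword must look like term:boost, got " + strconv.Quote(s))
	}
	boost, err := strconv.ParseFloat(s[i+1:], 64)
	if err != nil {
		return keyword{}, errors.New("invalid boost in " + strconv.Quote(s))
	}
	if boost < 0 || boost > 3 {
		return keyword{}, errors.New("boost out of range 0.0-3.0 in " + strconv.Quote(s))
	}
	return keyword{Term: s[:i], Boost: boost}, nil
}
```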

Error Handling

func (d *DeepgramSTT) handleError(err error) {
    var wsErr *websocket.CloseError
    if errors.As(err, &wsErr) {
        switch wsErr.Code {
        case 1008: // Policy Violation
            log.Error("Deepgram: Invalid API key or quota exceeded")
            // Switch to fallback provider
        case 1011: // Internal Error
            log.Error("Deepgram: Server error, reconnecting...")
            d.reconnect()
        }
    }
}
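
The decision in `handleError` can also be expressed as a pure mapping from close code to action, which is easier to test than a method with side effects. This is a sketch: the `closeAction` type and its values are hypothetical, the code meanings follow RFC 6455, and the interpretation of 1008 as a key or quota problem is taken from the handler above.

```go
package main

// closeAction is a hypothetical set of reactions to a WebSocket close code.
type closeAction int

const (
	actionIgnore    closeAction = iota // normal shutdown, nothing to do
	actionReconnect                    // transient failure, retry the same provider
	actionFallback                     // retrying will not help, switch providers
)

// actionForCloseCode maps a WebSocket close code to an action.
func actionForCloseCode(code int) closeAction {
	switch code {
	case 1000: // Normal Closure
		return actionIgnore
	case 1008: // Policy Violation: treated above as invalid API key or quota exceeded
		return actionFallback
	case 1006, 1011: // Abnormal Closure, Internal Error
		return actionReconnect
	default: // Unknown codes are assumed to be transient
		return actionReconnect
	}
}
```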

Best Practices

1. Use Interim Results for UX

// Show a typing indicator while the user is still speaking
for event := range stt.Events() {
    if !event.IsFinal && len(event.Text) > 0 {
        ui.ShowTypingIndicator()
    }
}
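
Interim results can also drive a live caption. Each interim result replaces the previous one, and a final result commits the text. The sketch below assumes a trimmed-down `TranscriptEvent` with only the two fields it needs, and a hypothetical `transcriptBuffer` type.

```go
package main

import "strings"

// TranscriptEvent is a trimmed-down version of the event type used on this page.
type TranscriptEvent struct {
	Text    string
	IsFinal bool
}

// transcriptBuffer keeps committed segments plus the latest interim text.
type transcriptBuffer struct {
	finals  []string
	interim string
}

// Apply commits final text and replaces (never appends) interim text.
func (b *transcriptBuffer) Apply(ev TranscriptEvent) {
	if ev.IsFinal {
		if ev.Text != "" {
			b.finals = append(b.finals, ev.Text)
		}
		b.interim = ""
		return
	}
	b.interim = ev.Text
}

// Display returns the committed text followed by the current interim text.
func (b *transcriptBuffer) Display() string {
	parts := append([]string{}, b.finals...)
	if b.interim != "" {
		parts = append(parts, b.interim)
	}
	return strings.Join(parts, " ")
}
```

Replacing rather than appending interim text matters because each interim result restates the whole segment so far, often with corrections to earlier words.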

2. Handle Network Issues

// Automatic reconnection with backoff
func (d *DeepgramSTT) reconnect() {
    backoff := 100 * time.Millisecond
    maxBackoff := 5 * time.Second

    for {
        err := d.Connect(context.Background())
        if err == nil {
            return
        }

        time.Sleep(backoff)
        backoff = min(backoff*2, maxBackoff)
    }
}
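
The loop above retries forever. A variant that stops when a context is cancelled lets the caller give up and switch to a fallback provider. The following sketch uses hypothetical helpers (`nextBackoff`, `retryWithBackoff`, `retryFor`) and takes the connect step as a closure, for example `func() error { return d.Connect(ctx) }`.

```go
package main

import (
	"context"
	"time"
)

// nextBackoff doubles the delay, capped at maxBackoff.
func nextBackoff(cur, maxBackoff time.Duration) time.Duration {
	next := cur * 2
	if next > maxBackoff {
		return maxBackoff
	}
	return next
}

// retryWithBackoff calls connect until it succeeds or ctx is done.
// It returns nil on success and ctx.Err() if the context ends first.
func retryWithBackoff(ctx context.Context, connect func() error, initial, maxBackoff time.Duration) error {
	backoff := initial
	for {
		if err := ctx.Err(); err != nil {
			return err
		}
		if err := connect(); err == nil {
			return nil
		}
		// Wait for the backoff delay, but wake up early on cancellation.
		timer := time.NewTimer(backoff)
		select {
		case <-ctx.Done():
			timer.Stop()
			return ctx.Err()
		case <-timer.C:
		}
		backoff = nextBackoff(backoff, maxBackoff)
	}
}

// retryFor is a convenience wrapper that gives up after timeout.
func retryFor(timeout time.Duration, connect func() error, initial, maxBackoff time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return retryWithBackoff(ctx, connect, initial, maxBackoff)
}
```

Adding random jitter to each delay is a common refinement when many agents reconnect at the same time; it is omitted here to keep the sketch short.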

3. Monitor Performance

// Track key metrics
metrics.RecordHistogram("stt.deepgram.latency_ms", latency.Milliseconds())
metrics.RecordCounter("stt.deepgram.transcripts_total", 1)
metrics.RecordHistogram("stt.deepgram.confidence", confidence)

Troubleshooting

Issue	Cause	Solution
High latency	Wrong endpoint region	Use nearest regional endpoint
Poor accuracy	Wrong language code	Verify BCP-47 language code
No interim results	Parameter not set	Add `interim_results=true`
Cut-off speech	Endpointing too aggressive	Increase `endpointing` value
Missing words	Audio too quiet	Check audio levels, add volume normalization
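
For the "audio too quiet" case, volume normalization can be as simple as peak scaling on the decoded 16-bit samples. The sketch below is one possible approach under stated assumptions, not a recommendation from Deepgram: `normalizePeak` is a hypothetical helper, and the gain cap exists so that near-silent chunks do not have their background noise amplified.

```go
package main

import "math"

// normalizePeak scales 16-bit samples so the loudest one reaches targetPeak,
// never applying more than maxGain. Silent input is returned unchanged.
func normalizePeak(samples []int16, targetPeak int, maxGain float64) []int16 {
	// Find the largest absolute sample value.
	peak := 0
	for _, s := range samples {
		v := int(s)
		if v < 0 {
			v = -v
		}
		if v > peak {
			peak = v
		}
	}

	out := make([]int16, len(samples))
	if peak == 0 {
		copy(out, samples)
		return out
	}

	gain := math.Min(float64(targetPeak)/float64(peak), maxGain)
	for i, s := range samples {
		v := math.Round(float64(s) * gain)
		// Clamp to the int16 range in case targetPeak exceeds it.
		v = math.Max(math.MinInt16, math.Min(math.MaxInt16, v))
		out[i] = int16(v)
	}
	return out
}
```

Normalizing each chunk independently can make the volume pump between chunks, so a production pipeline would more likely smooth the gain over time.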

Next Steps

Google Chirp - For Indic languages
Endpointing Guide - Fine-tune speech detection
Latency Optimization - Reduce response time