# Deepgram STT

Deepgram is our recommended STT provider for English-language voice agents due to its industry-leading latency and accuracy.
## Why Deepgram?

- Lowest Latency: ~150ms streaming latency
- High Accuracy: Nova-3 achieves <5% WER on clean audio
- Interim Results: real-time partial transcriptions
- Cost Effective: ~$0.0043/minute on pay-as-you-go
## Configuration

### Environment Variables

```bash
DEEPGRAM_API_KEY=your_deepgram_api_key
```
### Agent Configuration

```json
{
  "sttProvider": "deepgram",
  "sttConfig": {
    "model": "nova-3",
    "language": "en",
    "endpointing": 200,
    "utterance_end_ms": 1000,
    "interim_results": true,
    "smart_format": true,
    "punctuate": true,
    "diarize": false
  }
}
```
## Model Options

| Model | Accuracy | Latency | Cost | Best For |
|---|---|---|---|---|
| `nova-3` | Highest | ~150ms | $$$ | Production |
| `nova-2` | High | ~150ms | $$ | Cost-sensitive |
| `nova` | Good | ~150ms | $ | Basic use |
| `enhanced` | Good | ~200ms | $ | General |
| `base` | Basic | ~150ms | $ | Development |
## Endpointing Settings

Fine-tune when Deepgram determines the user has finished speaking:

```go
type DeepgramConfig struct {
	Endpointing    int  `json:"endpointing"`      // Silence before final (ms)
	UtteranceEndMs int  `json:"utterance_end_ms"` // Max mid-utterance silence (ms)
	InterimResults bool `json:"interim_results"`  // Enable partial results
}
```
### Recommended Settings

For conversational agents (faster response):

```json
{
  "endpointing": 200,
  "utterance_end_ms": 1000,
  "interim_results": true
}
```

For dictation/longer utterances:

```json
{
  "endpointing": 500,
  "utterance_end_ms": 2000,
  "interim_results": true
}
```
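The two presets above can be captured as small constructors so call sites don't repeat magic numbers. This is a sketch; the type and function names are ours:

```go
package main

import "fmt"

// EndpointingConfig holds the silence-detection knobs from the examples above.
type EndpointingConfig struct {
	Endpointing    int  `json:"endpointing"`
	UtteranceEndMs int  `json:"utterance_end_ms"`
	InterimResults bool `json:"interim_results"`
}

// ConversationalPreset favors fast turn-taking: finalize after 200ms of silence.
func ConversationalPreset() EndpointingConfig {
	return EndpointingConfig{Endpointing: 200, UtteranceEndMs: 1000, InterimResults: true}
}

// DictationPreset tolerates longer pauses mid-utterance.
func DictationPreset() EndpointingConfig {
	return EndpointingConfig{Endpointing: 500, UtteranceEndMs: 2000, InterimResults: true}
}

func main() {
	fmt.Println(ConversationalPreset().Endpointing, DictationPreset().UtteranceEndMs)
}
```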
## Streaming Implementation

Our implementation uses WebSocket streaming for lowest latency:

```go
// Simplified streaming flow
func (d *DeepgramSTT) StreamTranscribe(ctx context.Context, audioChan <-chan []byte) <-chan string {
	resultChan := make(chan string)
	go func() {
		defer close(resultChan)

		// Connect to the Deepgram WebSocket endpoint.
		conn, err := d.connect(ctx)
		if err != nil {
			return
		}
		defer conn.Close()

		// Send audio chunks as binary frames.
		go func() {
			for audio := range audioChan {
				if err := conn.WriteMessage(websocket.BinaryMessage, audio); err != nil {
					return
				}
			}
		}()

		// Receive transcriptions until the connection closes.
		for {
			_, msg, err := conn.ReadMessage()
			if err != nil {
				return
			}
			var result DeepgramResult
			if err := json.Unmarshal(msg, &result); err != nil {
				continue
			}
			if result.IsFinal {
				resultChan <- result.Transcript
			}
		}
	}()
	return resultChan
}
```
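A caller simply ranges over the returned channel until it closes. The sketch below demonstrates that consumption pattern with the channel fed from a slice to simulate the Deepgram stream (no network involved; `consumeTranscripts` is an illustrative name):

```go
package main

import "fmt"

// consumeTranscripts drains a channel of final transcripts, invoking the
// handler for each one, until the producer closes the channel.
func consumeTranscripts(results <-chan string, handle func(string)) {
	for transcript := range results {
		handle(transcript)
	}
}

func main() {
	results := make(chan string)
	go func() {
		defer close(results)
		// Simulated final transcripts standing in for a live stream.
		for _, t := range []string{"hello there", "what is my order status"} {
			results <- t
		}
	}()
	consumeTranscripts(results, func(t string) { fmt.Println("final:", t) })
}
```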
## Interim Results

Interim results allow the LLM to start generating responses before the user finishes speaking:

```text
Timeline: ──────────────────────────────────────────────▶
User speaking: "What is my order status for twelve thirty four"

Interim 1: "What is"
Interim 2: "What is my order"
Interim 3: "What is my order status for"
Final:     "What is my order status for 1234"

LLM starts: ────────▶ [Can start early!]
```
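Routing interim and final messages to different paths can be sketched as below, assuming the simplified message shape from the streaming example (a transcript plus an `is_final` flag; the real Deepgram response nests the transcript deeper, and `route` is an illustrative name):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DeepgramResult matches the simplified shape used in the streaming sketch.
type DeepgramResult struct {
	Transcript string `json:"transcript"`
	IsFinal    bool   `json:"is_final"`
}

// route sends interim results to a speculative path (e.g. warming up the
// LLM) and final results to the authoritative path.
func route(msg []byte, onInterim, onFinal func(string)) error {
	var r DeepgramResult
	if err := json.Unmarshal(msg, &r); err != nil {
		return err
	}
	if r.IsFinal {
		onFinal(r.Transcript)
	} else {
		onInterim(r.Transcript)
	}
	return nil
}

func main() {
	msgs := [][]byte{
		[]byte(`{"transcript":"What is","is_final":false}`),
		[]byte(`{"transcript":"What is my order status for 1234","is_final":true}`),
	}
	for _, m := range msgs {
		route(m,
			func(t string) { fmt.Println("interim:", t) },
			func(t string) { fmt.Println("final:", t) })
	}
}
```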
## Smart Formatting

Enable smart formatting for better transcription quality:

```json
{
  "smart_format": true,
  "punctuate": true,
  "numerals": true
}
```

Before smart format: "one two three four"
After smart format: "1234"
## Language Support

| Language | Code | Model Support |
|---|---|---|
| English | `en`, `en-US`, `en-GB` | All models |
| Spanish | `es` | Nova-2, Nova-3 |
| French | `fr` | Nova-2, Nova-3 |
| German | `de` | Nova-2, Nova-3 |
| Hindi | `hi` | Limited |
## Troubleshooting

### High Word Error Rate

- Check audio quality (sample rate, encoding)
- Enable noise reduction on the client side
- Use the `nova-3` model for best accuracy

### Delayed Transcription

- Reduce the `endpointing` value
- Enable `interim_results`
- Check network latency to Deepgram

### Missing Words

- Check for audio clipping
- Verify VAD isn't cutting off speech
- Increase `utterance_end_ms`
## Pricing
| Plan | Cost | Included |
|---|---|---|
| Pay-as-you-go | $0.0043/min | - |
| Growth | $0.0036/min | Support |
| Enterprise | Custom | SLA, support |
Prices as of 2024. Check Deepgram Pricing for current rates.
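For budgeting, monthly cost is just usage times the per-minute rate. A small sketch using the 2024 rates from the table above (the function name is ours; re-check rates against Deepgram's current pricing):

```go
package main

import "fmt"

// estimateMonthlyCost multiplies monthly usage by the per-minute rate.
func estimateMonthlyCost(minutes, ratePerMin float64) float64 {
	return minutes * ratePerMin
}

func main() {
	const payAsYouGo = 0.0043 // $/min, pay-as-you-go (2024)
	const growth = 0.0036     // $/min, growth plan (2024)
	minutes := 10000.0        // example monthly volume
	fmt.Printf("pay-as-you-go: $%.2f, growth: $%.2f\n",
		estimateMonthlyCost(minutes, payAsYouGo),
		estimateMonthlyCost(minutes, growth))
}
```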
## Next Steps
- Google Chirp - Multi-language alternative
- TTS Providers - Text-to-Speech options
- Latency Optimization - Further improvements