# STT Providers Overview
Speech-to-Text (STT) converts the caller's audio into text that the LLM can understand. Choosing the right STT provider significantly impacts accuracy, latency, and cost.
## Supported Providers
| Provider | Model | Latency | Languages | Best For |
|---|---|---|---|---|
| Deepgram | Nova-3 | ⚡ Fastest | 35+ | Production voice agents |
| Google | Chirp 2 | 🚀 Fast | 125+ | Indic languages |
| Azure | Neural | 🚀 Fast | 100+ | Enterprise |
| ElevenLabs | Scribe | 🚀 Fast | 99+ | Regional languages |
| AssemblyAI | Universal-2 | 🚀 Fast | 50+ | Accuracy-focused |
| OpenAI | Whisper | 🐢 Moderate | 100+ | Multilingual |
## Quick Comparison
**Latency (Time to First Partial, lower is better):**

```
Deepgram Nova-3   ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░   80ms
Google Chirp      ████████░░░░░░░░░░░░░░░░░░░░░░░░  120ms
Azure Neural      ████████░░░░░░░░░░░░░░░░░░░░░░░░  130ms
ElevenLabs        ██████████░░░░░░░░░░░░░░░░░░░░░░  150ms
AssemblyAI        ████████████░░░░░░░░░░░░░░░░░░░░  180ms
OpenAI Whisper    ████████████████████████████████  300ms
                  0ms             150ms          300ms
```
## Choosing the Right Provider

### For Low Latency (Recommended)

**Deepgram Nova-3**
- Fastest time-to-first-partial (~80ms)
- Excellent for real-time voice agents
- Best endpointing (detects speech completion)
- Smart formatting included
```json
{
  "sttProvider": "deepgram",
  "sttModel": "nova-3"
}
```
### For Indic Languages

**Google Chirp 2**
- Best accuracy for Hindi, Tamil, Telugu, Bengali
- 125+ languages supported
- Chirp 2 is optimized for telephony
```json
{
  "sttProvider": "google",
  "sttModel": "chirp_2"
}
```
### For Regional Languages

**ElevenLabs Scribe**
- Excellent for Assamese, Odia, Punjabi
- 99 languages with good regional coverage
- Competitive pricing
```json
{
  "sttProvider": "elevenlabs"
}
```
### For Enterprise

**Azure Neural**
- Enterprise SLAs and compliance
- Custom speech models available
- Global deployment options
```json
{
  "sttProvider": "azure"
}
```
## Streaming vs Batch
All our STT integrations use streaming for real-time voice agents:
```
Batch STT (not suitable for voice):
────────────────────────────────────
User speaks for 5 seconds → Wait → Get full transcript

Streaming STT (what we use):
────────────────────────────────────
User speaks → Interim results every 100ms → Final transcript
                     │
                     └── LLM can start preparing response
```
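In code, a streaming session boils down to pushing small audio frames while concurrently reading transcript events. The sketch below is illustrative only; the `STTStream` interface and its method names are not from a specific SDK, and `TranscriptEvent` is the event type shown in the next section:

```go
// Illustrative streaming STT contract; names are not from a specific SDK.
type STTStream interface {
	// SendAudio pushes one 20 ms telephony frame (160 bytes of 8 kHz μ-law).
	SendAudio(frame []byte) error
	// Results delivers interim and final TranscriptEvents as they arrive.
	Results() <-chan TranscriptEvent
	// Close flushes buffered audio and ends the stream.
	Close() error
}
```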
## Interim Results
Interim results allow faster response preparation:
```go
// STT emits interim results as the user speaks.
type TranscriptEvent struct {
	Text      string
	IsFinal   bool
	Stability float32 // 0.0-1.0, higher = more stable
}

// Example stream for "What is my order status?"
// t=100ms: {Text: "What", IsFinal: false, Stability: 0.8}
// t=200ms: {Text: "What is", IsFinal: false, Stability: 0.85}
// t=300ms: {Text: "What is my", IsFinal: false, Stability: 0.9}
// t=500ms: {Text: "What is my order", IsFinal: false, Stability: 0.92}
// t=700ms: {Text: "What is my order status", IsFinal: true, Stability: 1.0}
```
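One way to exploit this is to start preparing the response on a high-stability interim and commit only on the final transcript. The sketch below is illustrative; `prewarm` and the event channel wiring are placeholders, not part of any provider SDK:

```go
// Sketch: start preparing the LLM turn on high-stability interims,
// commit only when the final transcript arrives.
func consumeTranscripts(events <-chan TranscriptEvent, prewarm func(text string)) string {
	for ev := range events {
		if ev.IsFinal {
			return ev.Text // final transcript goes to the LLM
		}
		if ev.Stability >= 0.9 {
			prewarm(ev.Text) // e.g. begin retrieval / prompt assembly early
		}
	}
	return "" // stream closed without a final result
}
```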
## Endpointing
Endpointing detects when the user has finished speaking:
| Provider | Endpointing | Configurable |
|---|---|---|
| Deepgram | ⭐⭐⭐⭐⭐ Smart | Yes |
| Google | ⭐⭐⭐⭐ Good | Yes |
| Azure | ⭐⭐⭐⭐ Good | Yes |
| ElevenLabs | ⭐⭐⭐ Basic | Limited |
| AssemblyAI | ⭐⭐⭐⭐ Good | Yes |
### Deepgram Endpointing Configuration
```json
{
  "sttProvider": "deepgram",
  "sttConfig": {
    "endpointing": 300,
    "utterance_end_ms": 1000,
    "interim_results": true
  }
}
```
| Parameter | Default | Description |
|---|---|---|
| `endpointing` | 300 | Silence duration (ms) that triggers `is_final` |
| `utterance_end_ms` | 1000 | Maximum wait (ms) for speech completion |
| `interim_results` | true | Enable real-time partial transcripts |
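In practice these two signals drive turn-taking: `is_final` marks a completed segment after the configured silence, and the utterance-end ceiling keeps the agent from waiting indefinitely. A rough sketch of that logic, building on the `TranscriptEvent` type above and assuming the standard `time` package is imported (illustrative only, not a provider API):

```go
// Sketch: treat the first final transcript as the end of the caller's turn,
// bounded by a hard ceiling so the agent never waits forever.
func waitForTurnEnd(events <-chan TranscriptEvent, ceiling time.Duration) (string, bool) {
	deadline := time.After(ceiling) // cf. utterance_end_ms
	for {
		select {
		case ev, ok := <-events:
			if !ok {
				return "", false // stream closed
			}
			if ev.IsFinal {
				return ev.Text, true // endpointing declared the segment complete
			}
		case <-deadline:
			return "", false // no final transcript within the ceiling
		}
	}
}
```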
## Language Support Matrix
| Language | Deepgram | Google | Azure | ElevenLabs | AssemblyAI |
|---|---|---|---|---|---|
| English (US) | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Hindi | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| Spanish | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| French | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| German | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Tamil | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
| Telugu | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
| Bengali | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
| Assamese | ❌ | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ | ❌ |
| Japanese | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
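If you route calls by caller language, the matrix can be collapsed into a simple default mapping. The sketch below is a simplification of the table, not an exhaustive rule set; the returned strings match the `sttProvider` values used in the config examples on this page:

```go
// Sketch: choose a default STT provider from the caller's ISO language code.
func defaultSTTProvider(lang string) string {
	switch lang {
	case "hi", "ta", "te", "bn": // Hindi, Tamil, Telugu, Bengali
		return "google" // Chirp 2 scores highest for Indic languages
	case "as", "or", "pa": // Assamese, Odia, Punjabi
		return "elevenlabs" // best regional coverage in the matrix
	default:
		return "deepgram" // lowest latency for broadly supported languages
	}
}
```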
## Cost Comparison
| Provider | Cost per Minute | Monthly (1M minutes) |
|---|---|---|
| Deepgram Nova-3 | $0.0043 | $4,300 |
| Google Chirp | $0.016 | $16,000 |
| Azure Neural | $0.016 | $16,000 |
| ElevenLabs | $0.007 | $7,000 |
| AssemblyAI | $0.0055 | $5,500 |
| OpenAI Whisper | $0.006 | $6,000 |
## Audio Requirements
| Parameter | Requirement | Notes |
|---|---|---|
| Sample Rate | 8000 Hz | Telephony standard (μ-law) |
| Channels | Mono | Single channel |
| Bit Depth | 16-bit | Linear PCM |
| Encoding | Linear16 or μ-law | Provider dependent |
### Audio Conversion
```go
// Telephony sends μ-law, STT may need Linear16
func mulawToLinear16(mulaw []byte) []int16 {
	linear := make([]int16, len(mulaw))
	for i, sample := range mulaw {
		linear[i] = mulawToLinearSample(sample)
	}
	return linear
}
```
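`mulawToLinearSample` is not shown above; one way to implement it is the standard G.711 μ-law expansion. This is a sketch, not code taken from this codebase:

```go
// mulawToLinearSample expands one G.711 μ-law byte into a 16-bit PCM sample.
func mulawToLinearSample(mu byte) int16 {
	mu = ^mu // μ-law bytes are transmitted bit-inverted
	sign := mu & 0x80
	exponent := (mu >> 4) & 0x07
	mantissa := mu & 0x0F
	// Reconstruct the magnitude: restore the implicit bias, then shift by segment.
	magnitude := (int16(mantissa)<<3 + 0x84) << exponent
	if sign != 0 {
		return 0x84 - magnitude
	}
	return magnitude - 0x84
}
```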
## Error Handling
Handle STT failures gracefully:
```go
func handleSTTError(err error) {
	switch {
	case errors.Is(err, ErrAudioTooQuiet):
		// Ask the user to speak louder
		tts.Speak("I'm having trouble hearing you. Could you speak a bit louder?")
	case errors.Is(err, ErrConnectionLost):
		// Reconnect automatically
		stt.Reconnect()
	case errors.Is(err, ErrRateLimited):
		// Use the fallback provider
		stt.SwitchProvider("fallback")
	}
}
```
## Next Steps
- Deepgram Configuration - Fastest STT for voice
- Google Chirp - Best for Indic languages
- Azure Speech - Enterprise deployment
- ElevenLabs Scribe - Regional languages