# STT Providers Overview
Speech-to-Text (STT) is a critical component of your voice agent. The right choice depends on your language requirements, latency needs, and budget.
## Supported Providers
| Provider | Languages | Latency | Cost/min | Best For |
|---|---|---|---|---|
| Deepgram | 30+ | ~150ms | ~₹0.35 | Low latency, English |
| Google Chirp | 100+ | ~200ms | ~₹1.34 | Multi-language |
| Azure Speech | 100+ | ~180ms | ~₹0.84 | Enterprise, Indic |
| ElevenLabs Scribe | 30+ | ~250ms | ~₹0.56 | Indic languages |
| AssemblyAI | 10+ | ~200ms | ~₹0.63 | Accuracy |
| OpenAI Whisper | 50+ | ~300ms | ~₹0.50 | Quality over speed |
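To make the trade-offs in the table concrete, here is a small sketch that encodes the table and filters providers by a latency and cost budget. The figures simply mirror the (approximate) table above, and the `shortlist` helper is illustrative, not part of any SDK:

```python
# Illustrative: pick an STT provider from the comparison table above.
# Figures mirror the table (approximate latency in ms, cost in INR/min).
PROVIDERS = {
    "deepgram":   {"latency_ms": 150, "cost_inr": 0.35, "languages": 30},
    "google":     {"latency_ms": 200, "cost_inr": 1.34, "languages": 100},
    "azure":      {"latency_ms": 180, "cost_inr": 0.84, "languages": 100},
    "elevenlabs": {"latency_ms": 250, "cost_inr": 0.56, "languages": 30},
    "assemblyai": {"latency_ms": 200, "cost_inr": 0.63, "languages": 10},
    "whisper":    {"latency_ms": 300, "cost_inr": 0.50, "languages": 50},
}

def shortlist(max_latency_ms: int, max_cost_inr: float) -> list[str]:
    """Return providers meeting both constraints, fastest first."""
    ok = [(p, v) for p, v in PROVIDERS.items()
          if v["latency_ms"] <= max_latency_ms and v["cost_inr"] <= max_cost_inr]
    return [p for p, v in sorted(ok, key=lambda kv: kv[1]["latency_ms"])]

print(shortlist(200, 0.70))  # ['deepgram', 'assemblyai']
```

Tightening either budget shrinks the shortlist; an empty result means no single provider meets both constraints.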
## Choosing a Provider

### For Lowest Latency

**Recommended:** Deepgram Nova-3
- Industry-leading streaming latency (~150ms)
- Excellent English accuracy
- Interim results for faster response
### For Indian Languages

**Recommended:** ElevenLabs Scribe or Google Chirp
- Hindi, Tamil, Telugu, Assamese support
- Reasonable accuracy (roughly 10-25% word error rate)
- Good cost-performance ratio
### For Enterprise

**Recommended:** Azure Speech
- SOC 2, HIPAA compliance
- Excellent Indic language support
- Custom model training available
## Configuration

### Agent-Level Configuration
Each agent can have its own STT provider:
```json
{
  "name": "Hindi Support Agent",
  "language": "hi",
  "sttProvider": "elevenlabs",
  "sttModel": "scribe",
  "sttConfig": {
    "endpointing": 300,
    "utterance_end_ms": 1500
  }
}
```
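Because each agent can override the STT settings, a resolution step along these lines is typical. The field names follow the agent JSON above; the merge logic and the org-wide defaults shown here are assumptions for illustration:

```python
# Illustrative sketch: resolve an agent's STT settings against org defaults.
# Field names follow the agent JSON above; the defaults are hypothetical.
DEFAULTS = {"sttProvider": "deepgram", "sttModel": "nova-3",
            "sttConfig": {"endpointing": 300, "utterance_end_ms": 1500}}

def resolve_stt(agent: dict) -> dict:
    """Agent-level values override the defaults, including nested sttConfig."""
    merged = {**DEFAULTS, **{k: v for k, v in agent.items() if k in DEFAULTS}}
    merged["sttConfig"] = {**DEFAULTS["sttConfig"], **agent.get("sttConfig", {})}
    return merged

agent = {"name": "Hindi Support Agent", "language": "hi",
         "sttProvider": "elevenlabs", "sttModel": "scribe",
         "sttConfig": {"endpointing": 300, "utterance_end_ms": 1500}}
print(resolve_stt(agent)["sttProvider"])  # elevenlabs
```

An agent that sets no STT fields simply inherits the defaults.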
### Provider-Specific Settings

#### Deepgram
```json
{
  "sttProvider": "deepgram",
  "sttConfig": {
    "model": "nova-3",
    "endpointing": 200,
    "utterance_end_ms": 1000,
    "interim_results": true,
    "smart_format": true
  }
}
```
#### Google Chirp
```json
{
  "sttProvider": "google",
  "sttConfig": {
    "model": "chirp",
    "enable_automatic_punctuation": true,
    "enable_spoken_punctuation": false
  }
}
```
## Latency Comparison

```text
User finishes speaking
        │
        ▼
┌───────────────────────────────────────────────┐
│ Deepgram Nova-3    ████████░░░░░░░░  150ms    │
│ Azure Speech       █████████░░░░░░░  180ms    │
│ Google Chirp       ██████████░░░░░░  200ms    │
│ AssemblyAI         ██████████░░░░░░  200ms    │
│ ElevenLabs Scribe  ████████████░░░░  250ms    │
│ OpenAI Whisper     ███████████████░  300ms    │
└───────────────────────────────────────────────┘
                                       → Time
```
## Streaming vs Batch

### Streaming STT (Recommended)

- Real-time transcription as the user speaks
- Interim results enable faster LLM responses
- Lower perceived latency
```text
User: "What is my order—"
STT:  [interim] "What is my"
STT:  [interim] "What is my order"
STT:  [final]   "What is my order status?"
```
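A minimal consumer for such a stream might look like the sketch below: interim results update the display cheaply, and only final results are passed on to the LLM. The event shape (`type`/`text` keys) is an assumption about the transcript feed, not a documented API:

```python
# Illustrative consumer of a streaming STT feed: interim results update the
# display; only final results are acted on. Event shape is assumed.
def consume(events):
    display, finals = "", []
    for ev in events:
        if ev["type"] == "interim":
            display = ev["text"]       # cheap UI update, may still be revised
        elif ev["type"] == "final":
            finals.append(ev["text"])  # stable text, safe to send to the LLM
    return finals

events = [
    {"type": "interim", "text": "What is my"},
    {"type": "interim", "text": "What is my order"},
    {"type": "final",   "text": "What is my order status?"},
]
print(consume(events))  # ['What is my order status?']
```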
### Batch STT
- Full audio processed at once
- Higher accuracy potential
- Higher latency
## Endpointing Configuration

Endpointing determines when the user has finished speaking:

| Setting | Description | Recommended |
|---|---|---|
| `endpointing` | Silence before end-of-turn (ms) | 200-400ms |
| `utterance_end_ms` | Max silence within an utterance (ms) | 1000-1500ms |
| `vad_threshold` | Voice activity detection threshold | 0.7-0.9 |
**Aggressive (faster response):**

```json
{
  "endpointing": 200,
  "utterance_end_ms": 800
}
```
**Conservative (more complete sentences):**

```json
{
  "endpointing": 400,
  "utterance_end_ms": 1500
}
```
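Because the `endpointing` silence window elapses after the user stops speaking, it adds directly to perceived response time on top of the provider's own streaming latency. A quick sketch, assuming Deepgram's ~150ms figure from the table above:

```python
# The endpointing window is waited out AFTER the user stops speaking, so it
# adds directly to perceived latency. 150 ms provider latency is an assumption.
def perceived_latency_ms(provider_latency_ms: int, endpointing_ms: int) -> int:
    return provider_latency_ms + endpointing_ms

aggressive = perceived_latency_ms(150, 200)    # 350 ms before the LLM sees text
conservative = perceived_latency_ms(150, 400)  # 550 ms
print(conservative - aggressive)  # 200
```

In other words, the conservative preset trades roughly 200ms of extra wait for fewer premature cut-offs.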
## Cost Optimization
- **Use an appropriate model:** Nova-2 is cheaper than Nova-3
- **Optimize audio:** compress silence, use VAD
- **Cache common phrases:** skip STT for known patterns
- **Batch non-real-time work:** use the batch API for recordings
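As a rough budgeting aid, the per-minute rates from the provider table can be turned into a monthly estimate. The rates are approximate and the helper is illustrative:

```python
# Illustrative back-of-the-envelope STT cost estimate; rates mirror the
# provider table above and are approximate INR per minute.
RATE_INR_PER_MIN = {"deepgram": 0.35, "google": 1.34, "whisper": 0.50}

def monthly_cost_inr(provider: str, calls_per_day: int, avg_call_min: float,
                     days: int = 30) -> float:
    """Total transcription minutes per month times the per-minute rate."""
    return round(RATE_INR_PER_MIN[provider] * calls_per_day * avg_call_min * days, 2)

# 500 calls/day at 3 minutes each on Deepgram:
print(monthly_cost_inr("deepgram", 500, 3.0))  # 15750.0
```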
## Next Steps
- **Deepgram Configuration** - detailed Deepgram setup
- **TTS Providers** - Text-to-Speech options
- **Latency Optimization** - further reduce response time