Azure Speech STT
Azure Speech Services provides enterprise-grade STT with support for 100+ languages, custom model training, and compliance certifications.
Why Azure?
| Feature | Azure Speech | Google Chirp |
|---|---|---|
| Languages | 100+ | 125+ |
| Custom Models | ✅ Yes | Limited |
| On-premises | ✅ Containers | ❌ No |
| Compliance | SOC 2, HIPAA, GDPR | SOC 2, HIPAA |
| Enterprise SLA | 99.9% | 99.9% |
| Cost | $0.016/min | $0.016/min |
Best for: Enterprise deployments, regulated industries, custom vocabulary needs.
Configuration
Basic Setup
{
"agent": {
"name": "Enterprise Support",
"sttProvider": "azure",
"sttConfig": {
"region": "eastus",
"language": "en-US"
}
}
}
Environment Variables
AZURE_SPEECH_API_KEY=your_azure_speech_key
AZURE_SPEECH_REGION=eastus
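A minimal sketch of reading these variables at startup. It assumes the AzureSTT and TranscriptEvent types defined under Implementation below; newFromEnv is a hypothetical helper, not part of any SDK:
func newFromEnv() (*AzureSTT, error) {
	apiKey := os.Getenv("AZURE_SPEECH_API_KEY")
	region := os.Getenv("AZURE_SPEECH_REGION")
	if apiKey == "" || region == "" {
		return nil, fmt.Errorf("AZURE_SPEECH_API_KEY and AZURE_SPEECH_REGION must be set")
	}
	return &AzureSTT{
		apiKey:    apiKey,
		region:    region,
		language:  "en-US",
		eventChan: make(chan TranscriptEvent, 16),
	}, nil
}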
Advanced Configuration
{
"sttProvider": "azure",
"sttConfig": {
"region": "eastus",
"language": "en-US",
"outputFormat": "detailed",
"profanityOption": "masked",
"enableDictation": false,
"enableInterimResults": true,
"endpointId": "custom-endpoint-id",
"initialSilenceTimeout": 5000,
"endSilenceTimeout": 1000
}
}
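On the Go side, these options can be unmarshalled into a struct whose tags mirror the JSON keys. This is a sketch based only on the fields shown above; the timeout values are in milliseconds:
// Sketch: Go mapping of the sttConfig block above.
type STTConfig struct {
	Region                string `json:"region"`
	Language              string `json:"language"`
	OutputFormat          string `json:"outputFormat"`          // "simple" or "detailed"
	ProfanityOption       string `json:"profanityOption"`       // "raw", "masked", or "removed"
	EnableDictation       bool   `json:"enableDictation"`
	EnableInterimResults  bool   `json:"enableInterimResults"`
	EndpointID            string `json:"endpointId"`            // optional Custom Speech endpoint
	InitialSilenceTimeout int    `json:"initialSilenceTimeout"` // ms
	EndSilenceTimeout     int    `json:"endSilenceTimeout"`     // ms
}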
Implementation
WebSocket Connection
type AzureSTT struct {
	apiKey     string
	region     string
	language   string
	conn       *websocket.Conn
	eventChan  chan TranscriptEvent
	headerSent bool // set once the WAV header has been written to the stream
}
func (a *AzureSTT) Connect(ctx context.Context) error {
// Build WebSocket URL
wsURL := fmt.Sprintf(
"wss://%s.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=%s&format=detailed",
a.region,
a.language,
)
headers := http.Header{}
headers.Set("Ocp-Apim-Subscription-Key", a.apiKey)
conn, _, err := websocket.DefaultDialer.DialContext(ctx, wsURL, headers)
if err != nil {
return fmt.Errorf("azure connect: %w", err)
}
a.conn = conn
go a.receiveLoop()
return nil
}
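Usage sketch, assuming the hypothetical newFromEnv helper from the Environment Variables section: connect, then drain transcript events from eventChan.
func runTranscription(ctx context.Context) error {
	stt, err := newFromEnv()
	if err != nil {
		return err
	}
	if err := stt.Connect(ctx); err != nil {
		return err
	}
	// Drain transcript events as they arrive
	for ev := range stt.eventChan {
		if ev.IsFinal {
			fmt.Println("final transcript:", ev.Text)
		}
	}
	return nil
}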
Audio Streaming
func (a *AzureSTT) SendAudio(audio []byte) error {
	// The WAV header describes the whole stream, so send it only once,
	// ahead of the first chunk (not before every chunk). For a live stream
	// the total data length is unknown; the first chunk's length is used
	// here as a placeholder.
	if !a.headerSent {
		header := createAudioHeader(len(audio))
		if err := a.conn.WriteMessage(websocket.BinaryMessage, header); err != nil {
			return err
		}
		a.headerSent = true
	}
	// Send the raw PCM audio data
	return a.conn.WriteMessage(websocket.BinaryMessage, audio)
}
func createAudioHeader(audioLength int) []byte {
// RIFF header for raw PCM
header := make([]byte, 44)
copy(header[0:4], []byte("RIFF"))
binary.LittleEndian.PutUint32(header[4:8], uint32(audioLength+36))
copy(header[8:12], []byte("WAVE"))
copy(header[12:16], []byte("fmt "))
binary.LittleEndian.PutUint32(header[16:20], 16)
binary.LittleEndian.PutUint16(header[20:22], 1) // PCM
binary.LittleEndian.PutUint16(header[22:24], 1) // Mono
binary.LittleEndian.PutUint32(header[24:28], 8000) // Sample rate
binary.LittleEndian.PutUint32(header[28:32], 16000) // Byte rate
binary.LittleEndian.PutUint16(header[32:34], 2) // Block align
binary.LittleEndian.PutUint16(header[34:36], 16) // Bits per sample
copy(header[36:40], []byte("data"))
binary.LittleEndian.PutUint32(header[40:44], uint32(audioLength))
return header
}
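Usage sketch: for 8 kHz, 16-bit mono PCM (the format the header above describes), a ~100 ms chunk is 1,600 bytes (8,000 samples/s × 2 bytes × 0.1 s). The pcm buffer is a hypothetical slice of raw audio:
const chunkSize = 1600 // ~100 ms of 8 kHz, 16-bit mono PCM
for offset := 0; offset < len(pcm); offset += chunkSize {
	end := offset + chunkSize
	if end > len(pcm) {
		end = len(pcm)
	}
	if err := stt.SendAudio(pcm[offset:end]); err != nil {
		return err
	}
}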
Receiving Results
func (a *AzureSTT) receiveLoop() {
	for {
		_, msg, err := a.conn.ReadMessage()
		if err != nil {
			return
		}
		var response AzureResponse
		if err := json.Unmarshal(msg, &response); err != nil {
			continue // skip messages that are not JSON results
		}
		switch response.RecognitionStatus {
		case "Success":
			confidence := 0.0
			if len(response.NBest) > 0 {
				confidence = response.NBest[0].Confidence
			}
			a.eventChan <- TranscriptEvent{
				Text:       response.DisplayText,
				Confidence: confidence,
				IsFinal:    true,
			}
		case "IntermediateResult":
			a.eventChan <- TranscriptEvent{
				Text:    response.Text,
				IsFinal: false,
			}
		case "EndOfDictation":
			// Session ended; no further results will arrive
		}
	}
}
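The receive loop assumes these two types. The AzureResponse fields follow the detailed output format (DisplayText plus an NBest list); treat the exact shape as an assumption and verify it against the service's actual JSON:
// Event emitted to the rest of the voice agent.
type TranscriptEvent struct {
	Text       string
	Confidence float64
	IsFinal    bool
}

// Subset of the recognition result fields used above.
type AzureResponse struct {
	RecognitionStatus string
	DisplayText       string // final recognized text ("detailed" format)
	Text              string // interim hypothesis text
	ErrorDetails      string
	NBest             []struct {
		Confidence float64
		Lexical    string
		Display    string
	}
}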
Custom Speech Models
Train models for domain-specific vocabulary:
Create Custom Model
func createCustomModel(projectID string, trainingData []TrainingItem) error {
	client := NewAzureCustomSpeechClient(apiKey, region)
	// Upload training data
	dataset, err := client.CreateDataset(DatasetParams{
		Name:        "domain-vocabulary",
		Description: "Custom terms for voice agent",
		Locale:      "en-US",
		Kind:        "Acoustic",
	})
	if err != nil {
		return fmt.Errorf("create dataset: %w", err)
	}
	// Train a model on the uploaded dataset
	_, err = client.CreateModel(ModelParams{
		Name:      "custom-voice-agent-model",
		BaseModel: "en-US-base",
		Datasets:  []string{dataset.ID},
	})
	if err != nil {
		return fmt.Errorf("create model: %w", err)
	}
	return nil
}
Use Custom Endpoint
{
"sttConfig": {
"endpointId": "your-custom-endpoint-id"
}
}
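To make the WebSocket implementation above honor endpointId, the URL builder in Connect has to target the custom endpoint. A hedged sketch: the cid query parameter and the endpointID field are assumptions here, so confirm the exact parameter name against the Custom Speech documentation.
// Inside Connect: append the custom endpoint ID when one is configured.
// a.endpointID is an assumed field populated from sttConfig.endpointId;
// endpoint IDs are GUIDs, so no escaping is needed.
if a.endpointID != "" {
	wsURL += "&cid=" + a.endpointID
}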
Language Support
Supported Languages (100+)
| Language | Code | Quality |
|---|---|---|
| English (US) | en-US | ⭐⭐⭐⭐⭐ |
| English (UK) | en-GB | ⭐⭐⭐⭐⭐ |
| English (India) | en-IN | ⭐⭐⭐⭐⭐ |
| Hindi | hi-IN | ⭐⭐⭐⭐ |
| Spanish | es-ES | ⭐⭐⭐⭐⭐ |
| French | fr-FR | ⭐⭐⭐⭐⭐ |
| German | de-DE | ⭐⭐⭐⭐⭐ |
| Japanese | ja-JP | ⭐⭐⭐⭐⭐ |
| Chinese | zh-CN | ⭐⭐⭐⭐⭐ |
| Arabic | ar-SA | ⭐⭐⭐⭐ |
Multi-Language Recognition
{
"sttConfig": {
"language": "en-US",
"additionalLanguages": ["es-ES", "fr-FR"],
"languageIdMode": "Continuous"
}
}
Silence Detection
Configure endpointing behavior:
type SilenceConfig struct {
InitialSilenceTimeout time.Duration // Max wait for speech start
EndSilenceTimeout time.Duration // Silence to end utterance
SegmentationSilence time.Duration // Silence between segments
}
// Conservative settings (don't cut off)
conservative := SilenceConfig{
InitialSilenceTimeout: 10 * time.Second,
EndSilenceTimeout: 2 * time.Second,
SegmentationSilence: 1 * time.Second,
}
// Aggressive settings (faster response)
aggressive := SilenceConfig{
InitialSilenceTimeout: 5 * time.Second,
EndSilenceTimeout: 500 * time.Millisecond,
SegmentationSilence: 300 * time.Millisecond,
}
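These durations map onto the millisecond keys shown in Advanced Configuration. A small helper sketch (segmentation silence has no corresponding key there, so it is left out):
// Convert a SilenceConfig into the millisecond values used by sttConfig.
func (c SilenceConfig) toSTTConfigValues() map[string]int {
	return map[string]int{
		"initialSilenceTimeout": int(c.InitialSilenceTimeout / time.Millisecond),
		"endSilenceTimeout":     int(c.EndSilenceTimeout / time.Millisecond),
	}
}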
Profanity Handling
{
"sttConfig": {
"profanityOption": "masked" // "raw", "masked", "removed"
}
}
| Option | Result |
|---|---|
| raw | Full text: "What the hell" |
| masked | Censored: "What the ****" |
| removed | Filtered: "What the" |
Error Handling
func (a *AzureSTT) handleError(response AzureResponse) {
switch response.RecognitionStatus {
case "NoMatch":
log.Debug("No speech detected")
case "InitialSilenceTimeout":
log.Debug("User didn't speak in time")
case "BabbleTimeout":
log.Warn("Too much background noise")
case "Error":
log.Error("Recognition error: %s", response.ErrorDetails)
a.reconnect()
}
}
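The reconnect referenced above is not shown elsewhere in this section; a minimal sketch with exponential backoff (retry count and delays are illustrative) might look like this:
// Close the old socket and retry Connect with exponential backoff.
func (a *AzureSTT) reconnect() {
	if a.conn != nil {
		a.conn.Close()
	}
	backoff := time.Second
	for attempt := 0; attempt < 5; attempt++ {
		if err := a.Connect(context.Background()); err == nil {
			return
		}
		time.Sleep(backoff)
		backoff *= 2 // double the delay between attempts
	}
	log.Error("azure stt: reconnect failed after 5 attempts")
}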
On-Premises Deployment
Deploy Azure Speech containers for data residency:
# docker-compose.yml
version: '3'
services:
speech-to-text:
image: mcr.microsoft.com/azure-cognitive-services/speechservices/speech-to-text:latest
ports:
- "5000:5000"
environment:
- Eula=accept
- Billing=https://eastus.api.cognitive.microsoft.com/
- ApiKey=${AZURE_SPEECH_API_KEY}
volumes:
- ./models:/models
// Connect to local container
config := SpeechConfig{
Endpoint: "ws://localhost:5000",
Language: "en-US",
}
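To point the same WebSocket client at the container, the URL builder needs to prefer an explicit endpoint over the cloud region. A sketch; the endpoint field is an assumption for this example, and the container is assumed to expose the same recognition path as the cloud service:
// Prefer an explicitly configured endpoint (e.g. the local container)
// over the regional cloud URL.
func (a *AzureSTT) dialURL() string {
	if a.endpoint != "" {
		return a.endpoint + "/speech/recognition/conversation/cognitiveservices/v1?language=" + a.language
	}
	return fmt.Sprintf(
		"wss://%s.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=%s&format=detailed",
		a.region, a.language,
	)
}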
Best Practices
1. Use Phrase Lists
phraseList := speechsdk.NewPhraseListGrammar(recognizer)
phraseList.AddPhrase("Edesy")
phraseList.AddPhrase("voice agent")
phraseList.AddPhrase("STT provider")
2. Handle Connection Timeouts
func (a *AzureSTT) maintainConnection(ctx context.Context) {
// Azure connections time out after 10 minutes, so refresh before that happens
ticker := time.NewTicker(9 * time.Minute)
defer ticker.Stop()
for {
select {
case <-ticker.C:
a.reconnect()
case <-ctx.Done():
return
}
}
}
3. Regional Endpoints
Use the nearest region for lowest latency:
| Region | Endpoint | Best For |
|---|---|---|
| East US | eastus.stt.speech.microsoft.com | US East Coast |
| West US 2 | westus2.stt.speech.microsoft.com | US West Coast |
| Central India | centralindia.stt.speech.microsoft.com | India |
| UK South | uksouth.stt.speech.microsoft.com | UK/Europe |
| Southeast Asia | southeastasia.stt.speech.microsoft.com | APAC |
Next Steps
- ElevenLabs Scribe - Regional languages
- Custom Models - Train your own
- Enterprise Setup - Compliance guide