Azure Speech STT
Azure Speech Services provides enterprise-grade STT with support for 100+ languages, custom model training, and compliance certifications.
Why Azure?
| Feature | Azure Speech | Google Chirp |
|---|---|---|
| Languages | 100+ | 125+ |
| Custom Models | ✅ Yes | Limited |
| On-premises | ✅ Containers | ❌ No |
| Compliance | SOC 2, HIPAA, GDPR | SOC 2, HIPAA |
| Enterprise SLA | 99.9% | 99.9% |
| Cost | $0.016/min | $0.016/min |
Best for: Enterprise deployments, regulated industries, custom vocabulary needs.
Configuration
Basic Setup
{
"agent": {
"name": "Enterprise Support",
"sttProvider": "azure",
"sttConfig": {
"region": "eastus",
"language": "en-US"
}
}
}
Environment Variables
AZURE_SPEECH_API_KEY=your_azure_speech_key
AZURE_SPEECH_REGION=eastus
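A minimal sketch of reading these variables at startup. It assumes the AzureSTT and TranscriptEvent types defined under Implementation below; newFromEnv is a hypothetical helper, not part of any SDK:
func newFromEnv() (*AzureSTT, error) {
	apiKey := os.Getenv("AZURE_SPEECH_API_KEY")
	region := os.Getenv("AZURE_SPEECH_REGION")
	if apiKey == "" || region == "" {
		return nil, fmt.Errorf("AZURE_SPEECH_API_KEY and AZURE_SPEECH_REGION must be set")
	}
	return &AzureSTT{
		apiKey:    apiKey,
		region:    region,
		language:  "en-US",
		eventChan: make(chan TranscriptEvent, 16),
	}, nil
}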
Advanced Configuration
{
"sttProvider": "azure",
"sttConfig": {
"region": "eastus",
"language": "en-US",
"outputFormat": "detailed",
"profanityOption": "masked",
"enableDictation": false,
"enableInterimResults": true,
"endpointId": "custom-endpoint-id",
"initialSilenceTimeout": 5000,
"endSilenceTimeout": 1000
}
}
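On the Go side, these options can be unmarshalled into a struct whose tags mirror the JSON keys. This is a sketch based only on the fields shown above; the timeout values are in milliseconds:
// Sketch: Go mapping of the sttConfig block above.
type STTConfig struct {
	Region                string `json:"region"`
	Language              string `json:"language"`
	OutputFormat          string `json:"outputFormat"`          // "simple" or "detailed"
	ProfanityOption       string `json:"profanityOption"`       // "raw", "masked", or "removed"
	EnableDictation       bool   `json:"enableDictation"`
	EnableInterimResults  bool   `json:"enableInterimResults"`
	EndpointID            string `json:"endpointId"`            // optional Custom Speech endpoint
	InitialSilenceTimeout int    `json:"initialSilenceTimeout"` // ms
	EndSilenceTimeout     int    `json:"endSilenceTimeout"`     // ms
}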
Implementation
WebSocket Connection
type AzureSTT struct {
	apiKey     string
	region     string
	language   string
	conn       *websocket.Conn
	eventChan  chan TranscriptEvent
	headerSent bool // set once the WAV header has been written to the stream
}
func (a *AzureSTT) Connect(ctx context.Context) error {
// Build WebSocket URL
wsURL := fmt.Sprintf(
"wss://%s.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=%s&format=detailed",
a.region,
a.language,
)
headers := http.Header{}
headers.Set("Ocp-Apim-Subscription-Key", a.apiKey)
conn, _, err := websocket.DefaultDialer.DialContext(ctx, wsURL, headers)
if err != nil {
return fmt.Errorf("azure connect: %w", err)
}
a.conn = conn
go a.receiveLoop()
return nil
}
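Usage sketch, assuming the hypothetical newFromEnv helper from the Environment Variables section: connect, then drain transcript events from eventChan.
func runTranscription(ctx context.Context) error {
	stt, err := newFromEnv()
	if err != nil {
		return err
	}
	if err := stt.Connect(ctx); err != nil {
		return err
	}
	// Drain transcript events as they arrive
	for ev := range stt.eventChan {
		if ev.IsFinal {
			fmt.Println("final transcript:", ev.Text)
		}
	}
	return nil
}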
Audio Streaming
func (a *AzureSTT) SendAudio(audio []byte) error {
	// The WAV header describes the whole stream, so send it only once,
	// ahead of the first chunk (not before every chunk). For a live stream
	// the total data length is unknown; the first chunk's length is used
	// here as a placeholder.
	if !a.headerSent {
		header := createAudioHeader(len(audio))
		if err := a.conn.WriteMessage(websocket.BinaryMessage, header); err != nil {
			return err
		}
		a.headerSent = true
	}
	// Send the raw PCM audio data
	return a.conn.WriteMessage(websocket.BinaryMessage, audio)
}
func createAudioHeader(audioLength int) []byte {
// RIFF header for raw PCM
header := make([]byte, 44)
copy(header[0:4], []byte("RIFF"))
binary.LittleEndian.PutUint32(header[4:8], uint32(audioLength+36))
copy(header[8:12], []byte("WAVE"))
copy(header[12:16], []byte("fmt "))
binary.LittleEndian.PutUint32(header[16:20], 16)
binary.LittleEndian.PutUint16(header[20:22], 1) // PCM
binary.LittleEndian.PutUint16(header[22:24], 1) // Mono
binary.LittleEndian.PutUint32(header[24:28], 8000) // Sample rate
binary.LittleEndian.PutUint32(header[28:32], 16000) // Byte rate
binary.LittleEndian.PutUint16(header[32:34], 2) // Block align
binary.LittleEndian.PutUint16(header[34:36], 16) // Bits per sample
copy(header[36:40], []byte("data"))
binary.LittleEndian.PutUint32(header[40:44], uint32(audioLength))
return header
}
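Usage sketch: for 8 kHz, 16-bit mono PCM (the format the header above describes), a ~100 ms chunk is 1,600 bytes (8,000 samples/s × 2 bytes × 0.1 s). The pcm buffer is a hypothetical slice of raw audio:
const chunkSize = 1600 // ~100 ms of 8 kHz, 16-bit mono PCM
for offset := 0; offset < len(pcm); offset += chunkSize {
	end := offset + chunkSize
	if end > len(pcm) {
		end = len(pcm)
	}
	if err := stt.SendAudio(pcm[offset:end]); err != nil {
		return err
	}
}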
Receiving Results
func (a *AzureSTT) receiveLoop() {
	for {
		_, msg, err := a.conn.ReadMessage()
		if err != nil {
			return
		}
		var response AzureResponse
		if err := json.Unmarshal(msg, &response); err != nil {
			continue // skip messages that are not JSON results
		}
		switch response.RecognitionStatus {
		case "Success":
			confidence := 0.0
			if len(response.NBest) > 0 {
				confidence = response.NBest[0].Confidence
			}
			a.eventChan <- TranscriptEvent{
				Text:       response.DisplayText,
				Confidence: confidence,
				IsFinal:    true,
			}
		case "IntermediateResult":
			a.eventChan <- TranscriptEvent{
				Text:    response.Text,
				IsFinal: false,
			}
		case "EndOfDictation":
			// Session ended; no further results will arrive
		}
	}
}
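The receive loop assumes these two types. The AzureResponse fields follow the detailed output format (DisplayText plus an NBest list); treat the exact shape as an assumption and verify it against the service's actual JSON:
// Event emitted to the rest of the voice agent.
type TranscriptEvent struct {
	Text       string
	Confidence float64
	IsFinal    bool
}

// Subset of the recognition result fields used above.
type AzureResponse struct {
	RecognitionStatus string
	DisplayText       string // final recognized text ("detailed" format)
	Text              string // interim hypothesis text
	ErrorDetails      string
	NBest             []struct {
		Confidence float64
		Lexical    string
		Display    string
	}
}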
Custom Speech Models
Train models for domain-specific vocabulary:
Create Custom Model
func createCustomModel(projectID string, trainingData []TrainingItem) error {
	client := NewAzureCustomSpeechClient(apiKey, region)
	// Upload training data
	dataset, err := client.CreateDataset(DatasetParams{
		Name:        "domain-vocabulary",
		Description: "Custom terms for voice agent",
		Locale:      "en-US",
		Kind:        "Acoustic",
	})
	if err != nil {
		return fmt.Errorf("create dataset: %w", err)
	}
	// Train a model on the uploaded dataset
	_, err = client.CreateModel(ModelParams{
		Name:      "custom-voice-agent-model",
		BaseModel: "en-US-base",
		Datasets:  []string{dataset.ID},
	})
	if err != nil {
		return fmt.Errorf("create model: %w", err)
	}
	return nil
}
Use Custom Endpoint
{
"sttConfig": {
"endpointId": "your-custom-endpoint-id"
}
}
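To make the WebSocket implementation above honor endpointId, the URL builder in Connect has to target the custom endpoint. A hedged sketch: the cid query parameter and the endpointID field are assumptions here, so confirm the exact parameter name against the Custom Speech documentation.
// Inside Connect: append the custom endpoint ID when one is configured.
// a.endpointID is an assumed field populated from sttConfig.endpointId;
// endpoint IDs are GUIDs, so no escaping is needed.
if a.endpointID != "" {
	wsURL += "&cid=" + a.endpointID
}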
Language Support
Supported Languages (100+)
| Language | Code | Quality |
|---|---|---|
| English (US) | en-US | ⭐⭐⭐⭐⭐ |
| English (UK) | en-GB | ⭐⭐⭐⭐⭐ |
| English (India) | en-IN | ⭐⭐⭐⭐⭐ |
| Hindi | hi-IN | ⭐⭐⭐⭐ |
| Spanish | es-ES | ⭐⭐⭐⭐⭐ |
| French | fr-FR | ⭐⭐⭐⭐⭐ |
| German | de-DE | ⭐⭐⭐⭐⭐ |
| Japanese | ja-JP | ⭐⭐⭐⭐⭐ |
| Chinese | zh-CN | ⭐⭐⭐⭐⭐ |
| Arabic | ar-SA | ⭐⭐⭐⭐ |
Multi-Language Recognition
{
"sttConfig": {
"language": "en-US",
"additionalLanguages": ["es-ES", "fr-FR"],
"languageIdMode": "Continuous"
}
}
Silence Detection
Configure endpointing behavior:
type SilenceConfig struct {
InitialSilenceTimeout time.Duration // Max wait for speech start
EndSilenceTimeout time.Duration // Silence to end utterance
SegmentationSilence time.Duration // Silence between segments
}
// Conservative settings (don't cut off)
conservative := SilenceConfig{
InitialSilenceTimeout: 10 * time.Second,
EndSilenceTimeout: 2 * time.Second,
SegmentationSilence: 1 * time.Second,
}
// Aggressive settings (faster response)
aggressive := SilenceConfig{
InitialSilenceTimeout: 5 * time.Second,
EndSilenceTimeout: 500 * time.Millisecond,
SegmentationSilence: 300 * time.Millisecond,
}
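These durations map onto the millisecond keys shown in Advanced Configuration. A small helper sketch (segmentation silence has no corresponding key there, so it is left out):
// Convert a SilenceConfig into the millisecond values used by sttConfig.
func (c SilenceConfig) toSTTConfigValues() map[string]int {
	return map[string]int{
		"initialSilenceTimeout": int(c.InitialSilenceTimeout / time.Millisecond),
		"endSilenceTimeout":     int(c.EndSilenceTimeout / time.Millisecond),
	}
}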
Profanity Handling
{
"sttConfig": {
"profanityOption": "masked" // "raw", "masked", "removed"
}
}
| Option | Result |
|---|---|
| raw | Full text: "What the hell" |
| masked | Censored: "What the ****" |
| removed | Filtered: "What the" |
Error Handling
func (a *AzureSTT) handleError(response AzureResponse) {
switch response.RecognitionStatus {
case "NoMatch":
log.Debug("No speech detected")
case "InitialSilenceTimeout":
log.Debug("User didn't speak in time")
case "BabbleTimeout":
log.Warn("Too much background noise")
case "Error":
log.Error("Recognition error: %s", response.ErrorDetails)
a.reconnect()
}
}
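The reconnect referenced above is not shown elsewhere in this section; a minimal sketch with exponential backoff (retry count and delays are illustrative) might look like this:
// Close the old socket and retry Connect with exponential backoff.
func (a *AzureSTT) reconnect() {
	if a.conn != nil {
		a.conn.Close()
	}
	backoff := time.Second
	for attempt := 0; attempt < 5; attempt++ {
		if err := a.Connect(context.Background()); err == nil {
			return
		}
		time.Sleep(backoff)
		backoff *= 2 // double the delay between attempts
	}
	log.Error("azure stt: reconnect failed after 5 attempts")
}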
On-Premises Deployment
Deploy Azure Speech containers for data residency:
# docker-compose.yml
version: '3'
services:
speech-to-text:
image: mcr.microsoft.com/azure-cognitive-services/speechservices/speech-to-text:latest
ports:
- "5000:5000"
environment:
- Eula=accept
- Billing=https://eastus.api.cognitive.microsoft.com/
- ApiKey=${AZURE_SPEECH_API_KEY}
volumes:
- ./models:/models
// Connect to local container
config := SpeechConfig{
Endpoint: "ws://localhost:5000",
Language: "en-US",
}
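To point the same WebSocket client at the container, the URL builder needs to prefer an explicit endpoint over the cloud region. A sketch; the endpoint field is an assumption for this example, and the container is assumed to expose the same recognition path as the cloud service:
// Prefer an explicitly configured endpoint (e.g. the local container)
// over the regional cloud URL.
func (a *AzureSTT) dialURL() string {
	if a.endpoint != "" {
		return a.endpoint + "/speech/recognition/conversation/cognitiveservices/v1?language=" + a.language
	}
	return fmt.Sprintf(
		"wss://%s.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=%s&format=detailed",
		a.region, a.language,
	)
}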
Best Practices
1. Use Phrase Lists
phraseList := speechsdk.NewPhraseListGrammar(recognizer)
phraseList.AddPhrase("Edesy")
phraseList.AddPhrase("voice agent")
phraseList.AddPhrase("STT provider")
2. Handle Connection Timeouts
func (a *AzureSTT) maintainConnection(ctx context.Context) {
// Azure connections time out after 10 minutes, so refresh before that happens
ticker := time.NewTicker(9 * time.Minute)
defer ticker.Stop()
for {
select {
case <-ticker.C:
a.reconnect()
case <-ctx.Done():
return
}
}
}
3. Regional Endpoints
Use the nearest region for lowest latency:
| Region | Endpoint | Best For |
|---|---|---|
| East US | eastus.stt.speech.microsoft.com | US East Coast |
| West US 2 | westus2.stt.speech.microsoft.com | US West Coast |
| Central India | centralindia.stt.speech.microsoft.com | India |
| UK South | uksouth.stt.speech.microsoft.com | UK/Europe |
| Southeast Asia | southeastasia.stt.speech.microsoft.com | APAC |
Next Steps
- ElevenLabs Scribe - Regional languages
- Custom Models - Train your own
- Enterprise Setup - Compliance guide