Google Chirp STT

Google Cloud Speech-to-Text with Chirp 2 provides excellent accuracy for 125+ languages, with particular strength in Indic languages.

Why Google Chirp?

Feature	Google Chirp 2	Deepgram Nova-3
Languages	125+	35+
Indic Language Accuracy	⭐⭐⭐⭐⭐	⭐⭐⭐
Time to First Partial	~120ms	~80ms
Streaming	Full support	Full support
Cost	$0.016/min	$0.0043/min

Best for: Hindi, Tamil, Telugu, Bengali, and other Indic languages.

Configuration

Basic Setup

{
  "agent": {
    "name": "Hindi Support",
    "language": "hi-IN",
    "sttProvider": "google",
    "sttModel": "chirp_2"
  }
}

Environment Variables

GOOGLE_CREDENTIALS_PATH=/path/to/service-account.json
# Or
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
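
As a rough illustration of how an application might choose between the two variables above, here is a minimal sketch. The precedence order (GOOGLE_CREDENTIALS_PATH first) and the helper name `resolveCredentialsPath` are assumptions; the text above only lists both names.

```go
package main

import (
    "errors"
    "fmt"
    "os"
)

// resolveCredentialsPath returns the first non-empty credentials path.
// Assumption: GOOGLE_CREDENTIALS_PATH wins over GOOGLE_APPLICATION_CREDENTIALS.
// getenv is injected so the lookup can be exercised without touching the
// real environment.
func resolveCredentialsPath(getenv func(string) string) (string, error) {
    names := []string{"GOOGLE_CREDENTIALS_PATH", "GOOGLE_APPLICATION_CREDENTIALS"}
    for _, name := range names {
        if path := getenv(name); path != "" {
            return path, nil
        }
    }
    return "", errors.New("no Google credentials path is set")
}

func main() {
    path, err := resolveCredentialsPath(os.Getenv)
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Println("using credentials at", path)
}
```

Failing fast at startup when neither variable is set tends to be easier to debug than an authentication error on the first call.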

Advanced Configuration

{
  "sttProvider": "google",
  "sttModel": "chirp_2",
  "sttConfig": {
    "languageCode": "hi-IN",
    "alternativeLanguageCodes": ["en-IN"],
    "enableAutomaticPunctuation": true,
    "enableSpokenPunctuation": false,
    "enableSpokenEmojis": false,
    "model": "chirp_2",
    "useEnhanced": true,
    "singleUtterance": false,
    "interimResults": true
  }
}
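
The JSON above could be decoded and sanity-checked along these lines. This is a sketch only: the struct names and the validation rules (provider must be "google", the two model fields must agree) are assumptions and do not describe the platform's actual loader.

```go
package main

import (
    "encoding/json"
    "fmt"
)

// STTConfig mirrors the "sttConfig" object shown above.
type STTConfig struct {
    LanguageCode               string   `json:"languageCode"`
    AlternativeLanguageCodes   []string `json:"alternativeLanguageCodes"`
    EnableAutomaticPunctuation bool     `json:"enableAutomaticPunctuation"`
    EnableSpokenPunctuation    bool     `json:"enableSpokenPunctuation"`
    EnableSpokenEmojis         bool     `json:"enableSpokenEmojis"`
    Model                      string   `json:"model"`
    UseEnhanced                bool     `json:"useEnhanced"`
    SingleUtterance            bool     `json:"singleUtterance"`
    InterimResults             bool     `json:"interimResults"`
}

// ProviderConfig mirrors the top-level keys of the advanced example.
type ProviderConfig struct {
    STTProvider string    `json:"sttProvider"`
    STTModel    string    `json:"sttModel"`
    STTConfig   STTConfig `json:"sttConfig"`
}

// parseProviderConfig decodes the JSON and applies hypothetical checks.
func parseProviderConfig(raw []byte) (ProviderConfig, error) {
    var cfg ProviderConfig
    if err := json.Unmarshal(raw, &cfg); err != nil {
        return cfg, fmt.Errorf("decode config: %w", err)
    }
    if cfg.STTProvider != "google" {
        return cfg, fmt.Errorf("unexpected sttProvider %q", cfg.STTProvider)
    }
    if cfg.STTConfig.LanguageCode == "" {
        return cfg, fmt.Errorf("sttConfig.languageCode is required")
    }
    // The model is named twice in the example; reject configs where they differ.
    if m := cfg.STTConfig.Model; m != "" && m != cfg.STTModel {
        return cfg, fmt.Errorf("sttModel %q and sttConfig.model %q disagree", cfg.STTModel, m)
    }
    return cfg, nil
}

func main() {
    raw := []byte(`{"sttProvider":"google","sttModel":"chirp_2",
        "sttConfig":{"languageCode":"hi-IN","alternativeLanguageCodes":["en-IN"],"model":"chirp_2"}}`)
    cfg, err := parseProviderConfig(raw)
    fmt.Println(cfg.STTConfig.LanguageCode, cfg.STTConfig.AlternativeLanguageCodes, err)
}
```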

Model Comparison

Model	Accuracy	Latency	Languages	Use Case
chirp_2	⭐⭐⭐⭐⭐	Fast	125+	Indic languages
chirp	⭐⭐⭐⭐	Fast	100+	General multilingual
latest_long	⭐⭐⭐⭐	Moderate	125+	Long-form audio
latest_short	⭐⭐⭐⭐	Fast	125+	Short utterances
telephony	⭐⭐⭐	Fast	50+	Phone audio quality
command_and_search	⭐⭐⭐	Fastest	50+	Commands only

Implementation

Streaming Recognition

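// Import paths are assumed (the original listing omits them) and follow the
// v1 client. Google documents the Chirp models under the Speech-to-Text V2
// API, so confirm that your client version accepts Model: "chirp_2".
import (
    "context"
    "io"
    "log"

    speech "cloud.google.com/go/speech/apiv1"
    "cloud.google.com/go/speech/apiv1/speechpb"
)

// GoogleSTT holds a Speech-to-Text client and the recognition settings
// reused for every streaming session.
type GoogleSTT struct {
    client   *speech.Client
    config   *speechpb.RecognitionConfig
    language string
}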

func NewGoogleSTT(language string) (*GoogleSTT, error) {
    ctx := context.Background()
    client, err := speech.NewClient(ctx)
    if err != nil {
        return nil, err
    }

    config := &speechpb.RecognitionConfig{
        Encoding:                   speechpb.RecognitionConfig_LINEAR16,
        SampleRateHertz:            8000,
        LanguageCode:               language,
        Model:                      "chirp_2",
        UseEnhanced:                true,
        EnableAutomaticPunctuation: true,
    }

    return &GoogleSTT{
        client:   client,
        config:   config,
        language: language,
    }, nil
}

func (g *GoogleSTT) StreamRecognize(ctx context.Context) (*StreamSession, error) {
    stream, err := g.client.StreamingRecognize(ctx)
    if err != nil {
        return nil, err
    }

    // Send initial config
    streamingConfig := &speechpb.StreamingRecognitionConfig{
        Config:          g.config,
        InterimResults:  true,
        SingleUtterance: false,
    }

    if err := stream.Send(&speechpb.StreamingRecognizeRequest{
        StreamingRequest: &speechpb.StreamingRecognizeRequest_StreamingConfig{
            StreamingConfig: streamingConfig,
        },
    }); err != nil {
        return nil, err
    }

    return &StreamSession{stream: stream}, nil
}

Sending Audio

func (s *StreamSession) SendAudio(audio []byte) error {
    return s.stream.Send(&speechpb.StreamingRecognizeRequest{
        StreamingRequest: &speechpb.StreamingRecognizeRequest_AudioContent{
            AudioContent: audio,
        },
    })
}
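
Audio is normally sent in small frames rather than as one large buffer. The sketch below splits raw PCM into fixed-size frames that could each be handed to `SendAudio`. The 100 ms frame length is an assumption (a commonly suggested size for streaming recognition); the 8 kHz, 16-bit mono format matches the configuration shown earlier.

```go
package main

import "fmt"

const (
    sampleRateHz   = 8000 // matches SampleRateHertz in the config above
    bytesPerSample = 2    // LINEAR16 is 16-bit PCM
    frameMillis    = 100  // assumed frame length
)

// frameSize returns the number of bytes in one mono frame.
func frameSize() int {
    return sampleRateHz * bytesPerSample * frameMillis / 1000
}

// splitFrames cuts pcm into consecutive frames of at most size bytes.
// The final frame may be shorter. Frames share memory with pcm.
func splitFrames(pcm []byte, size int) [][]byte {
    if size <= 0 {
        return nil
    }
    var frames [][]byte
    for start := 0; start < len(pcm); start += size {
        end := start + size
        if end > len(pcm) {
            end = len(pcm)
        }
        frames = append(frames, pcm[start:end])
    }
    return frames
}

func main() {
    pcm := make([]byte, 4000) // 250 ms of silence at 8 kHz, 16-bit mono
    for i, frame := range splitFrames(pcm, frameSize()) {
        // In the real pipeline each frame would go to session.SendAudio(frame).
        fmt.Printf("frame %d: %d bytes\n", i, len(frame))
    }
}
```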

Receiving Results

func (s *StreamSession) ReceiveResults() <-chan TranscriptEvent {
    results := make(chan TranscriptEvent)

    go func() {
        defer close(results)

        for {
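            resp, err := s.stream.Recv()
            if err == io.EOF {
                return // server closed the stream cleanly
            }
            if err != nil {
                // Log the failure so a dropped stream is not silent.
                log.Printf("stt stream receive failed: %v", err)
                return
            }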

            for _, result := range resp.Results {
                if len(result.Alternatives) == 0 {
                    continue
                }

                alt := result.Alternatives[0]
                results <- TranscriptEvent{
                    Text:       alt.Transcript,
                    Confidence: alt.Confidence,
                    IsFinal:    result.IsFinal,
                    Stability:  result.Stability,
                }
            }
        }
    }()

    return results
}
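
Because interim results are replaced by later ones while final results are permanent, a consumer usually keeps the two apart. Here is a minimal sketch of that bookkeeping. The `Transcript` type is hypothetical, and `TranscriptEvent` is redeclared locally with field types assumed from the code above.

```go
package main

import (
    "fmt"
    "strings"
)

// TranscriptEvent is redeclared here so the example is self-contained;
// the field types are assumptions.
type TranscriptEvent struct {
    Text       string
    Confidence float32
    IsFinal    bool
    Stability  float32
}

// Transcript accumulates final segments and tracks the latest interim text.
type Transcript struct {
    finals  []string
    interim string
}

// Apply commits final events and overwrites the pending interim text otherwise.
func (tr *Transcript) Apply(ev TranscriptEvent) {
    text := strings.TrimSpace(ev.Text)
    if ev.IsFinal {
        tr.finals = append(tr.finals, text)
        tr.interim = ""
        return
    }
    tr.interim = text
}

// String returns the committed text followed by any pending interim text.
func (tr *Transcript) String() string {
    parts := append([]string{}, tr.finals...)
    if tr.interim != "" {
        parts = append(parts, tr.interim)
    }
    return strings.Join(parts, " ")
}

func main() {
    // Made-up events standing in for values read from ReceiveResults().
    events := []TranscriptEvent{
        {Text: "hello", Stability: 0.4},
        {Text: "hello there", Confidence: 0.93, IsFinal: true},
        {Text: " how can", Stability: 0.6},
    }
    var tr Transcript
    for _, ev := range events {
        tr.Apply(ev)
        fmt.Println(tr.String())
    }
}
```

In a live session the loop would range over the channel returned by `ReceiveResults` instead of a slice.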

Language Support

Indic Languages (Excellent)

Language	Code	Accuracy	Notes
Hindi	hi-IN	⭐⭐⭐⭐⭐	Best-in-class
Bengali	bn-IN	⭐⭐⭐⭐⭐	Excellent
Tamil	ta-IN	⭐⭐⭐⭐⭐	Excellent
Telugu	te-IN	⭐⭐⭐⭐⭐	Excellent
Marathi	mr-IN	⭐⭐⭐⭐	Very good
Gujarati	gu-IN	⭐⭐⭐⭐	Very good
Kannada	kn-IN	⭐⭐⭐⭐	Very good
Malayalam	ml-IN	⭐⭐⭐⭐	Very good
Punjabi	pa-IN	⭐⭐⭐⭐	Very good
Odia	or-IN	⭐⭐⭐	Good
Assamese	as-IN	⭐⭐⭐	Good
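
One way to act on this table is to route the listed codes to `chirp_2` and everything else to a model chosen by the caller. The sketch below assumes exactly that policy; `pickModel` is a hypothetical helper, not part of any SDK.

```go
package main

import "fmt"

// indicLanguages holds the language codes from the table above.
var indicLanguages = map[string]bool{
    "hi-IN": true, "bn-IN": true, "ta-IN": true, "te-IN": true,
    "mr-IN": true, "gu-IN": true, "kn-IN": true, "ml-IN": true,
    "pa-IN": true, "or-IN": true, "as-IN": true,
}

// pickModel returns "chirp_2" for the listed Indic codes and the
// caller-supplied fallback for any other language.
func pickModel(languageCode, fallback string) string {
    if indicLanguages[languageCode] {
        return "chirp_2"
    }
    return fallback
}

func main() {
    fmt.Println(pickModel("ta-IN", "latest_short"))
    fmt.Println(pickModel("en-US", "latest_short"))
}
```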

Multi-Language Detection

{
  "sttConfig": {
    "languageCode": "hi-IN",
    "alternativeLanguageCodes": ["en-IN", "mr-IN"],
    "enableLanguageIdentification": true
  }
}

Speech Adaptation

Improve accuracy for domain-specific terms:

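// SpeechAdaptation_AdaptationPhraseSet is a V2 API type
// (cloud.google.com/go/speech/apiv2/speechpb); the v1 package models
// phrase sets differently.
config := &speechpb.RecognitionConfig{
    // ... base config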
    Adaptation: &speechpb.SpeechAdaptation{
        PhraseSets: []*speechpb.SpeechAdaptation_AdaptationPhraseSet{
            {
                Value: &speechpb.SpeechAdaptation_AdaptationPhraseSet_InlinePhraseSet{
                    InlinePhraseSet: &speechpb.PhraseSet{
                        Phrases: []*speechpb.PhraseSet_Phrase{
                            {Value: "Edesy", Boost: 20},
                            {Value: "voice agent", Boost: 15},
                            {Value: "STT", Boost: 10},
                        },
                    },
                },
            },
        },
    },
}

Endpointing Configuration

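// StreamingRecognitionFeatures and VoiceActivityTimeout are V2 API types
// (cloud.google.com/go/speech/apiv2/speechpb); durationpb comes from
// google.golang.org/protobuf/types/known/durationpb.
streamingConfig := &speechpb.StreamingRecognitionConfig{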
    Config: config,
    StreamingFeatures: &speechpb.StreamingRecognitionFeatures{
        InterimResults: true,
        VoiceActivityTimeout: &speechpb.StreamingRecognitionFeatures_VoiceActivityTimeout{
            SpeechStartTimeout:  durationpb.New(5 * time.Second),
            SpeechEndTimeout:    durationpb.New(1 * time.Second),
        },
    },
}

Error Handling

func (g *GoogleSTT) handleError(err error) {
    // Named st so the variable does not shadow the status package.
    st, ok := status.FromError(err)
    if !ok {
        log.Printf("Unknown error: %v", err)
        return
    }

    switch st.Code() {
    case codes.InvalidArgument:
        log.Printf("Invalid audio format or config")
    case codes.ResourceExhausted:
        log.Printf("Quota exceeded, implement backoff")
    case codes.Unavailable:
        log.Printf("Service unavailable, reconnecting...")
        g.reconnect()
    case codes.DeadlineExceeded:
        log.Printf("Request timeout")
    default:
        log.Printf("Unhandled gRPC error %s: %v", st.Code(), st.Message())
    }
}
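
The quota case above only logs a reminder to back off. A minimal sketch of exponential backoff follows, using only the standard library. The function names, the doubling factor, and the delay cap are assumptions, and jitter is left out for brevity.

```go
package main

import (
    "context"
    "errors"
    "fmt"
    "time"
)

// backoffDelay returns base doubled once per attempt, capped at maxDelay.
// attempt 0 yields base itself.
func backoffDelay(attempt int, base, maxDelay time.Duration) time.Duration {
    d := base
    for i := 0; i < attempt; i++ {
        d *= 2
        if d >= maxDelay {
            return maxDelay
        }
    }
    if d > maxDelay {
        return maxDelay
    }
    return d
}

// retry calls op up to attempts times, sleeping between failures and
// stopping early if ctx is cancelled. It returns the last error seen.
func retry(ctx context.Context, attempts int, base, maxDelay time.Duration, op func() error) error {
    var err error
    for i := 0; i < attempts; i++ {
        if err = op(); err == nil {
            return nil
        }
        if i == attempts-1 {
            break // no point sleeping after the last attempt
        }
        select {
        case <-time.After(backoffDelay(i, base, maxDelay)):
        case <-ctx.Done():
            return ctx.Err()
        }
    }
    return err
}

func main() {
    calls := 0
    err := retry(context.Background(), 4, time.Millisecond, 5*time.Millisecond, func() error {
        calls++
        if calls < 3 {
            return errors.New("quota exceeded") // simulated failure
        }
        return nil
    })
    fmt.Println("calls:", calls, "err:", err)
}
```

In `handleError`, the `codes.ResourceExhausted` branch could wrap the reconnect or resend step in a helper like `retry`. Adding random jitter to each delay is a common refinement when many sessions fail at once.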

Cost Optimization

Pricing (per minute)

Model	Without data logging	With data logging
Chirp 2	$0.016	$0.012
Enhanced	$0.024	$0.018
Standard	$0.006	$0.004

Optimization Tips

Use appropriate model: Chirp 2 for Indic, standard for English
Enable data logging: 25% cheaper for Chirp 2 and Enhanced, about 33% cheaper for Standard (per the pricing table)
Batch short utterances: Minimum billing is 15 seconds

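// Track usage for cost monitoring.
// metrics is assumed to be an application-level package, not part of the Google SDK.
func (g *GoogleSTT) trackUsage(audioLength time.Duration) {
    // Round partial seconds up so the estimate does not undercount,
    // then apply the 15-second minimum. The built-in max needs Go 1.21+.
    billedSeconds := max(15, int(math.Ceil(audioLength.Seconds())))

    metrics.RecordCounter("stt.google.billed_seconds", int64(billedSeconds))
    // 0.016 is the Chirp 2 per-minute rate from the pricing table.
    metrics.RecordCounter("stt.google.cost_usd", float64(billedSeconds)/60*0.016)
}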
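
To make the batching tip concrete, the sketch below compares four 3-second utterances billed separately with the same audio sent as one 12-second request. It assumes the 15-second minimum applies per request and uses the Chirp 2 rate from the pricing table; the function names are illustrative.

```go
package main

import (
    "fmt"
    "math"
)

const (
    minBilledSeconds   = 15    // minimum stated in the tips above, assumed per request
    chirp2USDPerMinute = 0.016 // Chirp 2 rate from the pricing table
)

// billedSeconds rounds up to whole seconds and applies the minimum.
func billedSeconds(audioSeconds float64) int {
    s := int(math.Ceil(audioSeconds))
    if s < minBilledSeconds {
        return minBilledSeconds
    }
    return s
}

// costUSD sums the billed seconds of every request and prices the total.
func costUSD(requestSeconds []float64) float64 {
    total := 0
    for _, s := range requestSeconds {
        total += billedSeconds(s)
    }
    return float64(total) / 60 * chirp2USDPerMinute
}

func main() {
    separate := costUSD([]float64{3, 3, 3, 3}) // 4 requests x 15 billed seconds
    batched := costUSD([]float64{12})          // 1 request, 15 billed seconds
    fmt.Printf("separate: $%.4f, batched: $%.4f\n", separate, batched)
}
```

Under these assumptions the separate requests bill 60 seconds ($0.016) while the batched request bills 15 seconds ($0.004), a fourfold difference for the same audio.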

Best Practices

1. Handle Streaming Limits

Google limits streaming sessions to 5 minutes:

func (g *GoogleSTT) maintainSession(ctx context.Context) {
    ticker := time.NewTicker(4 * time.Minute)
    defer ticker.Stop()

    for {
        select {
        case <-ticker.C:
            // Reconnect before 5-minute limit
            g.reconnect()
        case <-ctx.Done():
            return
        }
    }
}

2. Use Single Utterance for Short Commands

{
  "sttConfig": {
    "singleUtterance": true
  }
}

3. Enable Enhanced Model for Telephony

{
  "sttConfig": {
    "model": "telephony",
    "useEnhanced": true
  }
}

Troubleshooting

Issue	Cause	Solution
No results	Wrong audio format	Verify LINEAR16, 8kHz mono
Low accuracy	Wrong model	Use chirp_2 for Indic
Session timeout	5-minute limit	Implement auto-reconnect
High latency	Network issues	Use regional endpoint
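
For the "Wrong audio format" row, a quick check of a WAV file's header can confirm the format before any audio is streamed. This sketch assumes a canonical 44-byte PCM header with the fmt chunk directly after the RIFF header; files with extra chunks need a fuller parser. `buildWAVHeader` exists only to generate sample input.

```go
package main

import (
    "encoding/binary"
    "errors"
    "fmt"
)

// wavFormat holds the header fields relevant to the check.
type wavFormat struct {
    AudioFormat   uint16 // 1 means uncompressed PCM
    Channels      uint16
    SampleRate    uint32
    BitsPerSample uint16
}

// parseWAVHeader reads the format fields from a canonical RIFF/WAVE header.
func parseWAVHeader(h []byte) (wavFormat, error) {
    if len(h) < 36 || string(h[0:4]) != "RIFF" ||
        string(h[8:12]) != "WAVE" || string(h[12:16]) != "fmt " {
        return wavFormat{}, errors.New("not a canonical RIFF/WAVE header")
    }
    return wavFormat{
        AudioFormat:   binary.LittleEndian.Uint16(h[20:22]),
        Channels:      binary.LittleEndian.Uint16(h[22:24]),
        SampleRate:    binary.LittleEndian.Uint32(h[24:28]),
        BitsPerSample: binary.LittleEndian.Uint16(h[34:36]),
    }, nil
}

// checkLinear16Mono8k reports the first mismatch with 16-bit, 8 kHz, mono PCM.
func checkLinear16Mono8k(f wavFormat) error {
    switch {
    case f.AudioFormat != 1:
        return fmt.Errorf("audio format %d is not PCM", f.AudioFormat)
    case f.BitsPerSample != 16:
        return fmt.Errorf("%d bits per sample, want 16", f.BitsPerSample)
    case f.Channels != 1:
        return fmt.Errorf("%d channels, want mono", f.Channels)
    case f.SampleRate != 8000:
        return fmt.Errorf("sample rate %d Hz, want 8000", f.SampleRate)
    }
    return nil
}

// buildWAVHeader creates a 44-byte PCM header for demonstration purposes.
func buildWAVHeader(sampleRate uint32, channels, bits uint16) []byte {
    h := make([]byte, 44)
    copy(h[0:4], "RIFF")
    copy(h[8:12], "WAVE")
    copy(h[12:16], "fmt ")
    blockAlign := channels * bits / 8
    binary.LittleEndian.PutUint32(h[16:20], 16) // fmt chunk size
    binary.LittleEndian.PutUint16(h[20:22], 1)  // PCM
    binary.LittleEndian.PutUint16(h[22:24], channels)
    binary.LittleEndian.PutUint32(h[24:28], sampleRate)
    binary.LittleEndian.PutUint32(h[28:32], sampleRate*uint32(blockAlign))
    binary.LittleEndian.PutUint16(h[32:34], blockAlign)
    binary.LittleEndian.PutUint16(h[34:36], bits)
    copy(h[36:40], "data")
    return h
}

func main() {
    for _, hdr := range [][]byte{buildWAVHeader(8000, 1, 16), buildWAVHeader(44100, 2, 16)} {
        f, err := parseWAVHeader(hdr)
        if err == nil {
            err = checkLinear16Mono8k(f)
        }
        fmt.Printf("%+v -> %v\n", f, err)
    }
}
```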

Next Steps

Azure Speech - Enterprise alternative
Deepgram - Lower latency option
Language Support - Full language matrix