Google Chirp STT
Google Cloud Speech-to-Text with Chirp 2 provides excellent accuracy for 125+ languages, with particular strength in Indic languages.
Why Google Chirp?
| Feature | Google Chirp 2 | Deepgram Nova-3 |
|---|---|---|
| Languages | 125+ | 35+ |
| Indic Language Accuracy | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| Time to First Partial | ~120ms | ~80ms |
| Streaming | Full support | Full support |
| Cost | $0.016/min | $0.0043/min |
Best for: Hindi, Tamil, Telugu, Bengali, and other Indic languages.
Configuration
Basic Setup
{
  "agent": {
    "name": "Hindi Support",
    "language": "hi-IN",
    "sttProvider": "google",
    "sttModel": "chirp_2"
  }
}
Environment Variables
GOOGLE_CREDENTIALS_PATH=/path/to/service-account.json
# Or
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
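As a rough illustration of how an application might choose between the two variables, the Go sketch below returns the first one that is set. The precedence order and the helper names (`credentialsPath`, `credentialsPathFromEnv`) are assumptions made for this example, not documented behavior of the product.

```go
package main

import (
	"errors"
	"os"
)

// credentialsPath returns the first non-empty credentials path.
// Assumption: GOOGLE_CREDENTIALS_PATH takes precedence over
// GOOGLE_APPLICATION_CREDENTIALS; the real lookup order may differ.
func credentialsPath(getenv func(string) string) (string, error) {
	keys := []string{"GOOGLE_CREDENTIALS_PATH", "GOOGLE_APPLICATION_CREDENTIALS"}
	for _, key := range keys {
		if p := getenv(key); p != "" {
			return p, nil
		}
	}
	return "", errors.New("no Google credentials path configured")
}

// credentialsPathFromEnv applies the same lookup to the process environment.
func credentialsPathFromEnv() (string, error) {
	return credentialsPath(os.Getenv)
}
```

Passing the lookup function as a parameter keeps the helper easy to test without touching the real environment.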
Advanced Configuration
{
  "sttProvider": "google",
  "sttModel": "chirp_2",
  "sttConfig": {
    "languageCode": "hi-IN",
    "alternativeLanguageCodes": ["en-IN"],
    "enableAutomaticPunctuation": true,
    "enableSpokenPunctuation": false,
    "enableSpokenEmojis": false,
    "model": "chirp_2",
    "useEnhanced": true,
    "singleUtterance": false,
    "interimResults": true
  }
}
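To make the shape of this configuration concrete, here is a minimal Go sketch that decodes it and applies two sanity checks. The struct names, the subset of fields, and the validation rules (a required `languageCode`, and `sttModel` agreeing with `sttConfig.model` when both are set) are assumptions for illustration only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// STTConfig mirrors a subset of the "sttConfig" keys shown above.
type STTConfig struct {
	LanguageCode               string   `json:"languageCode"`
	AlternativeLanguageCodes   []string `json:"alternativeLanguageCodes"`
	EnableAutomaticPunctuation bool     `json:"enableAutomaticPunctuation"`
	Model                      string   `json:"model"`
	UseEnhanced                bool     `json:"useEnhanced"`
	SingleUtterance            bool     `json:"singleUtterance"`
	InterimResults             bool     `json:"interimResults"`
}

// AgentSTT mirrors the top-level STT keys.
type AgentSTT struct {
	Provider string    `json:"sttProvider"`
	Model    string    `json:"sttModel"`
	Config   STTConfig `json:"sttConfig"`
}

// parseAgentSTT decodes the JSON and rejects obviously inconsistent input.
// The checks are hypothetical; the real service may validate differently.
func parseAgentSTT(raw []byte) (AgentSTT, error) {
	var a AgentSTT
	if err := json.Unmarshal(raw, &a); err != nil {
		return a, fmt.Errorf("parse stt config: %w", err)
	}
	if a.Config.LanguageCode == "" {
		return a, fmt.Errorf("sttConfig.languageCode is required")
	}
	if a.Model != "" && a.Config.Model != "" && a.Model != a.Config.Model {
		return a, fmt.Errorf("sttModel %q and sttConfig.model %q disagree", a.Model, a.Config.Model)
	}
	return a, nil
}
```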
Model Comparison
| Model | Accuracy | Latency | Languages | Use Case |
|---|---|---|---|---|
| chirp_2 | ⭐⭐⭐⭐⭐ | Fast | 125+ | Indic languages |
| chirp | ⭐⭐⭐⭐ | Fast | 100+ | General multilingual |
| latest_long | ⭐⭐⭐⭐ | Moderate | 125+ | Long-form audio |
| latest_short | ⭐⭐⭐⭐ | Fast | 125+ | Short utterances |
| telephony | ⭐⭐⭐ | Fast | 50+ | Phone audio quality |
| command_and_search | ⭐⭐⭐ | Fastest | 50+ | Commands only |
Implementation
Streaming Recognition
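import (
	"context"

	speech "cloud.google.com/go/speech/apiv1"
	"cloud.google.com/go/speech/apiv1/speechpb"
)

// GoogleSTT holds a Speech-to-Text client and the recognition config
// shared by every stream it opens.
type GoogleSTT struct {
	client   *speech.Client
	config   *speechpb.RecognitionConfig
	language string
}

// NewGoogleSTT creates a client using Application Default Credentials.
func NewGoogleSTT(language string) (*GoogleSTT, error) {
	ctx := context.Background()
	client, err := speech.NewClient(ctx)
	if err != nil {
		return nil, err
	}
	// NOTE: these field names follow the v1 speechpb package. Google documents
	// the Chirp models under the v2 API, so confirm that "chirp_2" is accepted
	// by the API version and region you target.
	config := &speechpb.RecognitionConfig{
		Encoding:                   speechpb.RecognitionConfig_LINEAR16,
		SampleRateHertz:            8000, // telephony-grade audio
		LanguageCode:               language,
		Model:                      "chirp_2",
		UseEnhanced:                true,
		EnableAutomaticPunctuation: true,
	}
	return &GoogleSTT{
		client:   client,
		config:   config,
		language: language,
	}, nil
}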
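// StreamSession wraps one bidirectional streaming call.
type StreamSession struct {
	stream speechpb.Speech_StreamingRecognizeClient
}

// StreamRecognize opens a stream and sends the configuration message,
// which must be the first request on every stream.
func (g *GoogleSTT) StreamRecognize(ctx context.Context) (*StreamSession, error) {
	stream, err := g.client.StreamingRecognize(ctx)
	if err != nil {
		return nil, err
	}
	// Send initial config
	streamingConfig := &speechpb.StreamingRecognitionConfig{
		Config:          g.config,
		InterimResults:  true,  // emit partial transcripts while the caller speaks
		SingleUtterance: false, // keep listening after the first utterance
	}
	if err := stream.Send(&speechpb.StreamingRecognizeRequest{
		StreamingRequest: &speechpb.StreamingRecognizeRequest_StreamingConfig{
			StreamingConfig: streamingConfig,
		},
	}); err != nil {
		return nil, err
	}
	return &StreamSession{stream: stream}, nil
}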
Sending Audio
// SendAudio forwards one chunk of raw audio on the open stream.
func (s *StreamSession) SendAudio(audio []byte) error {
	return s.stream.Send(&speechpb.StreamingRecognizeRequest{
		StreamingRequest: &speechpb.StreamingRecognizeRequest_AudioContent{
			AudioContent: audio,
		},
	})
}
Receiving Results
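// ReceiveResults forwards recognition results until the stream ends.
// The channel is closed on io.EOF or on the first receive error.
func (s *StreamSession) ReceiveResults() <-chan TranscriptEvent {
	results := make(chan TranscriptEvent)
	go func() {
		defer close(results)
		for {
			resp, err := s.stream.Recv()
			if err == io.EOF {
				return // server closed the stream normally
			}
			if err != nil {
				// Log instead of dropping the error silently.
				log.Printf("stt: receive failed: %v", err)
				return
			}
			for _, result := range resp.Results {
				if len(result.Alternatives) == 0 {
					continue
				}
				// The first alternative is the most likely transcript.
				alt := result.Alternatives[0]
				results <- TranscriptEvent{
					Text:       alt.Transcript,
					Confidence: alt.Confidence,
					IsFinal:    result.IsFinal,
					Stability:  result.Stability,
				}
			}
		}
	}()
	return results
}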
Language Support
Indic Languages (Excellent)
| Language | Code | Accuracy | Notes |
|---|---|---|---|
| Hindi | hi-IN | ⭐⭐⭐⭐⭐ | Best-in-class |
| Bengali | bn-IN | ⭐⭐⭐⭐⭐ | Excellent |
| Tamil | ta-IN | ⭐⭐⭐⭐⭐ | Excellent |
| Telugu | te-IN | ⭐⭐⭐⭐⭐ | Excellent |
| Marathi | mr-IN | ⭐⭐⭐⭐ | Very good |
| Gujarati | gu-IN | ⭐⭐⭐⭐ | Very good |
| Kannada | kn-IN | ⭐⭐⭐⭐ | Very good |
| Malayalam | ml-IN | ⭐⭐⭐⭐ | Very good |
| Punjabi | pa-IN | ⭐⭐⭐⭐ | Very good |
| Odia | or-IN | ⭐⭐⭐ | Good |
| Assamese | as-IN | ⭐⭐⭐ | Good |
Multi-Language Detection
{
  "sttConfig": {
    "languageCode": "hi-IN",
    "alternativeLanguageCodes": ["en-IN", "mr-IN"],
    "enableLanguageIdentification": true
  }
}
Speech Adaptation
Improve accuracy for domain-specific terms:
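// NOTE: the inline phrase-set types used here match the v2 speechpb package
// (cloud.google.com/go/speech/apiv2/speechpb). Boost values are illustrative.
config := &speechpb.RecognitionConfig{
	// ... base config
	Adaptation: &speechpb.SpeechAdaptation{
		PhraseSets: []*speechpb.SpeechAdaptation_AdaptationPhraseSet{
			{
				Value: &speechpb.SpeechAdaptation_AdaptationPhraseSet_InlinePhraseSet{
					InlinePhraseSet: &speechpb.PhraseSet{
						// A higher boost biases recognition more strongly toward the phrase.
						Phrases: []*speechpb.PhraseSet_Phrase{
							{Value: "Edesy", Boost: 20},
							{Value: "voice agent", Boost: 15},
							{Value: "STT", Boost: 10},
						},
					},
				},
			},
		},
	},
}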
Endpointing Configuration
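// NOTE: StreamingFeatures and VoiceActivityTimeout follow the v2 speechpb
// package; durationpb is google.golang.org/protobuf/types/known/durationpb.
streamingConfig := &speechpb.StreamingRecognitionConfig{
	Config: config,
	StreamingFeatures: &speechpb.StreamingRecognitionFeatures{
		InterimResults: true,
		VoiceActivityTimeout: &speechpb.StreamingRecognitionFeatures_VoiceActivityTimeout{
			// Give up if no speech starts within 5 seconds.
			SpeechStartTimeout: durationpb.New(5 * time.Second),
			// Treat 1 second of silence after speech as the end of the utterance.
			SpeechEndTimeout: durationpb.New(1 * time.Second),
		},
	},
}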
Error Handling
func (g *GoogleSTT) handleError(err error) {
	// Name the result "st" so the grpc status package is not shadowed.
	st, ok := status.FromError(err)
	if !ok {
		log.Printf("Unknown error: %v", err)
		return
	}
	switch st.Code() {
	case codes.InvalidArgument:
		log.Printf("Invalid audio format or config: %s", st.Message())
	case codes.ResourceExhausted:
		log.Printf("Quota exceeded, implement backoff")
	case codes.Unavailable:
		log.Printf("Service unavailable, reconnecting...")
		g.reconnect()
	case codes.DeadlineExceeded:
		log.Printf("Request timeout")
	default:
		log.Printf("Unhandled gRPC error %s: %s", st.Code(), st.Message())
	}
}
Cost Optimization
Pricing (per minute)
| Model | Standard ($/min) | With data logging ($/min) |
|---|---|---|
| Chirp 2 | $0.016 | $0.012 |
| Enhanced | $0.024 | $0.018 |
| Standard | $0.006 | $0.004 |
Optimization Tips
- Use appropriate model: Chirp 2 for Indic, standard for English
- Enable data logging: 25% cost reduction
- Batch short utterances: Minimum billing is 15 seconds
// Track usage for cost monitoring
func (g *GoogleSTT) trackUsage(audioLength time.Duration) {
	// Minimum billing is 15 seconds; round partial seconds up rather than truncating.
	billedSeconds := max(15, int(math.Ceil(audioLength.Seconds())))
	metrics.RecordCounter("stt.google.billed_seconds", int64(billedSeconds))
	// 0.016 is the Chirp 2 per-minute price from the table above.
	metrics.RecordCounter("stt.google.cost_usd", float64(billedSeconds)/60*0.016)
}
Best Practices
1. Handle Streaming Limits
Google limits streaming sessions to 5 minutes:
// maintainSession reconnects on a fixed interval until ctx is cancelled.
func (g *GoogleSTT) maintainSession(ctx context.Context) {
	ticker := time.NewTicker(4 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// Reconnect before 5-minute limit
			g.reconnect()
		case <-ctx.Done():
			return
		}
	}
}
2. Use Single Utterance for Short Commands
{
  "sttConfig": {
    "singleUtterance": true
  }
}
3. Enable Enhanced Model for Telephony
{
  "sttConfig": {
    "model": "telephony",
    "useEnhanced": true
  }
}
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| No results | Wrong audio format | Verify LINEAR16, 8kHz mono |
| Low accuracy | Wrong model | Use chirp_2 for Indic |
| Session timeout | 5-minute limit | Implement auto-reconnect |
| High latency | Network issues | Use regional endpoint |