Gemini Live (Native Audio)
Gemini Live takes a fundamentally different approach to voice agents: it processes audio natively, with no separate STT and TTS steps, which cuts latency dramatically and produces more natural-sounding conversations.
What Makes Gemini Live Different
Traditional Pipeline
```
User Audio → STT (~150ms) → LLM (~200ms) → TTS (~100ms) → Bot Audio
Total: ~450ms
```
Gemini Live Pipeline
```
User Audio → Gemini Live (~200ms) → Bot Audio
Total: ~200ms
```
Result: 50%+ latency reduction
Available Models
| Model | Provider ID | Features | Best For |
|---|---|---|---|
| Gemini 2.0 Flash Live | gemini-live | 7 voices, stable | Production |
| Gemini 2.5 Flash HD | gemini-live-2.5 | 30 HD voices, emotions | Premium experience |
Gemini Live 2.0 vs 2.5
| Feature | 2.0 Flash Live | 2.5 Flash HD |
|---|---|---|
| Voices | 7 standard | 30 HD voices |
| Languages | ~10 | 24 |
| Emotion | Basic | Affective Dialog |
| Interruption | Standard | Improved barge-in |
| Audio Quality | Good | HD quality |
| Latency | ~150ms | ~100ms |
| Stability | Proven | Latest |
Configuration
Basic Setup
```json
{
  "agent": {
    "name": "Voice Assistant",
    "llmProvider": "gemini-live-2.5",
    "geminiliveVoice": "Kore",
    "prompt": "You are a helpful voice assistant..."
  }
}
```
Environment Variables
```bash
GOOGLE_AI_API_KEY=your_google_ai_api_key
```
Available Voices
Gemini Live 2.0 Voices
| Voice | Description |
|---|---|
| Puck | Neutral, versatile |
| Charon | Deep, authoritative |
| Kore | Warm, friendly |
| Fenrir | Energetic |
| Aoede | Clear, professional |
| Leda | Soft, calming |
| Orus | Rich, resonant |
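If voice names come from user-editable agent config, it is worth validating them before opening a session. A small sketch (`isValidLive20Voice` is a hypothetical helper whose set simply mirrors the 2.0 table above):

```go
package main

// gemini20Voices mirrors the Gemini Live 2.0 voice table above.
var gemini20Voices = map[string]bool{
	"Puck": true, "Charon": true, "Kore": true, "Fenrir": true,
	"Aoede": true, "Leda": true, "Orus": true,
}

// isValidLive20Voice reports whether name is a known 2.0 voice.
func isValidLive20Voice(name string) bool {
	return gemini20Voices[name]
}
```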
Gemini Live 2.5 HD Voices (30 voices)
| Category | Voices |
|---|---|
| English (US) | Zephyr, Puck, Charon, Kore, Fenrir, Leda, Orus, Aoede |
| English (UK) | Sage, Vale, River, Luna |
| Hindi | Diya, Arjun, Priya |
| Spanish | Carmen, Miguel, Rosa |
| French | Marie, Pierre, Claire |
| German | Hans, Greta |
| Japanese | Yuki, Kenji |
| And more... | 30 total HD voices |
Implementation
WebSocket Connection
Gemini Live uses a bidirectional WebSocket for real-time audio:
```go
import (
	"context"

	"github.com/gorilla/websocket"
)

type GeminiLiveClient struct {
	conn         *websocket.Conn
	systemPrompt string
	audioIn      chan []byte
	audioOut     chan []byte
	textIn       chan string
	textOut      chan string
}

func (c *GeminiLiveClient) Connect(ctx context.Context) error {
	// Connect to the Gemini Live WebSocket (authentication, e.g. an API key
	// query parameter, must be added per the current Google AI docs)
	url := "wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent"
	conn, _, err := websocket.DefaultDialer.DialContext(ctx, url, nil)
	if err != nil {
		return err
	}
	c.conn = conn

	// Send the initial setup message: model, audio-only output, voice, prompt
	setup := map[string]any{
		"setup": map[string]any{
			"model": "models/gemini-2.0-flash-live-001",
			"generationConfig": map[string]any{
				"responseModalities": []string{"AUDIO"},
				"speechConfig": map[string]any{
					"voiceConfig": map[string]any{
						"prebuiltVoiceConfig": map[string]any{
							"voiceName": "Kore",
						},
					},
				},
			},
			"systemInstruction": map[string]any{
				"parts": []map[string]any{
					{"text": c.systemPrompt},
				},
			},
		},
	}
	return c.conn.WriteJSON(setup)
}
```
Sending Audio
```go
func (c *GeminiLiveClient) SendAudio(audio []byte) error {
	// Gemini Live expects base64-encoded PCM in realtimeInput chunks
	b64Audio := base64.StdEncoding.EncodeToString(audio)
	msg := map[string]any{
		"realtimeInput": map[string]any{
			"mediaChunks": []map[string]any{
				{
					"mimeType": "audio/pcm;rate=16000",
					"data":     b64Audio,
				},
			},
		},
	}
	return c.conn.WriteJSON(msg)
}
```
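Telephony sources typically deliver audio in small frames. At 16 kHz, 16-bit mono, a 20 ms frame is 16000 × 0.02 × 2 = 640 bytes. A sketch of a hypothetical helper that splits a PCM buffer into such frames before they are passed to `SendAudio` in a paced loop:

```go
package main

// frameSize20ms is 20 ms of 16 kHz, 16-bit mono PCM:
// 16000 samples/s * 0.020 s * 2 bytes = 640 bytes.
const frameSize20ms = 640

// splitFrames chops a PCM buffer into 20 ms frames; a trailing
// short frame, if any, is returned as-is.
func splitFrames(pcm []byte) [][]byte {
	var frames [][]byte
	for len(pcm) > 0 {
		n := frameSize20ms
		if len(pcm) < n {
			n = len(pcm)
		}
		frames = append(frames, pcm[:n])
		pcm = pcm[n:]
	}
	return frames
}
```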
Receiving Audio
```go
func (c *GeminiLiveClient) ReceiveAudio() <-chan []byte {
	audioChan := make(chan []byte)
	go func() {
		defer close(audioChan)
		for {
			_, msg, err := c.conn.ReadMessage()
			if err != nil {
				return
			}
			var response map[string]any
			if err := json.Unmarshal(msg, &response); err != nil {
				continue
			}
			// Extract audio from serverContent.modelTurn.parts,
			// checking each type assertion so a non-audio message
			// (e.g. setupComplete) can't panic the goroutine
			serverContent, ok := response["serverContent"].(map[string]any)
			if !ok {
				continue
			}
			modelTurn, ok := serverContent["modelTurn"].(map[string]any)
			if !ok {
				continue
			}
			parts, ok := modelTurn["parts"].([]any)
			if !ok {
				continue
			}
			for _, part := range parts {
				p, ok := part.(map[string]any)
				if !ok {
					continue
				}
				inlineData, ok := p["inlineData"].(map[string]any)
				if !ok {
					continue
				}
				audioB64, ok := inlineData["data"].(string)
				if !ok {
					continue
				}
				audio, err := base64.StdEncoding.DecodeString(audioB64)
				if err != nil {
					continue
				}
				audioChan <- audio
			}
		}
	}()
	return audioChan
}
```
Audio Specifications
| Parameter | Gemini Live Requirement | Telephony Standard |
|---|---|---|
| Sample Rate | 16000 Hz | 8000 Hz |
| Channels | Mono | Mono |
| Bit Depth | 16-bit PCM | 16-bit PCM |
| Encoding | Linear PCM | μ-law (Twilio) |
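Since Twilio delivers μ-law-encoded audio, bridging the two formats also needs G.711 expansion to linear PCM before resampling. A sketch of the textbook μ-law decoder (this is the standard algorithm, not a platform API):

```go
package main

// mulawDecode expands one G.711 mu-law byte to a 16-bit linear sample.
func mulawDecode(u byte) int16 {
	u = ^u // mu-law bytes are stored complemented
	exponent := (u >> 4) & 0x07
	mantissa := u & 0x0F
	// Reconstruct the magnitude with the standard 0x84 bias.
	t := (int16(mantissa)<<3 + 0x84) << exponent
	if u&0x80 != 0 {
		return 0x84 - t // sign bit set: negative sample
	}
	return t - 0x84
}
```

Decoding a whole Twilio frame is then just a loop over its bytes, producing an `[]int16` ready for `upsample8to16`.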
Audio Conversion
```go
// Upsample 8 kHz to 16 kHz for Gemini
func upsample8to16(input []int16) []int16 {
	output := make([]int16, len(input)*2)
	for i, sample := range input {
		output[i*2] = sample
		output[i*2+1] = sample // simple duplication
	}
	return output
}

// Downsample 16 kHz to 8 kHz for telephony
func downsample16to8(input []int16) []int16 {
	output := make([]int16, len(input)/2)
	for i := 0; i < len(output); i++ {
		// Average each pair of samples (a crude low-pass filter)
		output[i] = int16((int32(input[i*2]) + int32(input[i*2+1])) / 2)
	}
	return output
}
```
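Plain sample duplication introduces imaging artifacts; linear interpolation between neighboring samples is an equally cheap step up in quality. A sketch (a hypothetical alternative to the duplicating upsampler, not the platform's resampler):

```go
package main

// upsample8to16Interp doubles the sample rate, inserting the midpoint
// between each pair of neighboring input samples instead of duplicating.
func upsample8to16Interp(input []int16) []int16 {
	if len(input) == 0 {
		return nil
	}
	output := make([]int16, len(input)*2)
	for i, sample := range input {
		output[i*2] = sample
		if i+1 < len(input) {
			// Midpoint between this sample and the next one
			output[i*2+1] = int16((int32(sample) + int32(input[i+1])) / 2)
		} else {
			output[i*2+1] = sample // last sample: nothing to interpolate toward
		}
	}
	return output
}
```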
Affective Dialog (2.5 HD)
Gemini Live 2.5 supports emotional awareness:
```go
// The model can express and detect emotions
prompt := `You are a warm, empathetic customer support agent.
When the customer sounds frustrated, acknowledge their feelings.
When they sound happy, match their energy.
Express genuine care in your voice.`
```
Emotion Handling
Customer (frustrated): "I've been waiting THREE DAYS!"
Gemini 2.5: [Calm, empathetic tone]
"I completely understand your frustration.
Three days is too long, and I apologize.
Let me fix this for you right now."
Customer (happy): "It finally arrived! Thank you!"
Gemini 2.5: [Warm, enthusiastic tone]
"That's wonderful news! I'm so glad it
reached you safely. Enjoy!"
Function Calling with Gemini Live
Gemini Live supports function calling alongside audio:
```go
setup := map[string]any{
	"setup": map[string]any{
		"model": "models/gemini-2.0-flash-live-001",
		"tools": []map[string]any{
			{
				"functionDeclarations": []map[string]any{
					{
						"name":        "get_order_status",
						"description": "Get the status of a customer order",
						"parameters": map[string]any{
							"type": "object",
							"properties": map[string]any{
								"order_id": map[string]any{
									"type":        "string",
									"description": "The order ID",
								},
							},
							"required": []string{"order_id"},
						},
					},
				},
			},
		},
	},
}
```
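When the model decides to call a tool, the server sends a toolCall message containing one or more function calls; the client runs them and replies over the same socket. A sketch of extracting the calls from a raw message (field names follow the message shapes used above — verify them against the current API reference):

```go
package main

import "encoding/json"

// FunctionCall is one tool invocation extracted from a toolCall message.
type FunctionCall struct {
	ID   string
	Name string
	Args map[string]any
}

// extractFunctionCalls pulls function calls out of a decoded Gemini Live
// server message; it returns nil when the message carries no toolCall.
func extractFunctionCalls(msg []byte) []FunctionCall {
	var payload struct {
		ToolCall struct {
			FunctionCalls []struct {
				ID   string         `json:"id"`
				Name string         `json:"name"`
				Args map[string]any `json:"args"`
			} `json:"functionCalls"`
		} `json:"toolCall"`
	}
	if err := json.Unmarshal(msg, &payload); err != nil {
		return nil
	}
	var calls []FunctionCall
	for _, fc := range payload.ToolCall.FunctionCalls {
		calls = append(calls, FunctionCall{ID: fc.ID, Name: fc.Name, Args: fc.Args})
	}
	return calls
}
```

After executing each call, the result is typically sent back as a `toolResponse` message with `functionResponses` entries echoing the call's id and name — again, confirm the exact field names against the current docs.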
Interruption Handling
Gemini Live has improved barge-in support:
```go
// Mark the client's turn as complete. Gemini Live detects barge-in
// server-side; when the user interrupts, the server stops generating
// and the client's job is to flush any locally buffered bot audio.
func (c *GeminiLiveClient) SendInterrupt() error {
	msg := map[string]any{
		"clientContent": map[string]any{
			"turnComplete": true,
		},
	}
	return c.conn.WriteJSON(msg)
}
```
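The server marks a cancelled turn with an interrupted flag inside serverContent; on seeing it, the client should drop queued playback so the bot stops speaking immediately. A sketch of detecting the flag (field name per the Live API server messages — verify against the current docs):

```go
package main

import "encoding/json"

// wasInterrupted reports whether a Gemini Live server message carries
// the serverContent.interrupted flag, meaning the user barged in and
// any buffered bot audio should be discarded.
func wasInterrupted(msg []byte) bool {
	var payload struct {
		ServerContent struct {
			Interrupted bool `json:"interrupted"`
		} `json:"serverContent"`
	}
	if err := json.Unmarshal(msg, &payload); err != nil {
		return false
	}
	return payload.ServerContent.Interrupted
}
```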
When to Use Gemini Live
Ideal Use Cases
- ✅ High-volume call centers (latency matters)
- ✅ Simple, conversational interactions
- ✅ Multi-language support needed
- ✅ Emotional/empathetic conversations
- ✅ Real-time voice assistants
Consider Alternatives When
- ❌ Need specific STT features (custom vocabulary)
- ❌ Need specific TTS voices (brand voice)
- ❌ Require transcript processing
- ❌ Complex multi-turn reasoning
- ❌ Need GPT-4o level intelligence
Fallback Strategy
Use Gemini Live as primary with traditional pipeline as fallback:
```go
func processCall(ctx context.Context, user *User) {
	// Try Gemini Live first
	if agent.LLMProvider == "gemini-live-2.5" {
		err := processWithGeminiLive(ctx, user)
		if err == nil {
			return
		}
		log.Printf("Gemini Live failed: %v, falling back", err)
	}
	// Fall back to the traditional STT → LLM → TTS pipeline
	processWithTraditionalPipeline(ctx, user)
}
```
Next Steps
- Gemini 2.0/2.5 Flash - Traditional Gemini setup
- Latency Optimization - Further improvements
- Function Calling - Add tools