Gemini Live (Native Audio)
Gemini Live takes a fundamentally different approach to voice agents: it processes audio natively, with no separate STT and TTS steps, which cuts latency dramatically and produces more natural-sounding conversations.
What Makes Gemini Live Different
Traditional Pipeline
```
User Audio → STT (~150ms) → LLM (~200ms) → TTS (~100ms) → Bot Audio
Total: ~450ms
```
Gemini Live Pipeline
```
User Audio → Gemini Live (~200ms) → Bot Audio
Total: ~200ms
```
Result: 50%+ latency reduction
Available Models
| Model | Provider ID | Features | Best For |
|---|---|---|---|
| Gemini 2.0 Flash Live | gemini-live | 7 voices, stable | Production |
| Gemini 2.5 Flash HD | gemini-live-2.5 | 30 HD voices, emotions | Premium experience |
Gemini Live 2.0 vs 2.5
| Feature | 2.0 Flash Live | 2.5 Flash HD |
|---|---|---|
| Voices | 7 standard | 30 HD voices |
| Languages | ~10 | 24 |
| Emotion | Basic | Affective Dialog |
| Interruption | Standard | Improved barge-in |
| Audio Quality | Good | HD quality |
| Latency | ~150ms | ~100ms |
| Stability | Proven | Latest |
Configuration
Basic Setup
```json
{
  "agent": {
    "name": "Voice Assistant",
    "llmProvider": "gemini-live-2.5",
    "geminiliveVoice": "Kore",
    "prompt": "You are a helpful voice assistant..."
  }
}
```
Environment Variables
```bash
GOOGLE_AI_API_KEY=your_google_ai_api_key
```
Available Voices
Gemini Live 2.0 Voices
| Voice | Description |
|---|---|
| Puck | Neutral, versatile |
| Charon | Deep, authoritative |
| Kore | Warm, friendly |
| Fenrir | Energetic |
| Aoede | Clear, professional |
| Leda | Soft, calming |
| Orus | Rich, resonant |
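If voice names come from user-editable agent config, it is worth validating them before opening a session. A small sketch (`isValidLive20Voice` is a hypothetical helper whose set simply mirrors the 2.0 table above):

```go
package main

// gemini20Voices mirrors the Gemini Live 2.0 voice table above.
var gemini20Voices = map[string]bool{
	"Puck": true, "Charon": true, "Kore": true, "Fenrir": true,
	"Aoede": true, "Leda": true, "Orus": true,
}

// isValidLive20Voice reports whether name is a known 2.0 voice.
func isValidLive20Voice(name string) bool {
	return gemini20Voices[name]
}
```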
Gemini Live 2.5 HD Voices (30 voices)
| Category | Voices |
|---|---|
| English (US) | Zephyr, Puck, Charon, Kore, Fenrir, Leda, Orus, Aoede |
| English (UK) | Sage, Vale, River, Luna |
| Hindi | Diya, Arjun, Priya |
| Spanish | Carmen, Miguel, Rosa |
| French | Marie, Pierre, Claire |
| German | Hans, Greta |
| Japanese | Yuki, Kenji |
| And more... | 30 total HD voices |
Implementation
WebSocket Connection
Gemini Live uses a bidirectional WebSocket for real-time audio:
```go
import (
	"context"

	"github.com/gorilla/websocket"
)

type GeminiLiveClient struct {
	conn         *websocket.Conn
	systemPrompt string
	audioIn      chan []byte
	audioOut     chan []byte
	textIn       chan string
	textOut      chan string
}

func (c *GeminiLiveClient) Connect(ctx context.Context) error {
	// Connect to the Gemini Live WebSocket (authentication, e.g. an API key
	// query parameter, must be added per the current Google AI docs)
	url := "wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent"
	conn, _, err := websocket.DefaultDialer.DialContext(ctx, url, nil)
	if err != nil {
		return err
	}
	c.conn = conn

	// Send the initial setup message: model, audio-only output, voice, prompt
	setup := map[string]any{
		"setup": map[string]any{
			"model": "models/gemini-2.0-flash-live-001",
			"generationConfig": map[string]any{
				"responseModalities": []string{"AUDIO"},
				"speechConfig": map[string]any{
					"voiceConfig": map[string]any{
						"prebuiltVoiceConfig": map[string]any{
							"voiceName": "Kore",
						},
					},
				},
			},
			"systemInstruction": map[string]any{
				"parts": []map[string]any{
					{"text": c.systemPrompt},
				},
			},
		},
	}
	return c.conn.WriteJSON(setup)
}
```
Sending Audio
```go
func (c *GeminiLiveClient) SendAudio(audio []byte) error {
	// Gemini Live expects base64-encoded PCM in realtimeInput chunks
	b64Audio := base64.StdEncoding.EncodeToString(audio)
	msg := map[string]any{
		"realtimeInput": map[string]any{
			"mediaChunks": []map[string]any{
				{
					"mimeType": "audio/pcm;rate=16000",
					"data":     b64Audio,
				},
			},
		},
	}
	return c.conn.WriteJSON(msg)
}
```
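Telephony sources typically deliver audio in small frames. At 16 kHz, 16-bit mono, a 20 ms frame is 16000 × 0.02 × 2 = 640 bytes. A sketch of a hypothetical helper that splits a PCM buffer into such frames before they are passed to `SendAudio` in a paced loop:

```go
package main

// frameSize20ms is 20 ms of 16 kHz, 16-bit mono PCM:
// 16000 samples/s * 0.020 s * 2 bytes = 640 bytes.
const frameSize20ms = 640

// splitFrames chops a PCM buffer into 20 ms frames; a trailing
// short frame, if any, is returned as-is.
func splitFrames(pcm []byte) [][]byte {
	var frames [][]byte
	for len(pcm) > 0 {
		n := frameSize20ms
		if len(pcm) < n {
			n = len(pcm)
		}
		frames = append(frames, pcm[:n])
		pcm = pcm[n:]
	}
	return frames
}
```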
Receiving Audio
```go
func (c *GeminiLiveClient) ReceiveAudio() <-chan []byte {
	audioChan := make(chan []byte)
	go func() {
		defer close(audioChan)
		for {
			_, msg, err := c.conn.ReadMessage()
			if err != nil {
				return
			}
			var response map[string]any
			if err := json.Unmarshal(msg, &response); err != nil {
				continue
			}
			// Extract audio from serverContent.modelTurn.parts,
			// checking each type assertion so a non-audio message
			// (e.g. setupComplete) can't panic the goroutine
			serverContent, ok := response["serverContent"].(map[string]any)
			if !ok {
				continue
			}
			modelTurn, ok := serverContent["modelTurn"].(map[string]any)
			if !ok {
				continue
			}
			parts, ok := modelTurn["parts"].([]any)
			if !ok {
				continue
			}
			for _, part := range parts {
				p, ok := part.(map[string]any)
				if !ok {
					continue
				}
				inlineData, ok := p["inlineData"].(map[string]any)
				if !ok {
					continue
				}
				audioB64, ok := inlineData["data"].(string)
				if !ok {
					continue
				}
				audio, err := base64.StdEncoding.DecodeString(audioB64)
				if err != nil {
					continue
				}
				audioChan <- audio
			}
		}
	}()
	return audioChan
}
```
Audio Specifications
| Parameter | Gemini Live Requirement | Telephony Standard |
|---|---|---|
| Sample Rate | 16000 Hz | 8000 Hz |
| Channels | Mono | Mono |
| Bit Depth | 16-bit PCM | 16-bit PCM |
| Encoding | Linear PCM | μ-law (Twilio) |
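Since Twilio delivers μ-law-encoded audio, bridging the two formats also needs G.711 expansion to linear PCM before resampling. A sketch of the textbook μ-law decoder (this is the standard algorithm, not a platform API):

```go
package main

// mulawDecode expands one G.711 mu-law byte to a 16-bit linear sample.
func mulawDecode(u byte) int16 {
	u = ^u // mu-law bytes are stored complemented
	exponent := (u >> 4) & 0x07
	mantissa := u & 0x0F
	// Reconstruct the magnitude with the standard 0x84 bias.
	t := (int16(mantissa)<<3 + 0x84) << exponent
	if u&0x80 != 0 {
		return 0x84 - t // sign bit set: negative sample
	}
	return t - 0x84
}
```

Decoding a whole Twilio frame is then just a loop over its bytes, producing an `[]int16` ready for `upsample8to16`.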
Audio Conversion
```go
// Upsample 8 kHz to 16 kHz for Gemini
func upsample8to16(input []int16) []int16 {
	output := make([]int16, len(input)*2)
	for i, sample := range input {
		output[i*2] = sample
		output[i*2+1] = sample // simple duplication
	}
	return output
}

// Downsample 16 kHz to 8 kHz for telephony
func downsample16to8(input []int16) []int16 {
	output := make([]int16, len(input)/2)
	for i := 0; i < len(output); i++ {
		// Average each pair of samples (a crude low-pass filter)
		output[i] = int16((int32(input[i*2]) + int32(input[i*2+1])) / 2)
	}
	return output
}
```
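Plain sample duplication introduces imaging artifacts; linear interpolation between neighboring samples is an equally cheap step up in quality. A sketch (a hypothetical alternative to the duplicating upsampler, not the platform's resampler):

```go
package main

// upsample8to16Interp doubles the sample rate, inserting the midpoint
// between each pair of neighboring input samples instead of duplicating.
func upsample8to16Interp(input []int16) []int16 {
	if len(input) == 0 {
		return nil
	}
	output := make([]int16, len(input)*2)
	for i, sample := range input {
		output[i*2] = sample
		if i+1 < len(input) {
			// Midpoint between this sample and the next one
			output[i*2+1] = int16((int32(sample) + int32(input[i+1])) / 2)
		} else {
			output[i*2+1] = sample // last sample: nothing to interpolate toward
		}
	}
	return output
}
```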
Affective Dialog (2.5 HD)
Gemini Live 2.5 supports emotional awareness:
```go
// The model can express and detect emotions
prompt := `You are a warm, empathetic customer support agent.
When the customer sounds frustrated, acknowledge their feelings.
When they sound happy, match their energy.
Express genuine care in your voice.`
```
Emotion Handling
Customer (frustrated): "I've been waiting THREE DAYS!"
Gemini 2.5: [Calm, empathetic tone]
"I completely understand your frustration.
Three days is too long, and I apologize.
Let me fix this for you right now."
Customer (happy): "It finally arrived! Thank you!"
Gemini 2.5: [Warm, enthusiastic tone]
"That's wonderful news! I'm so glad it
reached you safely. Enjoy!"
Function Calling with Gemini Live
Gemini Live supports function calling alongside audio:
```go
setup := map[string]any{
	"setup": map[string]any{
		"model": "models/gemini-2.0-flash-live-001",
		"tools": []map[string]any{
			{
				"functionDeclarations": []map[string]any{
					{
						"name":        "get_order_status",
						"description": "Get the status of a customer order",
						"parameters": map[string]any{
							"type": "object",
							"properties": map[string]any{
								"order_id": map[string]any{
									"type":        "string",
									"description": "The order ID",
								},
							},
							"required": []string{"order_id"},
						},
					},
				},
			},
		},
	},
}
```
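When the model decides to call a tool, the server sends a toolCall message containing one or more function calls; the client runs them and replies over the same socket. A sketch of extracting the calls from a raw message (field names follow the message shapes used above — verify them against the current API reference):

```go
package main

import "encoding/json"

// FunctionCall is one tool invocation extracted from a toolCall message.
type FunctionCall struct {
	ID   string
	Name string
	Args map[string]any
}

// extractFunctionCalls pulls function calls out of a decoded Gemini Live
// server message; it returns nil when the message carries no toolCall.
func extractFunctionCalls(msg []byte) []FunctionCall {
	var payload struct {
		ToolCall struct {
			FunctionCalls []struct {
				ID   string         `json:"id"`
				Name string         `json:"name"`
				Args map[string]any `json:"args"`
			} `json:"functionCalls"`
		} `json:"toolCall"`
	}
	if err := json.Unmarshal(msg, &payload); err != nil {
		return nil
	}
	var calls []FunctionCall
	for _, fc := range payload.ToolCall.FunctionCalls {
		calls = append(calls, FunctionCall{ID: fc.ID, Name: fc.Name, Args: fc.Args})
	}
	return calls
}
```

After executing each call, the result is typically sent back as a `toolResponse` message with `functionResponses` entries echoing the call's id and name — again, confirm the exact field names against the current docs.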
Interruption Handling
Gemini Live has improved barge-in support:
```go
// Mark the client's turn as complete. Gemini Live detects barge-in
// server-side; when the user interrupts, the server stops generating
// and the client's job is to flush any locally buffered bot audio.
func (c *GeminiLiveClient) SendInterrupt() error {
	msg := map[string]any{
		"clientContent": map[string]any{
			"turnComplete": true,
		},
	}
	return c.conn.WriteJSON(msg)
}
```
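The server marks a cancelled turn with an interrupted flag inside serverContent; on seeing it, the client should drop queued playback so the bot stops speaking immediately. A sketch of detecting the flag (field name per the Live API server messages — verify against the current docs):

```go
package main

import "encoding/json"

// wasInterrupted reports whether a Gemini Live server message carries
// the serverContent.interrupted flag, meaning the user barged in and
// any buffered bot audio should be discarded.
func wasInterrupted(msg []byte) bool {
	var payload struct {
		ServerContent struct {
			Interrupted bool `json:"interrupted"`
		} `json:"serverContent"`
	}
	if err := json.Unmarshal(msg, &payload); err != nil {
		return false
	}
	return payload.ServerContent.Interrupted
}
```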
When to Use Gemini Live
Ideal Use Cases
- ✅ High-volume call centers (latency matters)
- ✅ Simple, conversational interactions
- ✅ Multi-language support needed
- ✅ Emotional/empathetic conversations
- ✅ Real-time voice assistants
Consider Alternatives When
- ❌ Need specific STT features (custom vocabulary)
- ❌ Need specific TTS voices (brand voice)
- ❌ Require transcript processing
- ❌ Complex multi-turn reasoning
- ❌ Need GPT-4o level intelligence
Fallback Strategy
Use Gemini Live as primary with traditional pipeline as fallback:
```go
func processCall(ctx context.Context, user *User) {
	// Try Gemini Live first
	if agent.LLMProvider == "gemini-live-2.5" {
		err := processWithGeminiLive(ctx, user)
		if err == nil {
			return
		}
		log.Printf("Gemini Live failed: %v, falling back", err)
	}
	// Fall back to the traditional STT → LLM → TTS pipeline
	processWithTraditionalPipeline(ctx, user)
}
```
Next Steps
- Gemini 2.0/2.5 Flash - Traditional Gemini setup
- Latency Optimization - Further improvements
- Function Calling - Add tools