Google Gemini LLM
Google Gemini is our recommended LLM for voice agents due to its exceptional speed, low cost, and excellent performance with Indic languages.
Why Gemini?
| Feature | Gemini 2.5 Flash-Lite | Gemini 2.0 Flash | GPT-4o |
|---|---|---|---|
| Time to First Token | ~100ms | ~150ms | ~250ms |
| Cost (per 1M tokens) | $0.075 in / $0.30 out | $0.075 in / $0.30 out | $5 in / $15 out |
| Context Window | 1M tokens | 1M tokens | 128K |
| Indic Languages | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
Result: roughly 50-65x cheaper than GPT-4o at list prices, with about 60% lower time to first token
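As a back-of-the-envelope check on that claim, here is a minimal sketch of the per-turn cost math. The token counts are illustrative assumptions; the prices come from the table above.

// Illustrative cost comparison for one voice turn (assumed token counts).
func costPerTurnUSD(inputTokens, outputTokens float64) (gemini, gpt4o float64) {
    // List prices per token, from the table above ($ per 1M tokens / 1e6).
    gemini = inputTokens*0.075/1e6 + outputTokens*0.30/1e6
    gpt4o = inputTokens*5.0/1e6 + outputTokens*15.0/1e6
    return
}

// Example: a turn with ~500 input and ~100 output tokens costs
// about $0.0000675 on Flash-Lite vs about $0.004 on GPT-4o (~59x).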
Configuration
Basic Setup
{
  "agent": {
    "name": "Customer Support",
    "llmProvider": "gemini-2.5",
    "llmModel": "gemini-2.5-flash-lite",
    "llmTemperature": 0.7,
    "prompt": "You are a helpful customer support agent..."
  }
}
Environment Variables
GOOGLE_AI_API_KEY=your_google_ai_api_key
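With the key in place, the Go SDK client can be constructed once at startup. A minimal sketch using github.com/google/generative-ai-go/genai (the helper name newGeminiClient is ours):

import (
    "context"
    "os"

    "github.com/google/generative-ai-go/genai"
    "google.golang.org/api/option"
)

func newGeminiClient(ctx context.Context) (*genai.Client, error) {
    // Reads GOOGLE_AI_API_KEY from the environment, as configured above.
    return genai.NewClient(ctx, option.WithAPIKey(os.Getenv("GOOGLE_AI_API_KEY")))
}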
Advanced Configuration
{
  "llmProvider": "gemini-2.5",
  "llmModel": "gemini-2.5-flash-lite",
  "llmConfig": {
    "temperature": 0.7,
    "maxOutputTokens": 500,
    "topP": 0.95,
    "topK": 40
  }
}
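If you drive the Go SDK directly, the same knobs map onto the generation-config setters on genai.GenerativeModel. A minimal sketch mirroring the JSON above:

model := client.GenerativeModel("gemini-2.5-flash-lite")
model.SetTemperature(0.7)     // sampling randomness
model.SetMaxOutputTokens(500) // cap response length for voice
model.SetTopP(0.95)           // nucleus sampling
model.SetTopK(40)             // top-k sampling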
Model Comparison
| Model | Provider ID | Speed | Intelligence | Cost | Best For |
|---|---|---|---|---|---|
| Gemini 2.5 Flash-Lite | gemini-2.5 | ⚡⚡⚡ Fastest | ⭐⭐⭐⭐ | 💰 Cheapest | Real-time voice agents |
| Gemini 2.0 Flash | gemini | ⚡⚡ Fast | ⭐⭐⭐⭐ | 💰 Cheap | Standard voice agents |
| Gemini 1.5 Pro | gemini-1.5-pro | 🚀 Moderate | ⭐⭐⭐⭐⭐ | 💰💰 | Complex reasoning |
When to Use Each
Gemini 2.5 Flash-Lite (Recommended for Voice):
├── Lowest latency (~100ms TTFT)
├── Best cost-performance ratio
├── Excellent for simple to moderate tasks
└── 1M token context window
Gemini 2.0 Flash:
├── Proven stability
├── Supports Gemini Live (native audio)
├── Great for Indic languages
└── Good balance of speed and capability
Gemini 1.5 Pro:
├── Best reasoning capabilities
├── 2M token context window
├── Complex multi-step tasks
└── Higher latency (not ideal for voice)
Implementation
Streaming Response
// Uses github.com/google/generative-ai-go/genai and google.golang.org/api/iterator.
type GeminiLLM struct {
    client *genai.Client
    model  string
}

func (g *GeminiLLM) StreamGenerate(ctx context.Context, messages []Message) <-chan string {
    tokenChan := make(chan string)
    go func() {
        defer close(tokenChan)
        model := g.client.GenerativeModel(g.model)
        model.SetTemperature(0.7)
        // Convert messages to Gemini format
        var parts []genai.Part
        for _, msg := range messages {
            parts = append(parts, genai.Text(msg.Content))
        }
        iter := model.GenerateContentStream(ctx, parts...)
        for {
            resp, err := iter.Next()
            if err == iterator.Done {
                return
            }
            if err != nil {
                // Don't swallow failures silently: log before closing the stream
                log.Printf("gemini stream error: %v", err)
                return
            }
            // Forward each text part as soon as it arrives
            for _, candidate := range resp.Candidates {
                for _, part := range candidate.Content.Parts {
                    if text, ok := part.(genai.Text); ok {
                        tokenChan <- string(text)
                    }
                }
            }
        }
    }()
    return tokenChan
}
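A caller can then consume the channel with a plain range loop. For example (llm here is a *GeminiLLM):

for token := range llm.StreamGenerate(ctx, messages) {
    fmt.Print(token) // in practice, forward each chunk to the TTS stage
}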
Function Calling
func (g *GeminiLLM) GenerateWithTools(ctx context.Context, messages []Message, tools []Tool) (*Response, error) {
    model := g.client.GenerativeModel(g.model)
    // Convert tools to Gemini format
    geminiTools := []*genai.Tool{
        {
            FunctionDeclarations: convertToGeminiFunctions(tools),
        },
    }
    model.Tools = geminiTools
    // Generate response
    resp, err := model.GenerateContent(ctx, genai.Text(messages[len(messages)-1].Content))
    if err != nil {
        return nil, err
    }
    // Check for function calls
    for _, candidate := range resp.Candidates {
        for _, part := range candidate.Content.Parts {
            if fc, ok := part.(genai.FunctionCall); ok {
                return &Response{
                    ToolCalls: []ToolCall{{
                        Name:      fc.Name,
                        Arguments: fc.Args,
                    }},
                }, nil
            }
        }
    }
    // Return text response
    return extractTextResponse(resp), nil
}
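convertToGeminiFunctions is left undefined above. A minimal sketch, assuming our Tool type carries a name, a description, and a ready-made *genai.Schema for its parameters (those field names are assumptions, not part of the SDK):

func convertToGeminiFunctions(tools []Tool) []*genai.FunctionDeclaration {
    decls := make([]*genai.FunctionDeclaration, 0, len(tools))
    for _, t := range tools {
        decls = append(decls, &genai.FunctionDeclaration{
            Name:        t.Name,
            Description: t.Description,
            // Parameters is a *genai.Schema; here we assume the Tool
            // already stores one (e.g. built with genai.TypeObject).
            Parameters: t.Parameters,
        })
    }
    return decls
}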
Indic Language Excellence
Gemini models excel at Indic languages:
Supported Languages
| Language | Code | Quality |
|---|---|---|
| Hindi | hi | ⭐⭐⭐⭐⭐ |
| Bengali | bn | ⭐⭐⭐⭐⭐ |
| Tamil | ta | ⭐⭐⭐⭐⭐ |
| Telugu | te | ⭐⭐⭐⭐⭐ |
| Marathi | mr | ⭐⭐⭐⭐ |
| Gujarati | gu | ⭐⭐⭐⭐ |
| Kannada | kn | ⭐⭐⭐⭐ |
| Malayalam | ml | ⭐⭐⭐⭐ |
| Punjabi | pa | ⭐⭐⭐⭐ |
| Odia | or | ⭐⭐⭐ |
| Assamese | as | ⭐⭐⭐ |
Hindi Voice Agent Example
{
  "agent": {
    "name": "Hindi Support",
    "language": "hi-IN",
    "llmProvider": "gemini-2.5",
    "llmModel": "gemini-2.5-flash-lite",
    "sttProvider": "google",
    "sttModel": "chirp_2",
    "ttsProvider": "azure",
    "ttsVoice": "hi-IN-SwaraNeural",
    "prompt": "आप एक मददगार ग्राहक सहायता एजेंट हैं..."
  }
}

(The Hindi prompt translates to: "You are a helpful customer support agent...")
Latency Optimization
1. Gemini 2.5 Flash-Lite First
Always try the fastest model first:
func selectGeminiModel(complexity string) string {
    switch complexity {
    case "simple", "moderate":
        return "gemini-2.5-flash-lite" // 100ms TTFT
    case "complex":
        return "gemini-2.0-flash" // 150ms TTFT
    case "reasoning":
        return "gemini-1.5-pro" // Not recommended for voice
    default:
        return "gemini-2.5-flash-lite"
    }
}
2. Pre-warming Connections
// Pre-connect to Gemini on startup
func warmUpGemini(client *genai.Client) {
    model := client.GenerativeModel("gemini-2.5-flash-lite")
    // Send a simple request to warm up the connection
    _, _ = model.GenerateContent(context.Background(), genai.Text("Hi"))
}
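Because the warm-up request costs a full round trip, it is best fired in the background during service startup rather than on the first user turn, e.g.:

go warmUpGemini(client) // fire-and-forget at startup; result is ignored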
3. Streaming with Early TTS
// Start TTS as soon as we get first tokens
func streamToTTS(llmStream <-chan string, tts TTS) {
    var buffer strings.Builder
    tokenCount := 0
    for token := range llmStream {
        buffer.WriteString(token)
        tokenCount++
        // Start TTS after collecting enough for natural speech
        if tokenCount > 5 || strings.ContainsAny(token, ".!?,") {
            text := buffer.String()
            buffer.Reset()
            tokenCount = 0
            tts.StreamSynthesize(text)
        }
    }
    // Flush any trailing text that never hit the token or punctuation threshold
    if buffer.Len() > 0 {
        tts.StreamSynthesize(buffer.String())
    }
}
Safety Settings
Configure content safety for your use case:
model := client.GenerativeModel("gemini-2.5-flash-lite")
model.SafetySettings = []*genai.SafetySetting{
    {
        Category:  genai.HarmCategoryHarassment,
        Threshold: genai.HarmBlockMediumAndAbove,
    },
    {
        Category:  genai.HarmCategoryHateSpeech,
        Threshold: genai.HarmBlockMediumAndAbove,
    },
    {
        Category:  genai.HarmCategoryDangerousContent,
        Threshold: genai.HarmBlockOnlyHigh,
    },
}
Prompt Engineering for Gemini
Voice-Optimized System Prompt
systemPrompt := `You are a helpful customer support agent.
VOICE CONVERSATION RULES:
- Keep responses SHORT (1-2 sentences)
- Use natural, conversational language
- Avoid bullet points and numbered lists
- Say numbers naturally: "one two three" not "123"
- Ask one question at a time
- Confirm before taking any actions
RESPONSE FORMAT:
- Direct, actionable responses
- No emojis or special characters
- No markdown formatting
You have access to these tools:
- get_order_status: Look up order information
- schedule_callback: Schedule a callback
- transfer_call: Transfer to human agent`
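With the Go SDK, a prompt like this can be attached once as a system instruction rather than re-sent on every turn. A minimal sketch (SystemInstruction is a field on genai.GenerativeModel, supported on Gemini 1.5 and later models):

model := client.GenerativeModel("gemini-2.5-flash-lite")
model.SystemInstruction = &genai.Content{
    Parts: []genai.Part{genai.Text(systemPrompt)},
}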
Handling Multi-turn Context
func buildGeminiContext(history []Message, userInput string) []genai.Part {
    var parts []genai.Part
    // Add system prompt
    parts = append(parts, genai.Text(systemPrompt))
    // Add conversation history (last 10 turns = 20 messages)
    recentHistory := history
    if len(history) > 20 {
        recentHistory = history[len(history)-20:]
    }
    for _, msg := range recentHistory {
        prefix := "User: "
        if msg.Role == "assistant" {
            prefix = "Assistant: "
        }
        parts = append(parts, genai.Text(prefix+msg.Content))
    }
    // Add current user input
    parts = append(parts, genai.Text("User: "+userInput))
    return parts
}
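Note that the Go SDK also has a native multi-turn primitive, genai.ChatSession, which tracks history for you. A minimal sketch of the same flow (the sample history turns are illustrative):

cs := model.StartChat()
// Seed prior turns; each history entry is a *genai.Content with a role.
cs.History = []*genai.Content{
    {Role: "user", Parts: []genai.Part{genai.Text("Where is my order?")}},
    {Role: "model", Parts: []genai.Part{genai.Text("Could you share your order ID?")}},
}
resp, err := cs.SendMessage(ctx, genai.Text(userInput))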
Error Handling
func (g *GeminiLLM) generateWithRetry(ctx context.Context, messages []Message) (*Response, error) {
    maxRetries := 3
    backoff := 200 * time.Millisecond
    for i := 0; i < maxRetries; i++ {
        resp, err := g.generate(ctx, messages)
        if err == nil {
            return resp, nil
        }
        // Rate limits: back off exponentially
        if strings.Contains(err.Error(), "429") || strings.Contains(err.Error(), "quota") {
            time.Sleep(backoff)
            backoff *= 2
            continue
        }
        // Transient server errors: retry after a fixed delay
        if strings.Contains(err.Error(), "500") || strings.Contains(err.Error(), "503") {
            time.Sleep(backoff)
            continue
        }
        return nil, err // Non-retryable
    }
    return nil, fmt.Errorf("max retries exceeded")
}
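Matching on err.Error() strings is fragile. Since the Go SDK is built on google.golang.org/api and typically surfaces HTTP failures as *googleapi.Error, a typed check is a sturdier alternative; a sketch, under that assumption:

func isRetryable(err error) bool {
    var gerr *googleapi.Error
    if errors.As(err, &gerr) {
        // 429 = rate limited; 5xx = transient server failure
        return gerr.Code == 429 || gerr.Code >= 500
    }
    return false
}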
Cost Tracking
func (g *GeminiLLM) trackUsage(resp *genai.GenerateContentResponse) {
    if resp.UsageMetadata != nil {
        inputTokens := resp.UsageMetadata.PromptTokenCount
        outputTokens := resp.UsageMetadata.CandidatesTokenCount
        metrics.RecordCounter("llm.gemini.input_tokens", int64(inputTokens))
        metrics.RecordCounter("llm.gemini.output_tokens", int64(outputTokens))
        // Gemini 2.5 Flash-Lite pricing
        inputCost := float64(inputTokens) * 0.000000075 // $0.075/1M tokens
        outputCost := float64(outputTokens) * 0.0000003 // $0.30/1M tokens
        metrics.RecordCounter("llm.gemini.cost_usd", inputCost+outputCost)
    }
}
Fallback to OpenAI
type LLMWithFallback struct {
    gemini *GeminiLLM
    openai *OpenAILLM
}

func (l *LLMWithFallback) Generate(ctx context.Context, messages []Message) (*Response, error) {
    // Try Gemini first (faster, cheaper)
    resp, err := l.gemini.Generate(ctx, messages)
    if err == nil {
        return resp, nil
    }
    log.Printf("Gemini failed: %v, falling back to OpenAI", err)
    // Fallback to OpenAI
    return l.openai.Generate(ctx, messages)
}
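For voice, it can also help to fall back on slowness, not just hard failure. A sketch that wraps the primary call in a latency budget (the 1.5s figure and the method name GenerateWithBudget are illustrative assumptions):

func (l *LLMWithFallback) GenerateWithBudget(ctx context.Context, messages []Message) (*Response, error) {
    // Give Gemini a bounded window before cutting over to OpenAI.
    budgetCtx, cancel := context.WithTimeout(ctx, 1500*time.Millisecond)
    defer cancel()
    resp, err := l.gemini.Generate(budgetCtx, messages)
    if err == nil {
        return resp, nil
    }
    // Timeout or error: retry on the caller's original context.
    return l.openai.Generate(ctx, messages)
}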
Next Steps
- Gemini Live - Native audio-to-audio
- OpenAI Configuration - For complex reasoning
- Function Calling - Add tools
- Latency Optimization - Reduce response time