LLM Providers Overview
The LLM is the "brain" of your voice agent - it understands user intent and generates appropriate responses. Choosing the right LLM is critical for both quality and latency.
Supported Providers
| Provider | Models | Latency | Cost | Best For |
|---|---|---|---|---|
| Google Gemini | 2.0 Flash, 2.5 Flash-Lite | ⚡ Fastest | 💰 Cheapest | Voice agents |
| Gemini Live | 2.0, 2.5 HD | ⚡⚡ Ultra-fast | 💰💰 | Native audio |
| OpenAI | GPT-4o, GPT-4o-mini | Fast | 💰💰💰 | Complex reasoning |
| Anthropic | Claude 3.5 Sonnet | Fast | 💰💰💰 | Long context |
| Azure OpenAI | GPT-4o, GPT-4o-mini | Fast | 💰💰💰 | Enterprise |
Quick Comparison
Time to First Token (lower is better):
Gemini Live 2.5     █████  50ms
Gemini 2.5 Lite     ██████████  100ms
Gemini 2.0 Flash    ███████████████  150ms
GPT-4o-mini         ██████████████████  180ms
Claude 3.5 Sonnet   ██████████████████████  220ms
GPT-4o              █████████████████████████  250ms
(one █ ≈ 10ms)
Choosing the Right Provider
For Voice Agents (Recommended)
Google Gemini 2.5 Flash-Lite
- Fastest time-to-first-token among the text-pipeline LLMs (~100ms)
- Excellent for conversational AI
- Best cost-performance ratio
- 1M token context window
{
  "llmProvider": "gemini-2.5",
  "llmModel": "gemini-2.5-flash-lite"
}
For Native Audio (Best Latency)
Gemini Live 2.0 / 2.5
- Bypasses STT and TTS entirely
- Audio-to-audio in ~50ms
- Natural voice with emotions
- 30 HD voices (2.5)
{
  "llmProvider": "gemini-live-2.5",
  "geminiliveVoice": "Kore"
}
For Complex Reasoning
OpenAI GPT-4o
- Best overall reasoning capability
- Function calling reliability
- Multi-modal understanding
- Higher latency (~250ms)
{
  "llmProvider": "openai",
  "llmModel": "gpt-4o"
}
For Enterprise / Compliance
Azure OpenAI
- Same models as OpenAI
- Enterprise SLAs
- Data residency options
- SOC 2, HIPAA compliant
{
  "llmProvider": "openai-azure",
  "llmModel": "gpt-4o"
}
Provider Configuration
Basic Setup
{
  "agent": {
    "name": "Customer Support",
    "llmProvider": "gemini-2.5",
    "llmModel": "gemini-2.5-flash-lite",
    "llmTemperature": 0.7,
    "prompt": "You are a helpful customer support agent..."
  }
}
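If you load this configuration in Go, a minimal struct for unmarshalling it could look like the sketch below; the struct and field names simply mirror the JSON keys above and are illustrative, not part of any SDK.
// AgentConfig mirrors the "agent" object in the JSON above (illustrative only).
type AgentConfig struct {
	Name           string  `json:"name"`
	LLMProvider    string  `json:"llmProvider"`
	LLMModel       string  `json:"llmModel"`
	LLMTemperature float64 `json:"llmTemperature"`
	Prompt         string  `json:"prompt"`
}

// loadConfig parses the {"agent": {...}} document using encoding/json.
func loadConfig(raw []byte) (AgentConfig, error) {
	var doc struct {
		Agent AgentConfig `json:"agent"`
	}
	err := json.Unmarshal(raw, &doc)
	return doc.Agent, err
}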
Environment Variables
# Google Gemini
GOOGLE_AI_API_KEY=your_google_ai_key
# OpenAI
OPENAI_API_KEY=your_openai_key
# Anthropic
ANTHROPIC_API_KEY=your_anthropic_key
# Azure OpenAI
AZURE_OPENAI_API_KEY=your_azure_key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com
AZURE_OPENAI_DEPLOYMENT=your-deployment-name
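In Go, a provider client would typically read its key from these variables at startup and fail fast if it is missing; a minimal sketch using only the standard os and fmt packages (requireEnv is an illustrative helper, not a framework function):
// requireEnv returns the named environment variable, or an error if it is
// unset, so a misconfigured provider fails at startup rather than mid-call.
func requireEnv(name string) (string, error) {
	value := os.Getenv(name)
	if value == "" {
		return "", fmt.Errorf("missing required environment variable %s", name)
	}
	return value, nil
}
For example, the Gemini provider would call requireEnv("GOOGLE_AI_API_KEY") before opening a session.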
Streaming
All providers support streaming for minimal latency:
// LLM generates tokens one at a time
for token := range llm.StreamGenerate(ctx, messages) {
	// Send each token to TTS immediately
	tts.QueueText(token)
}
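The loop above assumes StreamGenerate returns a channel of tokens. One common way to provide that shape is sketched below; geminiProvider and nextToken are hypothetical names, not the framework's actual implementation:
// StreamGenerate pushes tokens into a channel and closes it when the model
// finishes, so the range loop above terminates cleanly.
func (p *geminiProvider) StreamGenerate(ctx context.Context, messages []Message) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for {
			token, done, err := p.nextToken(ctx, messages) // hypothetical low-level streaming call
			if err != nil || done {
				return
			}
			select {
			case out <- token:
			case <-ctx.Done():
				return
			}
		}
	}()
	return out
}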
Streaming Timeline
LLM Output: "Your order is on the way and will arrive tomorrow."

Token 1: "Your"
Token 2: "order"
Token 3: "is"
Token 4: "on"
Token 5: "the"
Token 6: "way..."

TTS starts generating audio from Token 1.
User hears audio while the LLM is still generating.
Function Calling
Every provider listed above supports function/tool calling:
{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_order_status",
        "description": "Get the status of a customer order",
        "parameters": {
          "type": "object",
          "properties": {
            "order_id": {
              "type": "string",
              "description": "The order ID to look up"
            }
          },
          "required": ["order_id"]
        }
      }
    }
  ]
}
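When the model decides to call get_order_status, your agent receives the function name plus JSON-encoded arguments, runs the real lookup, and returns the result to the model. A minimal dispatch sketch, where ToolCall, handleToolCall, and lookupOrderStatus are illustrative names rather than framework APIs:
// ToolCall is an illustrative shape for a tool invocation from the LLM.
type ToolCall struct {
	Name      string
	Arguments json.RawMessage
}

// handleToolCall decodes the arguments and dispatches to your backend.
func handleToolCall(call ToolCall) (string, error) {
	switch call.Name {
	case "get_order_status":
		var args struct {
			OrderID string `json:"order_id"`
		}
		if err := json.Unmarshal(call.Arguments, &args); err != nil {
			return "", err
		}
		return lookupOrderStatus(args.OrderID) // placeholder for your real lookup
	default:
		return "", fmt.Errorf("unknown tool: %s", call.Name)
	}
}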
Provider Function Calling Reliability
| Provider | Reliability | Notes |
|---|---|---|
| GPT-4o | ⭐⭐⭐⭐⭐ | Best function calling |
| Gemini 2.0 | ⭐⭐⭐⭐ | Very good |
| Claude 3.5 | ⭐⭐⭐⭐ | Good |
| GPT-4o-mini | ⭐⭐⭐ | Sometimes misses |
Context Management
Context Window Sizes
| Provider | Context Window | Practical Limit |
|---|---|---|
| Gemini 2.0 | 1M tokens | 100K recommended |
| Gemini 1.5 Pro | 2M tokens | 200K recommended |
| GPT-4o | 128K tokens | 32K recommended |
| Claude 3.5 | 200K tokens | 100K recommended |
Optimizing Context
// Keep only recent conversation history
func trimContext(messages []Message, maxTokens int) []Message {
	// Always keep the system prompt (messages[0])
	system := messages[0]
	tokenCount := countTokens(system.Content)

	// Walk backwards from the newest message, keeping what fits
	var recent []Message
	for i := len(messages) - 1; i >= 1; i-- {
		msgTokens := countTokens(messages[i].Content)
		if tokenCount+msgTokens > maxTokens {
			break
		}
		// Prepend so the kept messages stay in chronological order
		recent = append([]Message{messages[i]}, recent...)
		tokenCount += msgTokens
	}
	return append([]Message{system}, recent...)
}
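In practice, trim against the "Practical Limit" column above rather than the full advertised window before each turn, for example:
// Trim to the recommended practical limit, not the full context window.
history = trimContext(history, 32000) // e.g. GPT-4o: 128K window, 32K recommended
response, err := llm.Generate(ctx, history)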
Cost Optimization
Cost per 1M Tokens (Input/Output)
| Provider | Input | Output | Monthly @ 1M calls |
|---|---|---|---|
| Gemini 2.5 Lite | $0.015 | $0.06 | ~$150 |
| Gemini 2.0 Flash | $0.075 | $0.30 | ~$750 |
| GPT-4o-mini | $0.15 | $0.60 | ~$1,500 |
| GPT-4o | $2.50 | $10.00 | ~$25,000 |
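As a rough sanity check on the monthly column: assuming an average of about 10,000 input tokens per call (an illustrative figure, not a measured one), GPT-4o at $2.50 per 1M input tokens costs roughly $0.025 per call, or ~$25,000 across 1M calls before output tokens; the same arithmetic gives ~$150 for Gemini 2.5 Lite.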
Cost Reduction Strategies
- Use Gemini for most calls - 10-100x cheaper than GPT-4o
- Keep prompts short - Every token costs money
- Cache common responses - Don't regenerate identical responses
- Route complex tasks - Use GPT-4o only when needed (see the routing sketch below)
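A minimal routing sketch, assuming the same LLMProvider interface used in the fallback example below; the token threshold is an illustrative cutoff, not a tuned recommendation:
// chooseProvider sends routine turns to the cheap model and reserves the
// expensive model for calls that need tools or a large context.
func chooseProvider(needsTools bool, inputTokens int, cheap, strong LLMProvider) LLMProvider {
	if needsTools || inputTokens > 8000 {
		return strong // e.g. GPT-4o
	}
	return cheap // e.g. Gemini 2.5 Flash-Lite
}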
Fallback Configuration
Configure fallback providers for reliability:
type LLMFallback struct {
	Primary   LLMProvider
	Secondary LLMProvider
	Tertiary  LLMProvider
}

func (f *LLMFallback) Generate(ctx context.Context, messages []Message) (string, error) {
	response, err := f.Primary.Generate(ctx, messages)
	if err == nil {
		return response, nil
	}
	log.Printf("Primary LLM failed: %v, trying secondary", err)

	response, err = f.Secondary.Generate(ctx, messages)
	if err == nil {
		return response, nil
	}
	log.Printf("Secondary LLM failed: %v, trying tertiary", err)

	return f.Tertiary.Generate(ctx, messages)
}
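Wiring it together might look like the following; the constructor names are placeholders for whatever your codebase exposes:
// Gemini handles the normal path; OpenAI and Azure only see traffic on failure.
fallback := &LLMFallback{
	Primary:   newGeminiProvider("gemini-2.5-flash-lite"), // hypothetical constructor
	Secondary: newOpenAIProvider("gpt-4o-mini"),           // hypothetical constructor
	Tertiary:  newAzureProvider("gpt-4o"),                 // hypothetical constructor
}
response, err := fallback.Generate(ctx, messages)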
Next Steps
- Gemini Configuration - Set up Google Gemini
- Gemini Live - Native audio-to-audio
- OpenAI Configuration - Set up GPT-4o
- Function Calling - Add tools to your agent