# Core Concepts
Before diving into implementation, it's important to understand the core concepts that make Edesy Voice Agent work.
## The Voice Agent Pipeline
A voice agent is essentially a real-time processing pipeline that transforms speech into intelligent responses:
```
┌──────────────────────────────────────────────────────────────────┐
│                       Voice Agent Pipeline                       │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐  ┌──────┐ │
│  │  Audio  │──▶│   VAD   │──▶│   STT   │──▶│   LLM   │─▶│ TTS  │ │
│  │  Input  │   │         │   │         │   │         │  │      │ │
│  └─────────┘   └────┬────┘   └─────────┘   └─────────┘  └──┬───┘ │
│                     │                                      │     │
│                     │             Interruption             │     │
│                     └──────────────────────────────────────┤     │
│                                                            ▼     │
│                                                     ┌──────────┐ │
│                                                     │  Audio   │ │
│                                                     │  Output  │ │
│                                                     └──────────┘ │
└──────────────────────────────────────────────────────────────────┘
```
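Each stage can be thought of as a goroutine reading from one channel and writing to the next. The sketch below shows that shape with a generic `stage` function; everything in it is illustrative, not the actual Edesy API.

```go
package main

import (
    "context"
    "fmt"
)

// stage is the generic shape of one pipeline step: read values from in,
// transform them with fn, and forward the result on out.
func stage[In, Out any](ctx context.Context, in <-chan In, out chan<- Out, fn func(In) Out) {
    defer close(out)
    for {
        select {
        case <-ctx.Done():
            return
        case v, ok := <-in:
            if !ok {
                return
            }
            out <- fn(v)
        }
    }
}

func main() {
    ctx := context.Background()
    audio := make(chan []byte, 4)
    text := make(chan string, 4)

    // A stand-in "STT" stage that just reports chunk sizes.
    go stage(ctx, audio, text, func(b []byte) string {
        return fmt.Sprintf("transcribed %d bytes", len(b))
    })

    audio <- make([]byte, 320) // one 20ms chunk of 8kHz 16-bit PCM
    close(audio)
    fmt.Println(<-text)
}
```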
## Frames
Frames are the fundamental unit of data in the pipeline. Everything - audio, text, control signals - flows through the system as frames.
### Frame Types
| Frame Type | Purpose | Example |
|---|---|---|
| `InputAudioFrame` | Raw audio from user | 8kHz PCM audio bytes |
| `TranscriptionFrame` | Text from STT | "What is my order status?" |
| `InterimTranscriptionFrame` | Partial STT result | "What is my" |
| `LLMResponseFrame` | Generated response | "Your order is on the way" |
| `TTSAudioFrame` | Synthesized speech | Audio bytes |
| `InterruptionFrame` | User barge-in signal | Cancel current output |
| `EndFrame` | Call termination | Reason, disposition |
| `FunctionCallFrame` | Tool invocation | Function name, arguments |
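In Go, frames like these are commonly modeled as a marker interface with one concrete struct per frame type, so processors can type-switch on them. The definitions below are a sketch of that pattern, not the actual Edesy types.

```go
package frames

import "time"

// Frame is a marker interface; every payload in the pipeline implements it.
// These definitions are illustrative, not the actual Edesy frame types.
type Frame interface {
    FrameType() string
}

type InputAudioFrame struct {
    Audio     []byte
    Timestamp time.Duration
}

func (InputAudioFrame) FrameType() string { return "input_audio" }

type TranscriptionFrame struct {
    Text    string
    IsFinal bool
}

func (TranscriptionFrame) FrameType() string { return "transcription" }

// A processor switches on the concrete type to route each frame.
func handle(f Frame) {
    switch fr := f.(type) {
    case InputAudioFrame:
        _ = fr.Audio // forward to VAD/STT
    case TranscriptionFrame:
        _ = fr.Text // forward to LLM
    }
}
```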
### Frame Flow Example
User says: "What's the status of order twelve thirty four?"
```
Frame 1:   InputAudioFrame           { audio: [bytes...], timestamp: 0ms }
Frame 2:   InputAudioFrame           { audio: [bytes...], timestamp: 20ms }
...
Frame N:   InterimTranscriptionFrame { text: "What's the status" }
Frame N+1: InterimTranscriptionFrame { text: "What's the status of order" }
Frame N+2: TranscriptionFrame        { text: "What's the status of order 1234?", is_final: true }
Frame N+3: LLMResponseFrame          { text: "Let me check that for you..." }
Frame N+4: FunctionCallFrame         { name: "get_order_status", args: { order_id: "1234" } }
Frame N+5: LLMResponseFrame          { text: "Your order 1234 is out for delivery..." }
Frame N+6: TTSAudioFrame             { audio: [bytes...] }
```
## Sessions
A session represents a single call/conversation with state:
```go
type Session struct {
    // Identity
    CallSid   string // Unique call identifier
    StreamSid string // Audio stream identifier
    UserIdPin string // Internal reference

    // Configuration
    Agent     *AgentConfig
    Language  string
    Providers ProviderConfig

    // State
    Transcript []Message
    Variables  map[string]string
    StartTime  time.Time
    Status     CallStatus

    // Channels
    STTChannel chan string
    TTSChannel chan string
    EventChan  chan Event
}
```
### Session Lifecycle
```
1. INITIALIZING
   └── WebSocket connected, loading agent config
2. GREETING
   └── Playing initial greeting message
3. LISTENING
   └── Waiting for user speech
4. PROCESSING
   └── STT → LLM → TTS pipeline active
5. SPEAKING
   └── Playing TTS audio to user
6. TRANSFERRING (optional)
   └── Connecting to human agent
7. ENDING
   └── Cleanup, save recordings, log disposition
```
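These states map naturally onto the `CallStatus` field of the Session struct. A plausible enum sketch (the constant names are assumed, not confirmed):

```go
package session

// CallStatus enumerates the lifecycle states above. The constant names
// are illustrative; they mirror the Session.Status field.
type CallStatus int

const (
    StatusInitializing CallStatus = iota
    StatusGreeting
    StatusListening
    StatusProcessing
    StatusSpeaking
    StatusTransferring
    StatusEnding
)

func (s CallStatus) String() string {
    return [...]string{
        "INITIALIZING", "GREETING", "LISTENING",
        "PROCESSING", "SPEAKING", "TRANSFERRING", "ENDING",
    }[s]
}
```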
## Providers
Providers are pluggable components that handle specific tasks:
### Provider Interface
Each provider type implements a standard interface:
```go
// STT Provider
type STTProvider interface {
    Connect(ctx context.Context) error
    SendAudio(audio []byte) error
    ReceiveTranscript() <-chan TranscriptResult
    Close() error
}

// TTS Provider
type TTSProvider interface {
    Synthesize(ctx context.Context, text string) (<-chan []byte, error)
    Close() error
}

// LLM Provider
type LLMProvider interface {
    Generate(ctx context.Context, messages []Message, tools []Tool) (<-chan string, error)
    Close() error
}

// Call Provider (Telephony)
type CallProvider interface {
    ProcessInput(audioChan chan []byte, sttOutputChan chan string)
    ProcessOutput(ttsOutputChan chan string)
    SendClear() error
    Close() error
}
```
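To see what satisfying one of these contracts looks like, here is a hypothetical in-memory `STTProvider` suitable for unit tests. It emits a canned transcript instead of doing any real networking; `Connect` must be called before `SendAudio`.

```go
package providers

import "context"

type TranscriptResult struct {
    Text    string
    IsFinal bool
}

// STTProvider as defined above.
type STTProvider interface {
    Connect(ctx context.Context) error
    SendAudio(audio []byte) error
    ReceiveTranscript() <-chan TranscriptResult
    Close() error
}

// mockSTT is a toy implementation for tests: it emits one canned final
// transcript per audio chunk. Purely illustrative.
type mockSTT struct {
    out chan TranscriptResult
}

var _ STTProvider = (*mockSTT)(nil) // compile-time interface check

func (m *mockSTT) Connect(ctx context.Context) error {
    m.out = make(chan TranscriptResult, 8)
    return nil
}

func (m *mockSTT) SendAudio(audio []byte) error {
    m.out <- TranscriptResult{Text: "hello", IsFinal: true}
    return nil
}

func (m *mockSTT) ReceiveTranscript() <-chan TranscriptResult { return m.out }

func (m *mockSTT) Close() error {
    close(m.out)
    return nil
}
```

The `var _ STTProvider` line makes the compiler verify the interface is fully implemented.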
### Provider Selection
Providers are selected per-agent based on configuration:
```json
{
  "agent": {
    "name": "Customer Support",
    "language": "en",
    "sttProvider": "deepgram",
    "ttsProvider": "cartesia",
    "llmProvider": "openai"
  }
}
```
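One common way to resolve those names into concrete implementations is a registry-based factory. This is an assumed pattern, not the actual Edesy code; `STTProvider` is the interface shown above, and `newDeepgramSTT` is a hypothetical constructor.

```go
package providers

import "fmt"

// STTFactory constructs an STTProvider (interface shown above).
type STTFactory func() (STTProvider, error)

var sttRegistry = map[string]STTFactory{}

// RegisterSTT would typically be called from each provider's init(),
// e.g. RegisterSTT("deepgram", newDeepgramSTT).
func RegisterSTT(name string, f STTFactory) { sttRegistry[name] = f }

// NewSTT resolves the "sttProvider" value from the agent config.
func NewSTT(name string) (STTProvider, error) {
    f, ok := sttRegistry[name]
    if !ok {
        return nil, fmt.Errorf("unknown sttProvider %q", name)
    }
    return f()
}
```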
## Voice Activity Detection (VAD)
VAD determines when the user is speaking vs. silent:
```
Audio Signal:
─────┬──────────────────┬────────────────┬──────────
     │   User Speech    │    Silence     │  Speech
     │                  │                │
VAD: ─────█████████████──────────────────█████─────
          ↑           ↑                  ↑
     Speech Start  Speech End      Speech Start
```
### VAD Parameters
| Parameter | Description | Default |
|---|---|---|
| `threshold` | Speech probability threshold | 0.8 |
| `min_silence_duration` | Silence before end-of-speech | 200ms |
| `volume_threshold` | Minimum audio level | 0.0 |
| `sample_rate` | Audio sample rate | 8000 Hz |
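A minimal end-of-speech detector built from just these parameters might look like the sketch below. The speech probability would normally come from a VAD model; here it is passed in, which is an assumption.

```go
package vad

import "time"

// Config mirrors the parameters in the table above.
type Config struct {
    Threshold          float64       // speech probability cutoff
    MinSilenceDuration time.Duration // silence required to end a turn
    VolumeThreshold    float64       // minimum audio level
    SampleRate         int           // Hz
}

// Defaults matches the table's default values.
var Defaults = Config{
    Threshold:          0.8,
    MinSilenceDuration: 200 * time.Millisecond,
    SampleRate:         8000,
}

// Detector tracks speech/silence transitions across audio chunks.
type Detector struct {
    cfg          Config
    speaking     bool
    silenceSince time.Time
}

func NewDetector(cfg Config) *Detector { return &Detector{cfg: cfg} }

// Process consumes one chunk's speech probability and volume, and reports
// (speechStarted, speechEnded) transitions.
func (d *Detector) Process(speechProb, volume float64, now time.Time) (started, ended bool) {
    isSpeech := speechProb >= d.cfg.Threshold && volume >= d.cfg.VolumeThreshold
    switch {
    case isSpeech && !d.speaking:
        d.speaking = true
        d.silenceSince = time.Time{}
        return true, false
    case isSpeech && d.speaking:
        d.silenceSince = time.Time{} // speech resumed; reset silence timer
    case !isSpeech && d.speaking:
        if d.silenceSince.IsZero() {
            d.silenceSince = now
        } else if now.Sub(d.silenceSince) >= d.cfg.MinSilenceDuration {
            d.speaking = false
            return false, true
        }
    }
    return false, false
}
```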
## Interruptions (Barge-In)
When a user speaks while the bot is talking, we need to handle the interruption gracefully:
```
Timeline ────────────────────────────────────────────────────────▶

Bot speaking:  "Your order is currently being processed..."
                                   │
User speaks:   "When will it arrive?"
                                   │
                                   ▼
                         ┌───────────────────┐
                         │ InterruptionFrame │
                         └───────────────────┘
                                   │
                    ┌──────────────┼──────────────┐
                    ▼              ▼              ▼
               Clear TTS      Cancel LLM     Process new input:
                buffer        generation     "When will it arrive?"
```
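Those three actions map directly onto Go's cancellation primitives. A minimal sketch, assuming the session keeps a `context.CancelFunc` for the in-flight response (a field the real Session may name differently) and that `SendClear` flushes audio queued at the telephony edge:

```go
package session

import (
    "context"
    "log"
)

// Minimal stand-ins for this sketch; the real Session has more fields.
type CallProvider interface{ SendClear() error }

type Session struct {
    STTChannel     chan string
    callProvider   CallProvider
    cancelResponse context.CancelFunc // cancels the in-flight LLM/TTS response
}

// handleInterruption implements the three actions from the diagram above.
func (s *Session) handleInterruption(newInput string) {
    // 1. Cancel LLM generation and TTS synthesis for the current response.
    if s.cancelResponse != nil {
        s.cancelResponse()
    }
    // 2. Clear audio already buffered on the telephony side.
    if err := s.callProvider.SendClear(); err != nil {
        log.Printf("clear failed: %v", err)
    }
    // 3. Feed the interrupting utterance back in as the next user turn.
    s.STTChannel <- newInput
}
```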
### Interruption Handling Strategies
| Strategy | Behavior | Use Case |
|---|---|---|
| `immediate` | Stop instantly on any speech | Fast-paced conversations |
| `sentence` | Complete current sentence | More natural flow |
| `disabled` | Never interrupt | IVR menus, important info |
## Context Management
The LLM needs conversation context to generate relevant responses:
```go
type ConversationContext struct {
    SystemPrompt  string            // Agent personality/instructions
    Messages      []Message         // Conversation history
    Variables     map[string]string // Dynamic variables
    Tools         []Tool            // Available functions
    CurrentIntent string            // Detected user intent
}

type Message struct {
    Role      string // "user", "assistant", "system", "tool"
    Content   string // Message text
    Timestamp time.Time
    ToolCalls []ToolCall // For assistant messages with function calls
}
```
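Before each turn, the context is flattened into the message list that `LLMProvider.Generate` expects: system prompt first, then history. A sketch with the types trimmed to what the helper needs; the truncation policy is an assumption.

```go
package convo

import "time"

// Trimmed copies of the types above, limited to what this sketch uses.
type Message struct {
    Role      string
    Content   string
    Timestamp time.Time
}

type ConversationContext struct {
    SystemPrompt string
    Messages     []Message
}

// BuildMessages flattens the context for the LLM: system prompt first,
// then at most maxTurns of the most recent history (0 means no limit).
func (c *ConversationContext) BuildMessages(maxTurns int) []Message {
    history := c.Messages
    if maxTurns > 0 && len(history) > maxTurns {
        history = history[len(history)-maxTurns:]
    }
    out := make([]Message, 0, len(history)+1)
    out = append(out, Message{Role: "system", Content: c.SystemPrompt})
    return append(out, history...)
}
```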
### Variable Substitution
Variables allow dynamic content in prompts and responses:
```
System Prompt:
  "You are a customer support agent for {{company_name}}.
   The customer's name is {{customer_name}} and their
   order number is {{order_id}}."

Variables:
  {
    "company_name": "Acme Corp",
    "customer_name": "John",
    "order_id": "ORD-12345"
  }

Result:
  "You are a customer support agent for Acme Corp.
   The customer's name is John and their
   order number is ORD-12345."
```
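The substitution itself can be a few lines over `strings.NewReplacer`. A minimal sketch, assuming plain string replacement with no escaping and unknown placeholders left intact:

```go
package prompt

import "strings"

// Substitute expands {{name}} placeholders from vars. Placeholders with
// no matching variable are left as-is; values are not escaped.
func Substitute(template string, vars map[string]string) string {
    pairs := make([]string, 0, len(vars)*2)
    for k, v := range vars {
        pairs = append(pairs, "{{"+k+"}}", v)
    }
    return strings.NewReplacer(pairs...).Replace(template)
}
```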
## Events and Webhooks
The system emits events at key points for external integration:
### Event Types
| Event | Trigger | Payload |
|---|---|---|
| `call.started` | Call connected | call_sid, phone_number, agent_id |
| `call.ended` | Call terminated | duration, disposition, recording_url |
| `transcript.updated` | New transcript | role, content, timestamp |
| `function.called` | Tool invoked | function_name, arguments, result |
| `transfer.initiated` | Transfer started | target_number, reason |
### Webhook Integration
An example `call.ended` webhook payload:

```json
{
  "event": "call.ended",
  "timestamp": "2024-01-15T10:30:00Z",
  "call_sid": "CA123456",
  "data": {
    "duration": 120,
    "disposition": "SUCCESS",
    "recording_url": "https://...",
    "transcript": [...]
  }
}
```
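Delivering these events is an ordinary HTTP POST. A minimal sketch, assuming JSON bodies and no retries or request signing (which a production system would add):

```go
package webhooks

import (
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

// Event mirrors the payload shape above.
type Event struct {
    Event     string         `json:"event"`
    Timestamp time.Time      `json:"timestamp"`
    CallSid   string         `json:"call_sid"`
    Data      map[string]any `json:"data"`
}

// Deliver POSTs the event to the subscriber's URL. Illustrative only:
// real delivery would add retries, per-attempt timeouts, and HMAC signing.
func Deliver(ctx context.Context, url string, ev Event) error {
    body, err := json.Marshal(ev)
    if err != nil {
        return err
    }
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
    if err != nil {
        return err
    }
    req.Header.Set("Content-Type", "application/json")
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode >= 300 {
        return fmt.Errorf("webhook returned %s", resp.Status)
    }
    return nil
}
```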
## Next Steps
Now that you understand the core concepts:
- **Configuration Guide** - Set up your first agent
- **Pipeline Deep Dive** - Understand frame processing
- **Provider Setup** - Configure STT/TTS/LLM providers