# Core Concepts
Before diving into implementation, it's important to understand the core concepts that make Edesy Voice Agent work.
## The Voice Agent Pipeline
A voice agent is essentially a real-time processing pipeline that transforms speech into intelligent responses:
```
┌──────────────────────────────────────────────────────────────────┐
│                       Voice Agent Pipeline                       │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐  ┌──────┐ │
│  │  Audio  │──▶│   VAD   │──▶│   STT   │──▶│   LLM   │─▶│ TTS  │ │
│  │  Input  │   │         │   │         │   │         │  │      │ │
│  └─────────┘   └────┬────┘   └─────────┘   └─────────┘  └──┬───┘ │
│                     │                                      │     │
│                     │             Interruption             │     │
│                     └──────────────────────────────────────┤     │
│                                                            ▼     │
│                                                     ┌──────────┐ │
│                                                     │  Audio   │ │
│                                                     │  Output  │ │
│                                                     └──────────┘ │
└──────────────────────────────────────────────────────────────────┘
```
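Each stage can be thought of as a goroutine reading from one channel and writing to the next. The sketch below shows that shape with a generic `stage` function; everything in it is illustrative, not the actual Edesy API.

```go
package main

import (
    "context"
    "fmt"
)

// stage is the generic shape of one pipeline step: read values from in,
// transform them with fn, and forward the result on out.
func stage[In, Out any](ctx context.Context, in <-chan In, out chan<- Out, fn func(In) Out) {
    defer close(out)
    for {
        select {
        case <-ctx.Done():
            return
        case v, ok := <-in:
            if !ok {
                return
            }
            out <- fn(v)
        }
    }
}

func main() {
    ctx := context.Background()
    audio := make(chan []byte, 4)
    text := make(chan string, 4)

    // A stand-in "STT" stage that just reports chunk sizes.
    go stage(ctx, audio, text, func(b []byte) string {
        return fmt.Sprintf("transcribed %d bytes", len(b))
    })

    audio <- make([]byte, 320) // one 20ms chunk of 8kHz 16-bit PCM
    close(audio)
    fmt.Println(<-text)
}
```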
## Frames
Frames are the fundamental unit of data in the pipeline. Everything - audio, text, control signals - flows through the system as frames.
### Frame Types
| Frame Type | Purpose | Example |
|---|---|---|
| `InputAudioFrame` | Raw audio from user | 8kHz PCM audio bytes |
| `TranscriptionFrame` | Text from STT | "What is my order status?" |
| `InterimTranscriptionFrame` | Partial STT result | "What is my" |
| `LLMResponseFrame` | Generated response | "Your order is on the way" |
| `TTSAudioFrame` | Synthesized speech | Audio bytes |
| `InterruptionFrame` | User barge-in signal | Cancel current output |
| `EndFrame` | Call termination | Reason, disposition |
| `FunctionCallFrame` | Tool invocation | Function name, arguments |
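In Go, frames like these are commonly modeled as a marker interface with one concrete struct per frame type, so processors can type-switch on them. The definitions below are a sketch of that pattern, not the actual Edesy types.

```go
package frames

import "time"

// Frame is a marker interface; every payload in the pipeline implements it.
// These definitions are illustrative, not the actual Edesy frame types.
type Frame interface {
    FrameType() string
}

type InputAudioFrame struct {
    Audio     []byte
    Timestamp time.Duration
}

func (InputAudioFrame) FrameType() string { return "input_audio" }

type TranscriptionFrame struct {
    Text    string
    IsFinal bool
}

func (TranscriptionFrame) FrameType() string { return "transcription" }

// A processor switches on the concrete type to route each frame.
func handle(f Frame) {
    switch fr := f.(type) {
    case InputAudioFrame:
        _ = fr.Audio // forward to VAD/STT
    case TranscriptionFrame:
        _ = fr.Text // forward to LLM
    }
}
```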
### Frame Flow Example
User says: "What's the status of order twelve thirty four?"
```
Frame 1:   InputAudioFrame           { audio: [bytes...], timestamp: 0ms }
Frame 2:   InputAudioFrame           { audio: [bytes...], timestamp: 20ms }
...
Frame N:   InterimTranscriptionFrame { text: "What's the status" }
Frame N+1: InterimTranscriptionFrame { text: "What's the status of order" }
Frame N+2: TranscriptionFrame        { text: "What's the status of order 1234?", is_final: true }
Frame N+3: LLMResponseFrame          { text: "Let me check that for you..." }
Frame N+4: FunctionCallFrame         { name: "get_order_status", args: { order_id: "1234" } }
Frame N+5: LLMResponseFrame          { text: "Your order 1234 is out for delivery..." }
Frame N+6: TTSAudioFrame             { audio: [bytes...] }
```
## Sessions
A session represents a single call/conversation with state:
```go
type Session struct {
    // Identity
    CallSid   string // Unique call identifier
    StreamSid string // Audio stream identifier
    UserIdPin string // Internal reference

    // Configuration
    Agent     *AgentConfig
    Language  string
    Providers ProviderConfig

    // State
    Transcript []Message
    Variables  map[string]string
    StartTime  time.Time
    Status     CallStatus

    // Channels
    STTChannel chan string
    TTSChannel chan string
    EventChan  chan Event
}
```
### Session Lifecycle
```
1. INITIALIZING
   └── WebSocket connected, loading agent config
2. GREETING
   └── Playing initial greeting message
3. LISTENING
   └── Waiting for user speech
4. PROCESSING
   └── STT → LLM → TTS pipeline active
5. SPEAKING
   └── Playing TTS audio to user
6. TRANSFERRING (optional)
   └── Connecting to human agent
7. ENDING
   └── Cleanup, save recordings, log disposition
```
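These states map naturally onto the `CallStatus` field of the Session struct. A plausible enum sketch (the constant names are assumed, not confirmed):

```go
package session

// CallStatus enumerates the lifecycle states above. The constant names
// are illustrative; they mirror the Session.Status field.
type CallStatus int

const (
    StatusInitializing CallStatus = iota
    StatusGreeting
    StatusListening
    StatusProcessing
    StatusSpeaking
    StatusTransferring
    StatusEnding
)

func (s CallStatus) String() string {
    return [...]string{
        "INITIALIZING", "GREETING", "LISTENING",
        "PROCESSING", "SPEAKING", "TRANSFERRING", "ENDING",
    }[s]
}
```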
## Providers
Providers are pluggable components that handle specific tasks:
### Provider Interface
Each provider type implements a standard interface:
```go
// STT Provider
type STTProvider interface {
    Connect(ctx context.Context) error
    SendAudio(audio []byte) error
    ReceiveTranscript() <-chan TranscriptResult
    Close() error
}

// TTS Provider
type TTSProvider interface {
    Synthesize(ctx context.Context, text string) (<-chan []byte, error)
    Close() error
}

// LLM Provider
type LLMProvider interface {
    Generate(ctx context.Context, messages []Message, tools []Tool) (<-chan string, error)
    Close() error
}

// Call Provider (Telephony)
type CallProvider interface {
    ProcessInput(audioChan chan []byte, sttOutputChan chan string)
    ProcessOutput(ttsOutputChan chan string)
    SendClear() error
    Close() error
}
```
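To see what satisfying one of these contracts looks like, here is a hypothetical in-memory `STTProvider` suitable for unit tests. It emits a canned transcript instead of doing any real networking; `Connect` must be called before `SendAudio`.

```go
package providers

import "context"

type TranscriptResult struct {
    Text    string
    IsFinal bool
}

// STTProvider as defined above.
type STTProvider interface {
    Connect(ctx context.Context) error
    SendAudio(audio []byte) error
    ReceiveTranscript() <-chan TranscriptResult
    Close() error
}

// mockSTT is a toy implementation for tests: it emits one canned final
// transcript per audio chunk. Purely illustrative.
type mockSTT struct {
    out chan TranscriptResult
}

var _ STTProvider = (*mockSTT)(nil) // compile-time interface check

func (m *mockSTT) Connect(ctx context.Context) error {
    m.out = make(chan TranscriptResult, 8)
    return nil
}

func (m *mockSTT) SendAudio(audio []byte) error {
    m.out <- TranscriptResult{Text: "hello", IsFinal: true}
    return nil
}

func (m *mockSTT) ReceiveTranscript() <-chan TranscriptResult { return m.out }

func (m *mockSTT) Close() error {
    close(m.out)
    return nil
}
```

The `var _ STTProvider` line makes the compiler verify the interface is fully implemented.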
### Provider Selection
Providers are selected per-agent based on configuration:
```json
{
  "agent": {
    "name": "Customer Support",
    "language": "en",
    "sttProvider": "deepgram",
    "ttsProvider": "cartesia",
    "llmProvider": "openai"
  }
}
```
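One common way to resolve those names into concrete implementations is a registry-based factory. This is an assumed pattern, not the actual Edesy code; `STTProvider` is the interface shown above, and `newDeepgramSTT` is a hypothetical constructor.

```go
package providers

import "fmt"

// STTFactory constructs an STTProvider (interface shown above).
type STTFactory func() (STTProvider, error)

var sttRegistry = map[string]STTFactory{}

// RegisterSTT would typically be called from each provider's init(),
// e.g. RegisterSTT("deepgram", newDeepgramSTT).
func RegisterSTT(name string, f STTFactory) { sttRegistry[name] = f }

// NewSTT resolves the "sttProvider" value from the agent config.
func NewSTT(name string) (STTProvider, error) {
    f, ok := sttRegistry[name]
    if !ok {
        return nil, fmt.Errorf("unknown sttProvider %q", name)
    }
    return f()
}
```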
## Voice Activity Detection (VAD)
VAD determines when the user is speaking vs. silent:
```
Audio Signal:
─────┬──────────────────┬────────────────┬──────────
     │   User Speech    │    Silence     │  Speech
     │                  │                │
VAD: ─────█████████████──────────────────█████─────
          ↑           ↑                  ↑
     Speech Start  Speech End      Speech Start
```
### VAD Parameters
| Parameter | Description | Default |
|---|---|---|
| `threshold` | Speech probability threshold | 0.8 |
| `min_silence_duration` | Silence before end-of-speech | 200ms |
| `volume_threshold` | Minimum audio level | 0.0 |
| `sample_rate` | Audio sample rate | 8000 Hz |
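A minimal end-of-speech detector built from just these parameters might look like the sketch below. The speech probability would normally come from a VAD model; here it is passed in, which is an assumption.

```go
package vad

import "time"

// Config mirrors the parameters in the table above.
type Config struct {
    Threshold          float64       // speech probability cutoff
    MinSilenceDuration time.Duration // silence required to end a turn
    VolumeThreshold    float64       // minimum audio level
    SampleRate         int           // Hz
}

// Defaults matches the table's default values.
var Defaults = Config{
    Threshold:          0.8,
    MinSilenceDuration: 200 * time.Millisecond,
    SampleRate:         8000,
}

// Detector tracks speech/silence transitions across audio chunks.
type Detector struct {
    cfg          Config
    speaking     bool
    silenceSince time.Time
}

func NewDetector(cfg Config) *Detector { return &Detector{cfg: cfg} }

// Process consumes one chunk's speech probability and volume, and reports
// (speechStarted, speechEnded) transitions.
func (d *Detector) Process(speechProb, volume float64, now time.Time) (started, ended bool) {
    isSpeech := speechProb >= d.cfg.Threshold && volume >= d.cfg.VolumeThreshold
    switch {
    case isSpeech && !d.speaking:
        d.speaking = true
        d.silenceSince = time.Time{}
        return true, false
    case isSpeech && d.speaking:
        d.silenceSince = time.Time{} // speech resumed; reset silence timer
    case !isSpeech && d.speaking:
        if d.silenceSince.IsZero() {
            d.silenceSince = now
        } else if now.Sub(d.silenceSince) >= d.cfg.MinSilenceDuration {
            d.speaking = false
            return false, true
        }
    }
    return false, false
}
```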
## Interruptions (Barge-In)
When a user speaks while the bot is talking, we need to handle the interruption gracefully:
```
Timeline ────────────────────────────────────────────────────────▶

Bot speaking:  "Your order is currently being processed..."
                                   │
User speaks:   "When will it arrive?"
                                   │
                                   ▼
                         ┌───────────────────┐
                         │ InterruptionFrame │
                         └───────────────────┘
                                   │
                    ┌──────────────┼──────────────┐
                    ▼              ▼              ▼
               Clear TTS      Cancel LLM     Process new input:
                buffer        generation     "When will it arrive?"
```
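Those three actions map directly onto Go's cancellation primitives. A minimal sketch, assuming the session keeps a `context.CancelFunc` for the in-flight response (a field the real Session may name differently) and that `SendClear` flushes audio queued at the telephony edge:

```go
package session

import (
    "context"
    "log"
)

// Minimal stand-ins for this sketch; the real Session has more fields.
type CallProvider interface{ SendClear() error }

type Session struct {
    STTChannel     chan string
    callProvider   CallProvider
    cancelResponse context.CancelFunc // cancels the in-flight LLM/TTS response
}

// handleInterruption implements the three actions from the diagram above.
func (s *Session) handleInterruption(newInput string) {
    // 1. Cancel LLM generation and TTS synthesis for the current response.
    if s.cancelResponse != nil {
        s.cancelResponse()
    }
    // 2. Clear audio already buffered on the telephony side.
    if err := s.callProvider.SendClear(); err != nil {
        log.Printf("clear failed: %v", err)
    }
    // 3. Feed the interrupting utterance back in as the next user turn.
    s.STTChannel <- newInput
}
```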
### Interruption Handling Strategies
| Strategy | Behavior | Use Case |
|---|---|---|
| `immediate` | Stop instantly on any speech | Fast-paced conversations |
| `sentence` | Complete current sentence | More natural flow |
| `disabled` | Never interrupt | IVR menus, important info |
## Context Management
The LLM needs conversation context to generate relevant responses:
```go
type ConversationContext struct {
    SystemPrompt  string            // Agent personality/instructions
    Messages      []Message         // Conversation history
    Variables     map[string]string // Dynamic variables
    Tools         []Tool            // Available functions
    CurrentIntent string            // Detected user intent
}

type Message struct {
    Role      string // "user", "assistant", "system", "tool"
    Content   string // Message text
    Timestamp time.Time
    ToolCalls []ToolCall // For assistant messages with function calls
}
```
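Before each turn, the context is flattened into the message list that `LLMProvider.Generate` expects: system prompt first, then history. A sketch with the types trimmed to what the helper needs; the truncation policy is an assumption.

```go
package convo

import "time"

// Trimmed copies of the types above, limited to what this sketch uses.
type Message struct {
    Role      string
    Content   string
    Timestamp time.Time
}

type ConversationContext struct {
    SystemPrompt string
    Messages     []Message
}

// BuildMessages flattens the context for the LLM: system prompt first,
// then at most maxTurns of the most recent history (0 means no limit).
func (c *ConversationContext) BuildMessages(maxTurns int) []Message {
    history := c.Messages
    if maxTurns > 0 && len(history) > maxTurns {
        history = history[len(history)-maxTurns:]
    }
    out := make([]Message, 0, len(history)+1)
    out = append(out, Message{Role: "system", Content: c.SystemPrompt})
    return append(out, history...)
}
```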
### Variable Substitution
Variables allow dynamic content in prompts and responses:
```
System Prompt:
  "You are a customer support agent for {{company_name}}.
   The customer's name is {{customer_name}} and their
   order number is {{order_id}}."

Variables:
  {
    "company_name": "Acme Corp",
    "customer_name": "John",
    "order_id": "ORD-12345"
  }

Result:
  "You are a customer support agent for Acme Corp.
   The customer's name is John and their
   order number is ORD-12345."
```
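The substitution itself can be a few lines over `strings.NewReplacer`. A minimal sketch, assuming plain string replacement with no escaping and unknown placeholders left intact:

```go
package prompt

import "strings"

// Substitute expands {{name}} placeholders from vars. Placeholders with
// no matching variable are left as-is; values are not escaped.
func Substitute(template string, vars map[string]string) string {
    pairs := make([]string, 0, len(vars)*2)
    for k, v := range vars {
        pairs = append(pairs, "{{"+k+"}}", v)
    }
    return strings.NewReplacer(pairs...).Replace(template)
}
```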
## Events and Webhooks
The system emits events at key points for external integration:
### Event Types
| Event | Trigger | Payload |
|---|---|---|
| `call.started` | Call connected | call_sid, phone_number, agent_id |
| `call.ended` | Call terminated | duration, disposition, recording_url |
| `transcript.updated` | New transcript | role, content, timestamp |
| `function.called` | Tool invoked | function_name, arguments, result |
| `transfer.initiated` | Transfer started | target_number, reason |
### Webhook Integration
An example `call.ended` webhook payload:

```json
{
  "event": "call.ended",
  "timestamp": "2024-01-15T10:30:00Z",
  "call_sid": "CA123456",
  "data": {
    "duration": 120,
    "disposition": "SUCCESS",
    "recording_url": "https://...",
    "transcript": [...]
  }
}
```
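Delivering these events is an ordinary HTTP POST. A minimal sketch, assuming JSON bodies and no retries or request signing (which a production system would add):

```go
package webhooks

import (
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

// Event mirrors the payload shape above.
type Event struct {
    Event     string         `json:"event"`
    Timestamp time.Time      `json:"timestamp"`
    CallSid   string         `json:"call_sid"`
    Data      map[string]any `json:"data"`
}

// Deliver POSTs the event to the subscriber's URL. Illustrative only:
// real delivery would add retries, per-attempt timeouts, and HMAC signing.
func Deliver(ctx context.Context, url string, ev Event) error {
    body, err := json.Marshal(ev)
    if err != nil {
        return err
    }
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
    if err != nil {
        return err
    }
    req.Header.Set("Content-Type", "application/json")
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode >= 300 {
        return fmt.Errorf("webhook returned %s", resp.Status)
    }
    return nil
}
```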
## Next Steps
Now that you understand the core concepts:
- **Configuration Guide** - Set up your first agent
- **Pipeline Deep Dive** - Understand frame processing
- **Provider Setup** - Configure STT/TTS/LLM providers