# Architecture Overview

Edesy Voice Agent uses a frame-based pipeline architecture inspired by Pipecat, optimized for low-latency, real-time voice interactions.
## High-Level Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                          Client Layer                           │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │   Twilio    │  │   Exotel    │  │    Browser (WebRTC)     │  │
│  └──────┬──────┘  └──────┬──────┘  └───────────┬─────────────┘  │
│         │                │                     │                │
│         └────────────────┼─────────────────────┘                │
│                          │ WebSocket                            │
└──────────────────────────┼──────────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                        Voice Engine (Go)                        │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │                       Frame Pipeline                        │ │
│ │  ┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐            │ │
│ │  │ VAD │ → │ STT │ → │ LLM │ → │ TTS │ → │ Out │            │ │
│ │  └─────┘   └─────┘   └─────┘   └─────┘   └─────┘            │ │
│ │     ↑                             │                         │ │
│ │     └──── Interruption Handler ───┘                         │ │
│ └─────────────────────────────────────────────────────────────┘ │
│                                                                 │
│ ┌──────────────────┐  ┌──────────────────┐  ┌────────────────┐  │
│ │ Session Manager  │  │  Tool Executor   │  │  Event Queue   │  │
│ └──────────────────┘  └──────────────────┘  └────────────────┘  │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                           Data Layer                            │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │    Redis    │  │ PostgreSQL  │  │   Object Storage (S3)   │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```
## Frame-Based Pipeline

The core of our architecture is the frame-based pipeline, where audio and control data flow through the system as discrete frames:
### Frame Types

| Frame Type | Description | Direction |
|---|---|---|
| `InputAudioFrame` | Raw audio from user | Input |
| `TranscriptionFrame` | Text from STT | Internal |
| `LLMResponseFrame` | Text from LLM | Internal |
| `TTSAudioFrame` | Generated speech | Output |
| `InterruptionFrame` | User interruption signal | Control |
| `EndFrame` | Call termination | Control |
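The table above maps naturally onto a small Go type hierarchy. The sketch below is illustrative only; the `Frame` interface and `Direction` enum are assumptions, not the engine's actual definitions:

```go
package main

import "fmt"

// Direction classifies how a frame moves through the pipeline.
type Direction int

const (
	Input Direction = iota
	Internal
	Output
	Control
)

func (d Direction) String() string {
	return [...]string{"Input", "Internal", "Output", "Control"}[d]
}

// Frame is the unit of data flowing between pipeline stages.
type Frame interface {
	Direction() Direction
}

// InputAudioFrame carries raw audio from the user.
type InputAudioFrame struct{ Audio []byte }

// TranscriptionFrame carries text produced by STT.
type TranscriptionFrame struct{ Text string }

// InterruptionFrame signals that the user started speaking over the bot.
type InterruptionFrame struct{}

func (InputAudioFrame) Direction() Direction    { return Input }
func (TranscriptionFrame) Direction() Direction { return Internal }
func (InterruptionFrame) Direction() Direction  { return Control }

func main() {
	frames := []Frame{InputAudioFrame{}, TranscriptionFrame{}, InterruptionFrame{}}
	for _, f := range frames {
		fmt.Printf("%T -> %v\n", f, f.Direction())
	}
}
```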
### Pipeline Flow

```go
// Simplified (non-streaming) pipeline flow
func (p *Pipeline) Process(inputAudio []byte) {
	// 1. VAD detection
	if p.vad.IsSpeech(inputAudio) {
		// 2. Send to STT
		transcript := p.stt.Transcribe(inputAudio)
		// 3. Send to LLM
		response := p.llm.Generate(transcript)
		// 4. Send to TTS
		audio := p.tts.Synthesize(response)
		// 5. Output to caller
		p.output.Send(audio)
	}
}
```
## Voice Activity Detection (VAD)

We use Silero VAD for accurate speech detection:

- **Sample Rate**: 8 kHz (telephony standard)
- **Threshold**: Configurable (default 0.8)
- **Min Silence**: 200 ms of silence before end-of-speech is declared

```go
cfg := silero.DetectorConfig{
	ModelPath:            "./silero_vad.onnx",
	SampleRate:           8000,
	Threshold:            0.8,
	MinSilenceDurationMs: 200,
}
```
## Interruption Handling

When a user interrupts (barge-in), the system:

1. Detects speech via VAD during bot output
2. Clears the TTS buffer immediately
3. Cancels the pending LLM generation
4. Processes the new user input

```
User: "What's my order—"
Bot:  "Your order status is—" [INTERRUPTED]
User: "—when will it arrive?"
Bot:  "Your order will arrive tomorrow by 5 PM."
```
## Provider Abstraction

All providers implement common interfaces:

```go
type STTProvider interface {
	Transcribe(ctx context.Context, audio []byte) (string, error)
	StreamTranscribe(ctx context.Context, audioChan <-chan []byte) (<-chan string, error)
}

type TTSProvider interface {
	Synthesize(ctx context.Context, text string) ([]byte, error)
	StreamSynthesize(ctx context.Context, text string) (<-chan []byte, error)
}

type LLMProvider interface {
	Generate(ctx context.Context, messages []Message) (string, error)
	StreamGenerate(ctx context.Context, messages []Message) (<-chan string, error)
}
```
## Session Management

Each call creates a session with:

- **User Context**: Phone number, variables, history
- **Agent Config**: Prompt, provider settings, tools
- **Call State**: Status, timestamps, recordings

```go
type Session struct {
	User       *User
	Agent      *AgentConfig
	CallSid    string
	StreamSid  string
	StartTime  time.Time
	Transcript []Message
}
```
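A sketch of how sessions might be created and looked up per call; the `SessionManager` shown here is an in-memory stand-in for illustration (the `User` and `Agent` fields are elided):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type Message struct {
	Role, Text string
}

// Session mirrors the struct above (User/Agent elided for brevity).
type Session struct {
	CallSid    string
	StreamSid  string
	StartTime  time.Time
	Transcript []Message
}

// SessionManager keys live sessions by CallSid; safe for concurrent calls.
type SessionManager struct {
	mu       sync.RWMutex
	sessions map[string]*Session
}

func NewSessionManager() *SessionManager {
	return &SessionManager{sessions: make(map[string]*Session)}
}

// Start registers a new session when the telephony stream connects.
func (m *SessionManager) Start(callSid, streamSid string) *Session {
	m.mu.Lock()
	defer m.mu.Unlock()
	s := &Session{CallSid: callSid, StreamSid: streamSid, StartTime: time.Now()}
	m.sessions[callSid] = s
	return s
}

// Get looks up an active session by its call SID.
func (m *SessionManager) Get(callSid string) (*Session, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	s, ok := m.sessions[callSid]
	return s, ok
}

func main() {
	mgr := NewSessionManager()
	mgr.Start("CA123", "MZ456")
	if s, ok := mgr.Get("CA123"); ok {
		fmt.Println(s.StreamSid)
	}
}
```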
## Latency Optimization

### Streaming Everything

- **STT**: Interim results sent as the user speaks
- **LLM**: Token-by-token streaming
- **TTS**: Chunked audio generation
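One common way to connect token-by-token LLM output to chunked TTS is to flush at sentence boundaries, so synthesis can start before the full response exists. A sketch under that assumption (the `sentences` helper is illustrative, not the engine's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// sentences groups a stream of LLM tokens into sentence-sized chunks so
// TTS can begin speaking before the full response is generated.
func sentences(tokens <-chan string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		var b strings.Builder
		for tok := range tokens {
			b.WriteString(tok)
			if strings.ContainsAny(tok, ".?!") { // flush at sentence end
				out <- strings.TrimSpace(b.String())
				b.Reset()
			}
		}
		if rest := strings.TrimSpace(b.String()); rest != "" {
			out <- rest // trailing partial sentence
		}
	}()
	return out
}

func main() {
	tokens := make(chan string, 8)
	for _, t := range []string{"Your ", "order ", "shipped.", " It ", "arrives ", "tomorrow."} {
		tokens <- t
	}
	close(tokens)
	for s := range sentences(tokens) {
		fmt.Println(s) // one TTS request per sentence
	}
}
```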
### Prefetching

- **Greeting Audio**: Pre-synthesized and cached at agent creation
- **Agent Config**: Cached in Redis
### Connection Reuse

- **Provider connections**: Persistent WebSocket/gRPC channels
- **Redis connection pool**: Shared across sessions
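Connection reuse can be sketched as a small pool that only dials when no idle connection exists; the `pool` and `conn` types here are illustrative stand-ins for the engine's persistent WebSocket/gRPC clients:

```go
package main

import (
	"fmt"
	"sync"
)

// conn stands in for a persistent provider connection (WebSocket/gRPC).
type conn struct{ id int }

// pool hands out reusable connections instead of dialing per session.
type pool struct {
	mu    sync.Mutex
	idle  []*conn
	dials int
}

// get returns an idle connection if one exists, otherwise "dials" a new one.
func (p *pool) get() *conn {
	p.mu.Lock()
	defer p.mu.Unlock()
	if n := len(p.idle); n > 0 {
		c := p.idle[n-1]
		p.idle = p.idle[:n-1]
		return c
	}
	p.dials++ // only dial when no idle connection exists
	return &conn{id: p.dials}
}

// put returns a connection to the pool when a session ends.
func (p *pool) put(c *conn) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.idle = append(p.idle, c)
}

func main() {
	p := &pool{}
	c := p.get() // first session dials
	p.put(c)     // returned to the pool on hangup
	_ = p.get()  // second session reuses it
	fmt.Println("dials:", p.dials)
}
```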
## Next Steps

- **Quick Start** - Deploy your first agent
- **Telephony Setup** - Configure phone providers
- **Provider Configuration** - Optimize STT/TTS