# Architecture Overview

Edesy Voice Agent uses a frame-based pipeline architecture inspired by Pipecat, optimized for low-latency, real-time voice interactions.
## High-Level Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                          Client Layer                           │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │   Twilio    │  │   Exotel    │  │    Browser (WebRTC)     │  │
│  └──────┬──────┘  └──────┬──────┘  └───────────┬─────────────┘  │
│         │                │                     │                │
│         └────────────────┼─────────────────────┘                │
│                          │ WebSocket                            │
└──────────────────────────┼──────────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                        Voice Engine (Go)                        │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │                       Frame Pipeline                        │ │
│ │  ┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐            │ │
│ │  │ VAD │ → │ STT │ → │ LLM │ → │ TTS │ → │ Out │            │ │
│ │  └─────┘   └─────┘   └─────┘   └─────┘   └─────┘            │ │
│ │     ↑                             │                         │ │
│ │     └──── Interruption Handler ───┘                         │ │
│ └─────────────────────────────────────────────────────────────┘ │
│                                                                 │
│ ┌──────────────────┐  ┌──────────────────┐  ┌────────────────┐  │
│ │ Session Manager  │  │  Tool Executor   │  │  Event Queue   │  │
│ └──────────────────┘  └──────────────────┘  └────────────────┘  │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                           Data Layer                            │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │    Redis    │  │ PostgreSQL  │  │   Object Storage (S3)   │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```
## Frame-Based Pipeline

The core of our architecture is the frame-based pipeline, where audio and control data flow through the system as discrete frames:
### Frame Types

| Frame Type | Description | Direction |
|---|---|---|
| `InputAudioFrame` | Raw audio from user | Input |
| `TranscriptionFrame` | Text from STT | Internal |
| `LLMResponseFrame` | Text from LLM | Internal |
| `TTSAudioFrame` | Generated speech | Output |
| `InterruptionFrame` | User interruption signal | Control |
| `EndFrame` | Call termination | Control |
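The table above maps naturally onto a small Go type hierarchy. The sketch below is illustrative only; the `Frame` interface and `Direction` enum are assumptions, not the engine's actual definitions:

```go
package main

import "fmt"

// Direction classifies how a frame moves through the pipeline.
type Direction int

const (
	Input Direction = iota
	Internal
	Output
	Control
)

func (d Direction) String() string {
	return [...]string{"Input", "Internal", "Output", "Control"}[d]
}

// Frame is the unit of data flowing between pipeline stages.
type Frame interface {
	Direction() Direction
}

// InputAudioFrame carries raw audio from the user.
type InputAudioFrame struct{ Audio []byte }

// TranscriptionFrame carries text produced by STT.
type TranscriptionFrame struct{ Text string }

// InterruptionFrame signals that the user started speaking over the bot.
type InterruptionFrame struct{}

func (InputAudioFrame) Direction() Direction    { return Input }
func (TranscriptionFrame) Direction() Direction { return Internal }
func (InterruptionFrame) Direction() Direction  { return Control }

func main() {
	frames := []Frame{InputAudioFrame{}, TranscriptionFrame{}, InterruptionFrame{}}
	for _, f := range frames {
		fmt.Printf("%T -> %v\n", f, f.Direction())
	}
}
```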
### Pipeline Flow

```go
// Simplified (non-streaming) pipeline flow
func (p *Pipeline) Process(inputAudio []byte) {
	// 1. VAD detection
	if p.vad.IsSpeech(inputAudio) {
		// 2. Send to STT
		transcript := p.stt.Transcribe(inputAudio)
		// 3. Send to LLM
		response := p.llm.Generate(transcript)
		// 4. Send to TTS
		audio := p.tts.Synthesize(response)
		// 5. Output to caller
		p.output.Send(audio)
	}
}
```
## Voice Activity Detection (VAD)

We use Silero VAD for accurate speech detection:

- **Sample Rate**: 8 kHz (telephony standard)
- **Threshold**: Configurable (default 0.8)
- **Min Silence**: 200 ms of silence before end-of-speech is declared

```go
cfg := silero.DetectorConfig{
	ModelPath:            "./silero_vad.onnx",
	SampleRate:           8000,
	Threshold:            0.8,
	MinSilenceDurationMs: 200,
}
```
## Interruption Handling

When a user interrupts (barge-in), the system:

1. Detects speech via VAD during bot output
2. Clears the TTS buffer immediately
3. Cancels the pending LLM generation
4. Processes the new user input

```
User: "What's my order—"
Bot:  "Your order status is—" [INTERRUPTED]
User: "—when will it arrive?"
Bot:  "Your order will arrive tomorrow by 5 PM."
```
## Provider Abstraction

All providers implement common interfaces:

```go
type STTProvider interface {
	Transcribe(ctx context.Context, audio []byte) (string, error)
	StreamTranscribe(ctx context.Context, audioChan <-chan []byte) (<-chan string, error)
}

type TTSProvider interface {
	Synthesize(ctx context.Context, text string) ([]byte, error)
	StreamSynthesize(ctx context.Context, text string) (<-chan []byte, error)
}

type LLMProvider interface {
	Generate(ctx context.Context, messages []Message) (string, error)
	StreamGenerate(ctx context.Context, messages []Message) (<-chan string, error)
}
```
## Session Management

Each call creates a session with:

- **User Context**: Phone number, variables, history
- **Agent Config**: Prompt, provider settings, tools
- **Call State**: Status, timestamps, recordings

```go
type Session struct {
	User       *User
	Agent      *AgentConfig
	CallSid    string
	StreamSid  string
	StartTime  time.Time
	Transcript []Message
}
```
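A sketch of how sessions might be created and looked up per call; the `SessionManager` shown here is an in-memory stand-in for illustration (the `User` and `Agent` fields are elided):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type Message struct {
	Role, Text string
}

// Session mirrors the struct above (User/Agent elided for brevity).
type Session struct {
	CallSid    string
	StreamSid  string
	StartTime  time.Time
	Transcript []Message
}

// SessionManager keys live sessions by CallSid; safe for concurrent calls.
type SessionManager struct {
	mu       sync.RWMutex
	sessions map[string]*Session
}

func NewSessionManager() *SessionManager {
	return &SessionManager{sessions: make(map[string]*Session)}
}

// Start registers a new session when the telephony stream connects.
func (m *SessionManager) Start(callSid, streamSid string) *Session {
	m.mu.Lock()
	defer m.mu.Unlock()
	s := &Session{CallSid: callSid, StreamSid: streamSid, StartTime: time.Now()}
	m.sessions[callSid] = s
	return s
}

// Get looks up an active session by its call SID.
func (m *SessionManager) Get(callSid string) (*Session, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	s, ok := m.sessions[callSid]
	return s, ok
}

func main() {
	mgr := NewSessionManager()
	mgr.Start("CA123", "MZ456")
	if s, ok := mgr.Get("CA123"); ok {
		fmt.Println(s.StreamSid)
	}
}
```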
## Latency Optimization

### Streaming Everything

- **STT**: Interim results sent as the user speaks
- **LLM**: Token-by-token streaming
- **TTS**: Chunked audio generation
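One common way to connect token-by-token LLM output to chunked TTS is to flush at sentence boundaries, so synthesis can start before the full response exists. A sketch under that assumption (the `sentences` helper is illustrative, not the engine's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// sentences groups a stream of LLM tokens into sentence-sized chunks so
// TTS can begin speaking before the full response is generated.
func sentences(tokens <-chan string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		var b strings.Builder
		for tok := range tokens {
			b.WriteString(tok)
			if strings.ContainsAny(tok, ".?!") { // flush at sentence end
				out <- strings.TrimSpace(b.String())
				b.Reset()
			}
		}
		if rest := strings.TrimSpace(b.String()); rest != "" {
			out <- rest // trailing partial sentence
		}
	}()
	return out
}

func main() {
	tokens := make(chan string, 8)
	for _, t := range []string{"Your ", "order ", "shipped.", " It ", "arrives ", "tomorrow."} {
		tokens <- t
	}
	close(tokens)
	for s := range sentences(tokens) {
		fmt.Println(s) // one TTS request per sentence
	}
}
```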
### Prefetching

- **Greeting Audio**: Pre-synthesized and cached at agent creation
- **Agent Config**: Cached in Redis
### Connection Reuse

- **Provider connections**: Persistent WebSocket/gRPC channels
- **Redis connection pool**: Shared across sessions
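Connection reuse can be sketched as a small pool that only dials when no idle connection exists; the `pool` and `conn` types here are illustrative stand-ins for the engine's persistent WebSocket/gRPC clients:

```go
package main

import (
	"fmt"
	"sync"
)

// conn stands in for a persistent provider connection (WebSocket/gRPC).
type conn struct{ id int }

// pool hands out reusable connections instead of dialing per session.
type pool struct {
	mu    sync.Mutex
	idle  []*conn
	dials int
}

// get returns an idle connection if one exists, otherwise "dials" a new one.
func (p *pool) get() *conn {
	p.mu.Lock()
	defer p.mu.Unlock()
	if n := len(p.idle); n > 0 {
		c := p.idle[n-1]
		p.idle = p.idle[:n-1]
		return c
	}
	p.dials++ // only dial when no idle connection exists
	return &conn{id: p.dials}
}

// put returns a connection to the pool when a session ends.
func (p *pool) put(c *conn) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.idle = append(p.idle, c)
}

func main() {
	p := &pool{}
	c := p.get() // first session dials
	p.put(c)     // returned to the pool on hangup
	_ = p.get()  // second session reuses it
	fmt.Println("dials:", p.dials)
}
```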
## Next Steps

- **Quick Start** - Deploy your first agent
- **Telephony Setup** - Configure phone providers
- **Provider Configuration** - Optimize STT/TTS