Voice Activity Detection (VAD)
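VAD is the first stage of audio processing: it determines when the user is speaking and when they have finished. Getting VAD right is crucial for natural conversations, because every extra moment spent waiting for end-of-speech is added directly to the bot's response delay.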
Why VAD Matters
Without proper VAD:
─────────────────────────────────────────────────────────────
User: "I want to check my order status" [background noise...]
Bot: [Waits for noise to stop... 3 seconds pass]
"I'd be happy to help! What's your order number?"
User: [Frustrated by delay]
With optimized VAD:
─────────────────────────────────────────────────────────────
User: "I want to check my order status"
Bot: [Detects end of speech in 200ms]
"I'd be happy to help! What's your order number?"
User: [Smooth conversation]
Silero VAD
We use Silero VAD, a neural network-based voice activity detector that runs locally.
Advantages
- High accuracy: 99%+ on clean speech
- Low latency: ~10ms per frame
- No cloud dependency: Runs entirely on-device
- Small footprint: ~2MB model file
How It Works
Audio Input (8kHz, 16-bit PCM)
          │
          ▼
┌───────────────────┐
│ Silero ONNX       │
│ Neural Network    │
│                   │
│ Input: 512 samples│
│ Output: 0.0-1.0   │
│ (speech prob)     │
└───────────────────┘
          │
          ▼
  Speech Probability
          │
      ┌───┴───┐
      │       │
    <0.8    ≥0.8
      │       │
   Silence  Speech
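The flow above can be sketched in a few lines of Go. This is a minimal illustration, not the detector's actual code: `frameSize`, `splitFrames`, and `classify` are hypothetical names, and the probabilities are hard-coded where the model's output would normally go.

```go
package main

import "fmt"

const (
	frameSize = 512 // samples per inference window, as in the diagram
	threshold = 0.8 // speech probability cut-off, as in the diagram
)

// splitFrames cuts PCM samples into full frameSize windows.
// A trailing partial window is dropped here; a real pipeline would buffer it
// until more audio arrives.
func splitFrames(samples []int16) [][]int16 {
	var frames [][]int16
	for start := 0; start+frameSize <= len(samples); start += frameSize {
		frames = append(frames, samples[start:start+frameSize])
	}
	return frames
}

// classify maps a speech probability to one of the two branches in the diagram.
func classify(prob float32) string {
	if prob >= threshold {
		return "speech"
	}
	return "silence"
}

func main() {
	samples := make([]int16, 1300) // two full frames plus 276 leftover samples
	fmt.Println(len(splitFrames(samples)), "full frames")

	// Stand-in probabilities; in the real pipeline these come from the model.
	for _, p := range []float32{0.12, 0.79, 0.8, 0.93} {
		fmt.Printf("prob=%.2f -> %s\n", p, classify(p))
	}
}
```

At 8 kHz a 512-sample window covers 64 ms of audio, so each classification decision describes roughly that much of the call.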
Configuration
Basic Configuration
cfg := silero.DetectorConfig{
    ModelPath:            "./silero_vad.onnx",
    SampleRate:           8000, // Match telephony audio
    Threshold:            0.8,  // Speech probability threshold
    MinSilenceDurationMs: 200,  // Silence before end-of-speech
    LogLevel:             silero.LogLevelError,
}
detector, err := silero.NewDetector(cfg)
if err != nil {
    // Handle the error before using detector
    return err
}
Configuration Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| ModelPath | string | required | Path to silero_vad.onnx |
| SampleRate | int | 8000 | Audio sample rate (Hz) |
| Threshold | float32 | 0.8 | Speech probability threshold (0.0-1.0) |
| MinSilenceDurationMs | int | 200 | Silence duration (ms) before end-of-speech |
| VolumeThreshold | float32 | 0.0 | Minimum audio level for processing (normalized RMS, 0.0-1.0) |
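The defaults and ranges in the table can be checked before a detector is created. The sketch below is an assumption-laden illustration: `vadSettings`, `withDefaults`, and `validate` are hypothetical helpers that mirror the table, not part of the Silero bindings.

```go
package main

import (
	"errors"
	"fmt"
)

// vadSettings is a local stand-in mirroring the parameter table;
// it is not the library's DetectorConfig type.
type vadSettings struct {
	ModelPath            string
	SampleRate           int
	Threshold            float32
	MinSilenceDurationMs int
	VolumeThreshold      float32
}

// withDefaults fills zero-valued optional fields with the table's defaults.
func withDefaults(s vadSettings) vadSettings {
	if s.SampleRate == 0 {
		s.SampleRate = 8000
	}
	if s.Threshold == 0 {
		s.Threshold = 0.8
	}
	if s.MinSilenceDurationMs == 0 {
		s.MinSilenceDurationMs = 200
	}
	return s
}

// validate rejects settings that the table rules out.
func validate(s vadSettings) error {
	if s.ModelPath == "" {
		return errors.New("model path is required")
	}
	if s.Threshold < 0 || s.Threshold > 1 {
		return errors.New("threshold must be within 0.0-1.0")
	}
	if s.MinSilenceDurationMs < 0 {
		return errors.New("minimum silence duration must not be negative")
	}
	return nil
}

func main() {
	s := withDefaults(vadSettings{ModelPath: "./silero_vad.onnx"})
	fmt.Printf("%+v, err=%v\n", s, validate(s))
	fmt.Println("empty settings:", validate(vadSettings{}))
}
```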
Threshold Tuning
Threshold = 0.5 (too sensitive)
─────────────────────────────────────────────────────────────
Detects: speech, breathing, background noise, air conditioning
Result: False positives, bot interrupts itself
Threshold = 0.9 (too strict)
─────────────────────────────────────────────────────────────
Detects: Only loud, clear speech
Result: Misses quiet speakers, soft endings
Threshold = 0.8 (balanced)
─────────────────────────────────────────────────────────────
Detects: Normal speech, ignores most noise
Result: Good for most telephony use cases
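To see the effect of the three thresholds side by side, the following sketch counts how many frames of one probability sequence are treated as speech. The probabilities are invented for illustration and `countSpeechFrames` is a hypothetical helper.

```go
package main

import "fmt"

// countSpeechFrames returns how many frame probabilities meet the threshold.
func countSpeechFrames(probs []float32, threshold float32) int {
	n := 0
	for _, p := range probs {
		if p >= threshold {
			n++
		}
	}
	return n
}

func main() {
	// Invented values standing in for: breathing, background noise,
	// a soft word ending, and two frames of clear speech.
	probs := []float32{0.55, 0.62, 0.85, 0.95, 0.97}

	for _, t := range []float32{0.5, 0.8, 0.9} {
		fmt.Printf("threshold=%.1f: %d of %d frames count as speech\n",
			t, countSpeechFrames(probs, t), len(probs))
	}
}
```

With these sample values the lowest threshold accepts the breathing and noise frames, while the highest one drops the soft word ending.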
Per-Agent Configuration
Configure VAD per agent based on use case:
{
  "agent": {
    "name": "Call Center Agent",
    "vadConfig": {
      "threshold": 0.75,
      "minSilenceDurationMs": 250,
      "volumeThreshold": 0.02
    }
  }
}
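One way to load such a document in Go is with the standard `encoding/json` package. The struct and function names below (`agentFile`, `vadConfig`, `parseAgent`) are assumptions made for this sketch; only the JSON keys come from the example above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// vadConfig mirrors the "vadConfig" object in the JSON example.
type vadConfig struct {
	Threshold            float32 `json:"threshold"`
	MinSilenceDurationMs int     `json:"minSilenceDurationMs"`
	VolumeThreshold      float32 `json:"volumeThreshold"`
}

// agentFile mirrors the top-level document.
type agentFile struct {
	Agent struct {
		Name      string    `json:"name"`
		VADConfig vadConfig `json:"vadConfig"`
	} `json:"agent"`
}

// parseAgent decodes an agent definition from raw JSON.
func parseAgent(data []byte) (agentFile, error) {
	var f agentFile
	err := json.Unmarshal(data, &f)
	return f, err
}

func main() {
	raw := []byte(`{
	  "agent": {
	    "name": "Call Center Agent",
	    "vadConfig": {"threshold": 0.75, "minSilenceDurationMs": 250, "volumeThreshold": 0.02}
	  }
	}`)

	f, err := parseAgent(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%s: %+v\n", f.Agent.Name, f.Agent.VADConfig)
}
```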
Speech Detection Events
The VAD emits events that drive the pipeline:
type VADEvent struct {
    Type      VADEventType
    Timestamp time.Time
    Duration  time.Duration
}
type VADEventType int
const (
    SpeechStart  VADEventType = iota // User started speaking
    SpeechEnd                        // User stopped speaking
    Interruption                     // User spoke during bot output
)
Event Flow Example
Timeline (ms): 0 100 200 300 400 500 600 700 800
User Audio: [─────████████████████████████─────────────────────]
│ │
Events: SpeechStart SpeechEnd
(100ms) (500ms)
+ 200ms silence
= 700ms event fired
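The timeline can be reproduced with a small state machine that tracks accumulated silence. This is a sketch under stated assumptions, not the detector's real implementation: `tracker`, `event`, and `detect` are hypothetical names, and the 100 ms frame length is chosen only so the numbers line up with the timeline.

```go
package main

import "fmt"

// event is a simplified stand-in for VADEvent.
type event struct {
	Kind string // "SpeechStart" or "SpeechEnd"
	AtMs int    // stream time at which the event fires
}

// tracker turns per-frame speech probabilities into start and end events.
type tracker struct {
	threshold    float32
	minSilenceMs int
	frameMs      int
	speaking     bool
	silenceMs    int
	nowMs        int
}

// push consumes one frame's probability and returns an event if one fires.
func (t *tracker) push(prob float32) *event {
	frameStart := t.nowMs
	t.nowMs += t.frameMs

	if prob >= t.threshold {
		t.silenceMs = 0 // any speech frame resets the silence counter
		if !t.speaking {
			t.speaking = true
			return &event{Kind: "SpeechStart", AtMs: frameStart}
		}
		return nil
	}
	if !t.speaking {
		return nil
	}
	t.silenceMs += t.frameMs
	if t.silenceMs >= t.minSilenceMs {
		t.speaking = false
		t.silenceMs = 0
		return &event{Kind: "SpeechEnd", AtMs: t.nowMs}
	}
	return nil
}

// detect runs a whole probability sequence through a fresh tracker.
func detect(probs []float32, threshold float32, minSilenceMs, frameMs int) []event {
	t := &tracker{threshold: threshold, minSilenceMs: minSilenceMs, frameMs: frameMs}
	var events []event
	for _, p := range probs {
		if e := t.push(p); e != nil {
			events = append(events, *e)
		}
	}
	return events
}

func main() {
	// One value per 100 ms: silence, four speech frames, then silence.
	probs := []float32{0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1}
	for _, e := range detect(probs, 0.8, 200, 100) {
		fmt.Printf("%s at %dms\n", e.Kind, e.AtMs)
	}
}
```

The end event is reported only once the silence window has fully elapsed, which is why MinSilenceDurationMs adds directly to response latency.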
Volume Threshold
Filter out low-level background noise before VAD processing:
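import (
    "encoding/binary"
    "math"
)
// Calculate RMS volume, normalized to 0.0-1.0
func calculateVolume(samples []int16) float32 {
    if len(samples) == 0 {
        return 0 // Avoid dividing by zero on an empty frame
    }
    var sum float64
    for _, sample := range samples {
        sum += float64(sample) * float64(sample)
    }
    rms := math.Sqrt(sum / float64(len(samples)))
    return float32(rms / 32768.0)
}
// In VAD processing
func (d *Detector) processAudio(audio []byte) {
    // Decode 16-bit PCM bytes into samples (little-endian byte order assumed)
    samples := make([]int16, len(audio)/2)
    for i := range samples {
        samples[i] = int16(binary.LittleEndian.Uint16(audio[2*i:]))
    }
    volume := calculateVolume(samples)
    if volume < d.VolumeThreshold {
        // Skip VAD processing for very quiet audio
        return
    }
    // Proceed with neural network inference
    probability := d.model.Infer(audio)
    // ...
}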
Volume Threshold Guidelines
| Environment | Recommended Threshold |
|---|---|
| Quiet office | 0.01 |
| Normal telephony | 0.02 |
| Noisy call center | 0.03-0.05 |
| Very noisy (street) | 0.05-0.1 |
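Audio levels are often quoted in decibels relative to full scale (dBFS) rather than as a normalized RMS value. Assuming the thresholds above are normalized RMS levels, as in `calculateVolume`, the conversion is 20·log10(level), so 0.01 corresponds to -40 dBFS and 0.1 to -20 dBFS. The helper names `toDBFS` and `fromDBFS` are illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// toDBFS converts a normalized RMS level in (0.0, 1.0] to dBFS.
func toDBFS(level float64) float64 {
	return 20 * math.Log10(level)
}

// fromDBFS converts a dBFS value back to a normalized RMS level.
func fromDBFS(db float64) float64 {
	return math.Pow(10, db/20)
}

func main() {
	// The threshold values from the guidelines table.
	for _, level := range []float64{0.01, 0.02, 0.03, 0.05, 0.1} {
		fmt.Printf("%.2f -> %.1f dBFS\n", level, toDBFS(level))
	}
}
```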
Interruption Detection
VAD also enables interruption handling when the user speaks during bot output:
type InterruptionHandler struct {
    callback func()
    enabled  bool
}
func (h *InterruptionHandler) OnVADEvent(event VADEvent) {
    if event.Type == SpeechStart && h.enabled {
        // Speech started while interruptions are enabled.
        // The handler does not check bot state itself, so it should
        // only receive events while the bot is producing output.
        h.callback() // Trigger interruption
    }
}
// In call provider
provider.SetInterruptionHandler(&InterruptionHandler{
    callback: func() {
        provider.ClearPlayback() // Stop current audio
        llm.CancelGeneration()   // Cancel LLM
        // Process new user input
    },
    enabled: agent.AllowInterruptions,
})
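If the handler should decide for itself whether the bot is currently speaking, a gate can be added in front of the callback. The sketch below is one possible variant, not the project's implementation: `gatedHandler`, its `botSpeaking` hook, and the lowercase event constants are all hypothetical.

```go
package main

import "fmt"

// eventType is a simplified stand-in for VADEventType.
type eventType int

const (
	speechStart eventType = iota
	speechEnd
)

// gatedHandler triggers an interruption only while the bot is speaking.
type gatedHandler struct {
	enabled     bool
	botSpeaking func() bool // hypothetical hook: is playback currently active?
	onInterrupt func()
}

// handle reports whether the event caused an interruption.
func (h *gatedHandler) handle(t eventType) bool {
	if !h.enabled || t != speechStart || h.onInterrupt == nil {
		return false
	}
	if h.botSpeaking == nil || !h.botSpeaking() {
		return false // user spoke during their own turn; nothing to interrupt
	}
	h.onInterrupt()
	return true
}

func main() {
	playing := true
	h := &gatedHandler{
		enabled:     true,
		botSpeaking: func() bool { return playing },
		onInterrupt: func() { fmt.Println("interrupt: stop playback, cancel generation") },
	}

	fmt.Println("during playback:", h.handle(speechStart))
	playing = false
	fmt.Println("after playback:", h.handle(speechStart))
}
```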
STT Mute Filter
Control when audio is sent to STT based on VAD and bot state:
type STTMuteFilter struct {
    muteStrategies []MuteStrategy
}
type MuteStrategy interface {
    ShouldMute() bool
}
// Mute during bot speech (echo cancellation)
type MuteDuringBotSpeech struct {
    user *session.User
}
func (m *MuteDuringBotSpeech) ShouldMute() bool {
    return m.user.IsBotSpeaking()
}
// Mute during function execution
type MuteDuringFunctionCall struct {
    user *session.User
}
func (m *MuteDuringFunctionCall) ShouldMute() bool {
    return m.user.IsExecutingFunction()
}
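The types above do not show how the filter combines its strategies. A natural reading is that audio is muted as soon as any one strategy asks for it; the sketch below assumes exactly that. The `strategyFunc` adapter and the `filter` method are hypothetical additions for this example.

```go
package main

import "fmt"

// muteStrategy mirrors the MuteStrategy interface above.
type muteStrategy interface {
	ShouldMute() bool
}

// strategyFunc adapts a plain function to muteStrategy (helper for this sketch).
type strategyFunc func() bool

func (f strategyFunc) ShouldMute() bool { return f() }

type sttMuteFilter struct {
	strategies []muteStrategy
}

// shouldMute reports true as soon as any strategy asks for muting.
func (f *sttMuteFilter) shouldMute() bool {
	for _, s := range f.strategies {
		if s.ShouldMute() {
			return true
		}
	}
	return false
}

// filter returns the frame unchanged, or nil when it must not reach STT.
func (f *sttMuteFilter) filter(frame []byte) []byte {
	if f.shouldMute() {
		return nil
	}
	return frame
}

func main() {
	botSpeaking, runningFunction := false, false
	f := &sttMuteFilter{strategies: []muteStrategy{
		strategyFunc(func() bool { return botSpeaking }),
		strategyFunc(func() bool { return runningFunction }),
	}}

	frame := []byte{1, 2, 3, 4}
	fmt.Println("idle, forwarded:", f.filter(frame) != nil)
	botSpeaking = true
	fmt.Println("bot speaking, forwarded:", f.filter(frame) != nil)
}
```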
Best Practices
1. Match Sample Rate
Ensure audio sample rate matches VAD configuration:
// ❌ Wrong - sample rate mismatch
cfg := silero.DetectorConfig{
    SampleRate: 16000, // VAD expects 16kHz
}
// But telephony sends 8kHz audio - will cause detection issues
// ✅ Correct
cfg := silero.DetectorConfig{
    SampleRate: 8000, // Matches telephony audio
}
2. Handle Edge Cases
// Very short utterances (< 200ms)
if speechDuration < 200*time.Millisecond {
    // Might be noise, cough, or "uh-huh"
    // Consider waiting for more speech
}
// Very long silence after speech
if silenceDuration > 3*time.Second {
    // User might be thinking or distracted
    // Consider sending a prompt
}
3. Adaptive Thresholds
Adjust thresholds based on call quality:
func adaptThreshold(initialNoiseLevel float32) float32 {
    if initialNoiseLevel > 0.1 {
        return 0.85 // Noisy environment - be more strict
    }
    return 0.8 // Normal threshold
}
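The function above needs an `initialNoiseLevel` as input. One way to obtain it, sketched here under the assumption that the first frames of a call contain only background noise, is to average the normalized RMS of those frames. `rms`, `initialNoiseLevel`, and `constantFrame` are hypothetical helpers; `adaptThreshold` is repeated so the example is self-contained.

```go
package main

import (
	"fmt"
	"math"
)

// rms returns the normalized RMS level (0.0-1.0) of 16-bit samples.
func rms(samples []int16) float64 {
	if len(samples) == 0 {
		return 0
	}
	var sum float64
	for _, s := range samples {
		sum += float64(s) * float64(s)
	}
	return math.Sqrt(sum/float64(len(samples))) / 32768.0
}

// initialNoiseLevel averages the RMS of the given frames.
// Assumption: the caller passes frames recorded before the user speaks.
func initialNoiseLevel(frames [][]int16) float32 {
	if len(frames) == 0 {
		return 0
	}
	var total float64
	for _, f := range frames {
		total += rms(f)
	}
	return float32(total / float64(len(frames)))
}

// adaptThreshold is the function from the text above.
func adaptThreshold(level float32) float32 {
	if level > 0.1 {
		return 0.85 // Noisy environment - be more strict
	}
	return 0.8 // Normal threshold
}

// constantFrame builds a synthetic frame for the demo.
func constantFrame(value int16, n int) []int16 {
	frame := make([]int16, n)
	for i := range frame {
		frame[i] = value
	}
	return frame
}

func main() {
	quiet := [][]int16{constantFrame(100, 256), constantFrame(120, 256)}
	noisy := [][]int16{constantFrame(8000, 256), constantFrame(9000, 256)}

	fmt.Println("quiet line threshold:", adaptThreshold(initialNoiseLevel(quiet)))
	fmt.Println("noisy line threshold:", adaptThreshold(initialNoiseLevel(noisy)))
}
```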
Debugging VAD Issues
Common Problems
| Issue | Symptom | Solution |
|---|---|---|
| False positives | Bot interrupts itself | Increase threshold, add volume filter |
| Missed speech | Bot doesn't respond | Decrease threshold |
| Late detection | Delayed responses | Reduce MinSilenceDurationMs |
| Cut-off speech | Responses before user finishes | Increase MinSilenceDurationMs |
Logging VAD Events
detector.SetLogLevel(silero.LogLevelDebug)
// Output:
// [VAD] Frame 1: prob=0.12, speaking=false
// [VAD] Frame 2: prob=0.45, speaking=false
// [VAD] Frame 3: prob=0.89, speaking=true, event=SpeechStart
// [VAD] Frame 4: prob=0.92, speaking=true
// ...
// [VAD] Frame N: prob=0.15, speaking=true
// [VAD] Frame N+1: prob=0.08, speaking=false
// [VAD] Silence duration: 200ms, event=SpeechEnd
Next Steps
- Interruption Handling - Handle user barge-in
- Turn Detection - Advanced turn-taking
- Latency Optimization - Reduce response time