Voice Activity Detection (VAD)

VAD is the first stage of audio processing: it determines when the user starts speaking and when they have finished. Getting VAD right is crucial for natural conversations.

Why VAD Matters

Without proper VAD:
─────────────────────────────────────────────────────────────
User: "I want to check my order status" [background noise...]
Bot:  [Waits for noise to stop... 3 seconds pass]
      "I'd be happy to help! What's your order number?"
User: [Frustrated by delay]

With optimized VAD:
─────────────────────────────────────────────────────────────
User: "I want to check my order status"
Bot:  [Detects end of speech in 200ms]
      "I'd be happy to help! What's your order number?"
User: [Smooth conversation]

Silero VAD

We use Silero VAD, a neural-network-based voice activity detector that runs locally.

Advantages

High accuracy: 99%+ on clean speech
Low latency: ~10ms per frame
No cloud dependency: Runs entirely on-device
Small footprint: ~2MB model file

How It Works

Audio Input (8kHz, 16-bit PCM)
        │
        ▼
┌───────────────────┐
│   Silero ONNX     │
│   Neural Network  │
│                   │
│ Input: 512 samples│
│ Output: 0.0-1.0   │
│ (speech prob)     │
└───────────────────┘
        │
        ▼
Speech Probability
        │
    ┌───┴───┐
    │       │
   <0.8    ≥0.8
    │       │
 Silence  Speech
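
The framing and decision steps in the diagram can be sketched as follows. This is a minimal illustration, not the detector's actual code: it assumes the incoming audio is little-endian 16-bit PCM and that the model takes float32 samples normalized to the range -1.0 to 1.0. The names `pcmToFrames` and `isSpeech` are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const frameSize = 512 // samples per inference window, as in the diagram

// pcmToFrames decodes little-endian 16-bit PCM (an assumption about the
// transport) and splits it into fixed-size frames of float32 samples.
// A trailing partial frame is dropped here; a real pipeline would buffer it.
func pcmToFrames(pcm []byte) [][]float32 {
	n := len(pcm) / 2 // number of whole 16-bit samples
	var frames [][]float32
	for start := 0; start+frameSize <= n; start += frameSize {
		frame := make([]float32, frameSize)
		for i := range frame {
			s := int16(binary.LittleEndian.Uint16(pcm[2*(start+i):]))
			frame[i] = float32(s) / 32768.0
		}
		frames = append(frames, frame)
	}
	return frames
}

// isSpeech applies the decision rule from the diagram: a probability at or
// above the threshold counts as speech, anything below counts as silence.
func isSpeech(prob, threshold float32) bool {
	return prob >= threshold
}

func main() {
	pcm := make([]byte, 2*1300) // 1300 samples of digital silence
	fmt.Println("full frames:", len(pcmToFrames(pcm)))
	fmt.Println("0.79 is speech:", isSpeech(0.79, 0.8))
	fmt.Println("0.80 is speech:", isSpeech(0.80, 0.8))
}
```

With 1300 input samples the sketch yields two full frames and leaves 276 samples over, which is why production code normally carries the remainder into the next read.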

Configuration

Basic Configuration

cfg := silero.DetectorConfig{
    ModelPath:            "./silero_vad.onnx",
    SampleRate:           8000,          // Match telephony audio
    Threshold:            0.8,           // Speech probability threshold
    MinSilenceDurationMs: 200,           // Silence before end-of-speech
    LogLevel:             silero.LogLevelError,
}

detector, err := silero.NewDetector(cfg)
if err != nil {
    // Typically a missing or unreadable model file
    log.Fatalf("failed to create VAD detector: %v", err)
}

Configuration Parameters

Parameter	Type	Default	Description
`ModelPath`	string	required	Path to silero_vad.onnx
`SampleRate`	int	8000	Audio sample rate in Hz
`Threshold`	float32	0.8	Speech probability threshold (0.0-1.0)
`MinSilenceDurationMs`	int	200	Silence duration in ms before end-of-speech is reported
`VolumeThreshold`	float32	0.0	Minimum normalized RMS level (0.0-1.0) for processing; 0.0 disables the filter

Threshold Tuning

Threshold = 0.5 (too sensitive)
─────────────────────────────────────────────────────────────
Detects: speech, breathing, background noise, air conditioning
Result: False positives, bot interrupts itself

Threshold = 0.9 (too strict)
─────────────────────────────────────────────────────────────
Detects: Only loud, clear speech
Result: Misses quiet speakers, soft endings

Threshold = 0.8 (balanced)
─────────────────────────────────────────────────────────────
Detects: Normal speech, ignores most noise
Result: Good for most telephony use cases
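
One way to compare thresholds before committing to one is to replay recorded per-frame probabilities and count how many frames each setting would class as speech. The sketch below assumes you have such probabilities available (for example from debug logs); the values in `main` are made up for illustration and `speechFrames` is a hypothetical helper.

```go
package main

import "fmt"

// speechFrames counts the frames whose probability meets the threshold.
func speechFrames(probs []float32, threshold float32) int {
	n := 0
	for _, p := range probs {
		if p >= threshold {
			n++
		}
	}
	return n
}

func main() {
	// Made-up per-frame probabilities, not measurements from a real call.
	probs := []float32{0.12, 0.45, 0.55, 0.82, 0.89, 0.92, 0.86, 0.15}
	for _, th := range []float32{0.5, 0.8, 0.9} {
		fmt.Printf("threshold %.1f: %d of %d frames classed as speech\n",
			th, speechFrames(probs, th), len(probs))
	}
}
```

On this sample the counts are 5, 4, and 1 frames respectively: the 0.5 setting also accepts the borderline 0.55 frame, while 0.9 rejects most of the utterance.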

Per-Agent Configuration

Configure VAD per agent based on use case:

{
  "agent": {
    "name": "Call Center Agent",
    "vadConfig": {
      "threshold": 0.75,
      "minSilenceDurationMs": 250,
      "volumeThreshold": 0.02
    }
  }
}
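
A per-agent override like this could be loaded on top of the documented defaults roughly as follows. This is a sketch under assumptions: the `VADConfig` and `agentFile` types and the `loadVADConfig` function are hypothetical, and only the JSON field names and default values come from this page.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VADConfig mirrors the "vadConfig" object in the example (hypothetical type).
type VADConfig struct {
	Threshold            float32 `json:"threshold"`
	MinSilenceDurationMs int     `json:"minSilenceDurationMs"`
	VolumeThreshold      float32 `json:"volumeThreshold"`
}

// agentFile matches the top-level shape of the example document.
type agentFile struct {
	Agent struct {
		Name      string    `json:"name"`
		VADConfig VADConfig `json:"vadConfig"`
	} `json:"agent"`
}

// loadVADConfig parses an agent document, keeping the documented defaults
// for any field the agent does not override, and rejects invalid values.
func loadVADConfig(data []byte) (VADConfig, error) {
	var f agentFile
	// Pre-fill defaults; json.Unmarshal only overwrites keys that are present.
	f.Agent.VADConfig = VADConfig{Threshold: 0.8, MinSilenceDurationMs: 200}
	if err := json.Unmarshal(data, &f); err != nil {
		return VADConfig{}, fmt.Errorf("parse agent config: %w", err)
	}
	cfg := f.Agent.VADConfig
	if cfg.Threshold < 0 || cfg.Threshold > 1 {
		return VADConfig{}, fmt.Errorf("threshold %v outside 0.0-1.0", cfg.Threshold)
	}
	if cfg.MinSilenceDurationMs < 0 {
		return VADConfig{}, fmt.Errorf("minSilenceDurationMs %d is negative", cfg.MinSilenceDurationMs)
	}
	return cfg, nil
}

func main() {
	doc := []byte(`{"agent": {"name": "Call Center Agent", "vadConfig": {"threshold": 0.75}}}`)
	cfg, err := loadVADConfig(doc)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Here the agent overrides only the threshold, so the silence duration stays at the 200 ms default.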

Speech Detection Events

The VAD emits events that drive the pipeline:

type VADEvent struct {
    Type      VADEventType
    Timestamp time.Time
    Duration  time.Duration
}

type VADEventType int

const (
    SpeechStart  VADEventType = iota  // User started speaking
    SpeechEnd                          // User stopped speaking
    Interruption                       // User spoke during bot output
)

Event Flow Example

Timeline (ms):  0    100   200   300   400   500   600   700   800
User Audio:     [─────████████████████████████─────────────────────]
                      │                      │
Events:           SpeechStart            SpeechEnd
                    (100ms)               (500ms)
                                         + 200ms silence
                                         = 700ms event fired
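
The timing in this example can be reproduced with a small state machine that turns per-frame probabilities into events. This is a sketch, not the detector's implementation: the `Segmenter` type is hypothetical, and it uses 100 ms frames purely so the numbers line up with the timeline above.

```go
package main

import "fmt"

type EventType int

const (
	SpeechStart EventType = iota
	SpeechEnd
)

type Event struct {
	Type EventType
	AtMs int // when the event fired, in ms from stream start
}

// Segmenter tracks speaking state across frames (hypothetical type).
type Segmenter struct {
	Threshold    float32
	MinSilenceMs int
	FrameMs      int

	speaking  bool
	silenceMs int
	nowMs     int // end time of the most recent frame
}

// Push consumes one frame's speech probability and reports an event if one fired.
func (s *Segmenter) Push(prob float32) (Event, bool) {
	s.nowMs += s.FrameMs
	if prob >= s.Threshold {
		s.silenceMs = 0 // any speech resets the silence timer
		if !s.speaking {
			s.speaking = true
			return Event{SpeechStart, s.nowMs - s.FrameMs}, true
		}
		return Event{}, false
	}
	if s.speaking {
		s.silenceMs += s.FrameMs
		if s.silenceMs >= s.MinSilenceMs {
			s.speaking = false
			s.silenceMs = 0
			return Event{SpeechEnd, s.nowMs}, true
		}
	}
	return Event{}, false
}

func main() {
	seg := &Segmenter{Threshold: 0.8, MinSilenceMs: 200, FrameMs: 100}
	// One silent frame, four speech frames (100-500 ms), then silence.
	for _, p := range []float32{0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1} {
		if ev, ok := seg.Push(p); ok {
			fmt.Printf("event=%d at %dms\n", ev.Type, ev.AtMs)
		}
	}
}
```

Replaying the timeline yields SpeechStart at 100 ms and SpeechEnd at 700 ms, that is, 200 ms after the last speech frame ended at 500 ms.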

Volume Threshold

Filter out low-level background noise before VAD processing:

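// calculateVolume returns the RMS level of 16-bit samples, normalized to 0.0-1.0.
func calculateVolume(samples []int16) float32 {
    if len(samples) == 0 {
        return 0 // Avoid dividing by zero on an empty frame
    }
    var sum float64
    for _, sample := range samples {
        sum += float64(sample) * float64(sample)
    }
    rms := math.Sqrt(sum / float64(len(samples)))
    return float32(rms / 32768.0) // Normalize to 0.0-1.0
}

// In VAD processing. The frame is passed as decoded 16-bit samples so that
// it matches calculateVolume's parameter type.
func (d *Detector) processAudio(samples []int16) {
    volume := calculateVolume(samples)

    if volume < d.VolumeThreshold {
        // Skip inference for very quiet audio. The skipped frame should
        // still count as silence toward MinSilenceDurationMs.
        return
    }

    // Proceed with neural network inference
    probability := d.model.Infer(samples)
    // ...
}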

Volume Threshold Guidelines

Environment	Recommended Threshold
Quiet office	0.01
Normal telephony	0.02
Noisy call center	0.03-0.05
Very noisy (street)	0.05-0.1

Interruption Detection

VAD also enables interruption handling when the user speaks during bot output:

type InterruptionHandler struct {
    callback func()
    enabled  bool
}

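func (h *InterruptionHandler) OnVADEvent(event VADEvent) {
    // Note: SpeechStart fires for any new speech, not only speech that
    // overlaps bot output. Matching on the Interruption event type instead
    // would limit the callback to genuine barge-in.
    if event.Type == SpeechStart && h.enabled {
        h.callback() // Trigger interruption
    }
}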

// In call provider
provider.SetInterruptionHandler(&InterruptionHandler{
    callback: func() {
        provider.ClearPlayback()     // Stop current audio
        llm.CancelGeneration()       // Cancel LLM
        // Process new user input
    },
    enabled: agent.AllowInterruptions,
})

STT Mute Filter

Control when audio is sent to STT based on VAD and bot state:

type STTMuteFilter struct {
    muteStrategies []MuteStrategy
}

type MuteStrategy interface {
    ShouldMute() bool
}

// Mute during bot speech (keeps the bot's own audio out of STT)
type MuteDuringBotSpeech struct {
    user *session.User
}

func (m *MuteDuringBotSpeech) ShouldMute() bool {
    return m.user.IsBotSpeaking()
}

// Mute during function execution
type MuteDuringFunctionCall struct {
    user *session.User
}

func (m *MuteDuringFunctionCall) ShouldMute() bool {
    return m.user.IsExecutingFunction()
}
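
The filter itself is not shown above. A minimal sketch of how it might combine its strategies is given below, assuming a frame is withheld from STT as soon as any one strategy asks for muting. The `ShouldMute` and `Filter` methods on `STTMuteFilter` and the `flagStrategy` stand-in are hypothetical; the real strategies read their state from the session.

```go
package main

import "fmt"

type MuteStrategy interface {
	ShouldMute() bool
}

type STTMuteFilter struct {
	muteStrategies []MuteStrategy
}

// ShouldMute reports whether any strategy wants audio withheld from STT.
func (f *STTMuteFilter) ShouldMute() bool {
	for _, s := range f.muteStrategies {
		if s.ShouldMute() {
			return true
		}
	}
	return false
}

// Filter returns the frame unchanged, or nil when the filter is muted.
func (f *STTMuteFilter) Filter(frame []byte) []byte {
	if f.ShouldMute() {
		return nil
	}
	return frame
}

// flagStrategy is a stand-in for the session-backed strategies above.
type flagStrategy struct{ muted *bool }

func (s flagStrategy) ShouldMute() bool { return *s.muted }

func main() {
	botSpeaking, functionRunning := false, false
	filter := &STTMuteFilter{muteStrategies: []MuteStrategy{
		flagStrategy{&botSpeaking},
		flagStrategy{&functionRunning},
	}}

	fmt.Println("forwarded bytes:", len(filter.Filter([]byte{1, 2, 3, 4})))
	botSpeaking = true
	fmt.Println("muted while bot speaks:", filter.Filter([]byte{1, 2, 3, 4}) == nil)
}
```

Treating the strategies as a logical OR keeps each one independent: adding a new mute condition does not require touching the existing ones.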

Best Practices

1. Match Sample Rate

Ensure audio sample rate matches VAD configuration:

// ❌ Wrong - sample rate mismatch
cfg := silero.DetectorConfig{
    SampleRate: 16000, // VAD expects 16kHz
}
// But telephony sends 8kHz audio - will cause detection issues

// ✅ Correct
cfg := silero.DetectorConfig{
    SampleRate: 8000, // Matches telephony audio
}
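
A mismatch like this is easy to catch at startup with a small guard. The sketch below is an assumption-laden illustration: `checkSampleRate` is a hypothetical helper, and it assumes the model accepts only 8 kHz and 16 kHz input, which is what the published Silero models document; verify this against the model version you ship.

```go
package main

import "fmt"

// checkSampleRate returns an error when the detector is configured for a
// rate the model is assumed not to support, or when the audio source and
// the detector disagree.
func checkSampleRate(sourceHz, configuredHz int) error {
	if configuredHz != 8000 && configuredHz != 16000 {
		return fmt.Errorf("vad: unsupported sample rate %d Hz (want 8000 or 16000)", configuredHz)
	}
	if sourceHz != configuredHz {
		return fmt.Errorf("vad: audio is %d Hz but detector is configured for %d Hz", sourceHz, configuredHz)
	}
	return nil
}

func main() {
	fmt.Println(checkSampleRate(8000, 8000))  // telephony audio, matching config
	fmt.Println(checkSampleRate(8000, 16000)) // the mismatch shown above
}
```

Failing fast here is cheaper than diagnosing degraded detection later, since a rate mismatch does not produce an error on its own.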

2. Handle Edge Cases

// Very short utterances (< 200ms)
if speechDuration < 200*time.Millisecond {
    // Might be noise, cough, or "uh-huh"
    // Consider waiting for more speech
}

// Very long silence after speech
if silenceDuration > 3*time.Second {
    // User might be thinking or distracted
    // Consider sending a prompt
}

3. Adaptive Thresholds

Adjust thresholds based on call quality:

func adaptThreshold(initialNoiseLevel float32) float32 {
    if initialNoiseLevel > 0.1 {
        return 0.85 // Noisy environment - be more strict
    }
    return 0.8 // Normal threshold
}
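
The `initialNoiseLevel` input has to come from somewhere. One option, sketched here as an assumption rather than a description of the actual pipeline, is to average the RMS level of the first few frames of the call, before the user is expected to speak. `rmsLevel`, `initialNoiseLevel`, and `constantFrames` are hypothetical helpers; `adaptThreshold` is repeated from above so the example is self-contained.

```go
package main

import (
	"fmt"
	"math"
)

// rmsLevel returns the RMS of 16-bit samples, normalized to 0.0-1.0.
func rmsLevel(samples []int16) float32 {
	if len(samples) == 0 {
		return 0
	}
	var sum float64
	for _, s := range samples {
		sum += float64(s) * float64(s)
	}
	return float32(math.Sqrt(sum/float64(len(samples))) / 32768.0)
}

// initialNoiseLevel averages the RMS level over the leading frames of a call.
func initialNoiseLevel(frames [][]int16) float32 {
	if len(frames) == 0 {
		return 0
	}
	var total float32
	for _, f := range frames {
		total += rmsLevel(f)
	}
	return total / float32(len(frames))
}

// adaptThreshold is the function from the section above.
func adaptThreshold(initialNoiseLevel float32) float32 {
	if initialNoiseLevel > 0.1 {
		return 0.85 // Noisy environment - be more strict
	}
	return 0.8 // Normal threshold
}

// constantFrames builds synthetic frames of a fixed amplitude for the demo.
func constantFrames(n, size int, amp int16) [][]int16 {
	frames := make([][]int16, n)
	for i := range frames {
		frames[i] = make([]int16, size)
		for j := range frames[i] {
			frames[i][j] = amp
		}
	}
	return frames
}

func main() {
	quiet := constantFrames(5, 512, 328) // level of roughly 0.01
	noisy := constantFrames(5, 512, 6554) // level of roughly 0.2
	fmt.Println("quiet line threshold:", adaptThreshold(initialNoiseLevel(quiet)))
	fmt.Println("noisy line threshold:", adaptThreshold(initialNoiseLevel(noisy)))
}
```

The synthetic frames stand in for real audio only to show the two branches; on a live call the estimate is only meaningful if the sampled frames contain no speech.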

Debugging VAD Issues

Common Problems

Issue	Symptom	Solution
False positives	Bot interrupts itself	Increase threshold, add volume filter
Missed speech	Bot doesn't respond	Decrease threshold
Late detection	Delayed responses	Reduce MinSilenceDurationMs
Cut-off speech	Responses before user finishes	Increase MinSilenceDurationMs

Logging VAD Events

detector.SetLogLevel(silero.LogLevelDebug)

// Output:
// [VAD] Frame 1: prob=0.12, speaking=false
// [VAD] Frame 2: prob=0.45, speaking=false
// [VAD] Frame 3: prob=0.89, speaking=true, event=SpeechStart
// [VAD] Frame 4: prob=0.92, speaking=true
// ...
// [VAD] Frame N: prob=0.15, speaking=true
// [VAD] Frame N+1: prob=0.08, speaking=false
// [VAD] Silence duration: 200ms, event=SpeechEnd

Next Steps

Interruption Handling - Handle user barge-in
Turn Detection - Advanced turn-taking
Latency Optimization - Reduce response time