Turn Detection
Turn detection determines when the user has finished speaking and it's the bot's turn to respond. Getting this right is crucial for natural conversations.
What is Turn Detection?
Natural Conversation Flow:
─────────────────────────────────────────────────────────────
User: "I want to check my order status" ─────────────┐
                                                     │ Turn boundary
Bot: "Sure, what's your order number?" ◄─────────────┘

User: "It's 12345" ──────────────────────────────────┐
                                                     │ Turn boundary
Bot: "Your order has shipped..." ◄───────────────────┘
Poor Turn Detection:
─────────────────────────────────────────────────────────────
User: "I want to check my—"
Bot: [Interrupts] "How can I help?" ← Bot spoke too early
User: "I want to check my order status"
[3 second pause]
Bot: "Sure, what's your order number?" ← Bot spoke too late
Components of Turn Detection
                    ┌─────────────────────────────────────────┐
                    │             Turn Detection              │
                    │                                         │
User Audio ────────►│  VAD ──► Endpointing ──► Confirmation   │────► End of Turn
                    │   │           │               │         │
                    │   ▼           ▼               ▼         │
                    │ Speech     Silence         Semantic     │
                    │ Prob.      Duration        Analysis     │
                    │                                         │
                    └─────────────────────────────────────────┘
1. VAD (Voice Activity Detection)
Detects speech vs silence:
type VADResult struct {
IsSpeech bool
Probability float32
Timestamp time.Time
}
// VAD emits: speech_start, speech_end events
2. Endpointing
Determines when speech has ended:
type EndpointingConfig struct {
MinSilenceDuration time.Duration // Silence to trigger end
MaxSpeechDuration time.Duration // Maximum turn length
VolumeThreshold float32 // Minimum audio level
}
3. Semantic Confirmation
Uses STT and context for smarter detection:
type SemanticTurnDetector struct {
stt STTProvider
pendingText string
lastWordTime time.Time
}
func (d *SemanticTurnDetector) IsCompleteTurn(transcript string) bool {
// Check for complete sentence
if endsWithPunctuation(transcript) {
return true
}
// Check for question patterns
if startsWithQuestion(transcript) && len(transcript) > 20 {
return true
}
// Check for trailing silence after content
if time.Since(d.lastWordTime) > 500*time.Millisecond {
return true
}
return false
}
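`IsCompleteTurn` relies on two helpers that are not shown. A possible sketch follows; the question-word list is illustrative and English-only, and the punctuation check is only useful when the STT provider emits punctuation in its transcripts.

```go
package main

import "strings"

// endsWithPunctuation reports whether the transcript ends in sentence-final punctuation.
func endsWithPunctuation(transcript string) bool {
	t := strings.TrimSpace(transcript)
	return strings.HasSuffix(t, ".") || strings.HasSuffix(t, "?") || strings.HasSuffix(t, "!")
}

// questionWords is an assumed, non-exhaustive list of English question openers.
var questionWords = map[string]bool{
	"what": true, "when": true, "where": true, "who": true, "why": true,
	"how": true, "which": true, "can": true, "could": true, "do": true,
	"does": true, "is": true, "are": true, "will": true, "would": true,
}

// startsWithQuestion reports whether the first word is a common question opener.
func startsWithQuestion(transcript string) bool {
	fields := strings.Fields(strings.ToLower(transcript))
	if len(fields) == 0 {
		return false
	}
	return questionWords[strings.Trim(fields[0], ",.?!")]
}
```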
Configuration
Basic Configuration
{
"agent": {
"turnDetection": {
"mode": "vad",
"vadThreshold": 0.8,
"silenceDuration": 300,
"maxTurnDuration": 30000
}
}
}
Mode Options
| Mode | Description | Latency | Accuracy |
|---|---|---|---|
| vad | VAD + silence timer | ⚡ Fastest | Good |
| semantic | VAD + STT analysis | 🚀 Fast | Better |
| hybrid | Combines both | 🚀 Fast | Best |
VAD Mode
Simple silence-based detection:
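// Assumes VADTurnDetector also has mu sync.Mutex and silenceTimer *time.Timer fields.
func (d *VADTurnDetector) OnVADEvent(event VADEvent) {
	d.mu.Lock()
	defer d.mu.Unlock()
	switch event.Type {
	case SpeechStart:
		// New speech cancels any pending end-of-turn timer.
		if d.silenceTimer != nil {
			d.silenceTimer.Stop()
		}
		d.turnStartTime = time.Now()
		d.isSpeaking = true
	case SpeechEnd:
		d.isSpeaking = false
		// Wait for the configured silence duration; emit only if speech has not resumed.
		d.silenceTimer = time.AfterFunc(d.silenceDuration, func() {
			d.mu.Lock()
			defer d.mu.Unlock()
			if !d.isSpeaking {
				d.emitEndOfTurn() // must not take d.mu itself
			}
		})
	}
}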
Semantic Mode
Uses transcript content for smarter detection:
func (d *SemanticTurnDetector) OnTranscript(event TranscriptEvent) {
d.pendingText = event.Text
d.lastWordTime = time.Now()
if event.IsFinal {
// Final transcript from STT endpointing
d.emitEndOfTurn()
return
}
// Analyze for completeness
if d.isSemanticComplete(event.Text) {
// Give brief pause for continuation
time.AfterFunc(200*time.Millisecond, func() {
if d.pendingText == event.Text {
d.emitEndOfTurn()
}
})
}
}
func (d *SemanticTurnDetector) isSemanticComplete(text string) bool {
text = strings.TrimSpace(text)
// Ends with punctuation
if strings.HasSuffix(text, ".") ||
strings.HasSuffix(text, "?") ||
strings.HasSuffix(text, "!") {
return true
}
// Short affirmative/negative responses
shortResponses := []string{"yes", "no", "okay", "sure", "thanks", "bye"}
lower := strings.ToLower(text)
for _, resp := range shortResponses {
if lower == resp {
return true
}
}
return false
}
Hybrid Mode
Combines VAD and semantic analysis:
type HybridTurnDetector struct {
vadDetector *VADTurnDetector
semanticDetector *SemanticTurnDetector
pendingEndOfTurn bool
}
func (d *HybridTurnDetector) Process(event any) {
switch e := event.(type) {
case VADEvent:
d.vadDetector.OnVADEvent(e)
if e.Type == SpeechEnd {
// VAD says speech ended, check semantic
if d.semanticDetector.isSemanticComplete(d.semanticDetector.pendingText) {
d.emitEndOfTurn()
} else {
d.pendingEndOfTurn = true
}
}
case TranscriptEvent:
d.semanticDetector.OnTranscript(e)
if d.pendingEndOfTurn && d.semanticDetector.isSemanticComplete(e.Text) {
d.emitEndOfTurn()
d.pendingEndOfTurn = false
}
}
}
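Because the VAD detector, the semantic detector, and the hybrid wrapper can each decide that a turn is over, the same turn may be ended more than once. One possible guard is sketched here: a hypothetical `turnGate` that all three paths call instead of emitting directly, so the callback fires once per turn.

```go
package main

import "sync"

// turnGate (hypothetical) lets several detectors race to end a turn while
// guaranteeing the end-of-turn callback runs at most once per turn.
type turnGate struct {
	mu      sync.Mutex
	emitted bool
	onEnd   func()
}

// Emit runs the callback if this is the first call since the last Reset.
// It returns true when the callback ran.
func (g *turnGate) Emit() bool {
	g.mu.Lock()
	if g.emitted {
		g.mu.Unlock()
		return false
	}
	g.emitted = true
	g.mu.Unlock()
	g.onEnd() // called outside the lock so the callback may call Reset
	return true
}

// Reset re-arms the gate; call it when the next user turn starts.
func (g *turnGate) Reset() {
	g.mu.Lock()
	g.emitted = false
	g.mu.Unlock()
}
```

Calling `Reset` on the next speech start event re-arms the gate for the following turn.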
STT Endpointing Integration
Leverage STT provider's endpointing:
Deepgram
{
"sttConfig": {
"endpointing": 300,
"utterance_end_ms": 1000,
"interim_results": true
}
}
Google Cloud Speech-to-Text
{
"sttConfig": {
"singleUtterance": false,
"voiceActivityTimeout": {
"speechEndTimeout": "1s"
}
}
}
Use Case Configurations
Fast-Paced Support
Quick responses for simple queries:
{
"turnDetection": {
"mode": "vad",
"silenceDuration": 200,
"vadThreshold": 0.75
}
}
Thoughtful Conversations
Allow pauses for complex topics:
{
"turnDetection": {
"mode": "semantic",
"silenceDuration": 500,
"allowThinkingPauses": true
}
}
Elderly/Accessibility
More patient turn detection:
{
"turnDetection": {
"mode": "hybrid",
"silenceDuration": 700,
"vadThreshold": 0.85,
"maxTurnDuration": 60000
}
}
IVR/Commands
Quick command detection:
{
"turnDetection": {
"mode": "vad",
"silenceDuration": 150,
"shortResponseMode": true
}
}
Handling Edge Cases
Trailing Filler Words
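func (d *SemanticTurnDetector) stripFillers(text string) string {
	fillers := []string{" um", " uh", " like", " you know", " so"}
	// Repeat until nothing changes so stacked fillers ("... so um") are all removed.
	for changed := true; changed; {
		changed = false
		for _, filler := range fillers {
			if rest, ok := strings.CutSuffix(text, filler); ok { // Go 1.20+
				text, changed = rest, true
			}
		}
	}
	return text
}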
func (d *SemanticTurnDetector) isSemanticComplete(text string) bool {
// Strip trailing fillers before checking
text = d.stripFillers(text)
// ... rest of logic
}
Multi-Sentence Turns
func (d *SemanticTurnDetector) isMultiSentenceTurn(text string) bool {
sentences := splitSentences(text)
// If first sentence is a question, they might continue
if len(sentences) > 0 && isQuestion(sentences[0]) {
return false // Wait for more
}
// If we have 2+ complete sentences, probably done
if len(sentences) >= 2 {
return true
}
return false
}
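The `splitSentences` and `isQuestion` helpers used above are not defined in this section. A deliberately naive sketch, assuming punctuated English transcripts, might look like this. It ignores abbreviations and decimals, and it drops any unterminated trailing text so that only complete sentences are counted.

```go
package main

import "strings"

// splitSentences cuts the text after each '.', '?' or '!'.
// Trailing text without a terminator is not returned.
func splitSentences(text string) []string {
	var sentences []string
	start := 0
	for i, r := range text {
		if r == '.' || r == '?' || r == '!' {
			// Skip fragments that are only a punctuation mark (e.g. from "...").
			if s := strings.TrimSpace(text[start : i+1]); len(s) > 1 {
				sentences = append(sentences, s)
			}
			start = i + 1
		}
	}
	return sentences
}

// isQuestion reports whether the sentence ends with a question mark.
func isQuestion(sentence string) bool {
	return strings.HasSuffix(strings.TrimSpace(sentence), "?")
}
```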
Phone Number Dictation
func (d *SemanticTurnDetector) isPhoneNumberComplete(text string) bool {
// Extract digits
digits := extractDigits(text)
// US phone: 10 digits
if len(digits) == 10 {
return true
}
// International: 11-15 digits
if len(digits) >= 11 && len(digits) <= 15 {
return true
}
return false
}
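`extractDigits` is also left undefined. The sketch below is one assumed implementation: it keeps ASCII digits and additionally maps spoken English number words, since some STT providers return "four one five" rather than "415". The word list is an illustration, not an exhaustive normalizer.

```go
package main

import "strings"

// spokenDigits maps English number words to digits (assumed, non-exhaustive).
var spokenDigits = map[string]string{
	"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
	"five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

// extractDigits returns every digit in the text, in order, as a string.
func extractDigits(text string) string {
	var b strings.Builder
	for _, word := range strings.Fields(strings.ToLower(text)) {
		word = strings.Trim(word, ".,;:!?")
		if d, ok := spokenDigits[word]; ok {
			b.WriteString(d)
			continue
		}
		for _, r := range word {
			if r >= '0' && r <= '9' {
				b.WriteRune(r)
			}
		}
	}
	return b.String()
}
```

Returning a string keeps `len(digits)` equal to the digit count, which is what the length checks above rely on.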
Metrics and Debugging
Turn Detection Metrics
type TurnMetrics struct {
TurnDurations []time.Duration
SilenceBeforeTurn []time.Duration
InterruptedTurns int
FalseEndpoints int
MissedEndpoints int
}
func (m *TurnMetrics) Record(turn TurnEvent) {
m.TurnDurations = append(m.TurnDurations, turn.Duration)
m.SilenceBeforeTurn = append(m.SilenceBeforeTurn, turn.SilenceBeforeResponse)
if turn.WasInterrupted {
m.InterruptedTurns++
}
}
func (m *TurnMetrics) Analyze() TurnAnalysis {
	// With no recorded turns the interruption rate would be 0/0 (NaN).
	if len(m.TurnDurations) == 0 {
		return TurnAnalysis{}
	}
	return TurnAnalysis{
		AvgTurnDuration:          average(m.TurnDurations),
		AvgSilenceBeforeResponse: average(m.SilenceBeforeTurn),
		InterruptionRate:         float64(m.InterruptedTurns) / float64(len(m.TurnDurations)),
	}
}
Debug Logging
func (d *TurnDetector) SetDebugMode(enabled bool) {
d.debug = enabled
}
func (d *TurnDetector) debugLog(format string, args ...any) {
if d.debug {
log.Printf("[TurnDetection] "+format, args...)
}
}
// Output:
// [TurnDetection] VAD: speech_start at 0ms
// [TurnDetection] Interim transcript: "What is my"
// [TurnDetection] Interim transcript: "What is my order"
// [TurnDetection] VAD: speech_end at 1500ms
// [TurnDetection] Silence timer started: 300ms
// [TurnDetection] Final transcript: "What is my order status"
// [TurnDetection] Semantic: complete sentence detected
// [TurnDetection] End of turn emitted at 1800ms
Best Practices
1. Start Conservative
// Start with longer silence duration
config := TurnDetectionConfig{
SilenceDuration: 400 * time.Millisecond,
VADThreshold: 0.8,
}
// Tune based on metrics
if avgInterruptionRate > 0.1 {
config.SilenceDuration += 100 * time.Millisecond
}
2. Context-Aware Adjustment
func (d *TurnDetector) adjustForContext(context *ConversationContext) {
// Shorter patience after bot asks question
if context.LastBotMessageWasQuestion {
d.silenceDuration = 250 * time.Millisecond
}
// Longer patience for complex topics
if context.Topic == "technical_support" {
d.silenceDuration = 500 * time.Millisecond
}
}
3. Recover from Errors
func (d *TurnDetector) handlePrematureEnd() {
// If user continues speaking right after we ended turn
if d.userSpeakingWithin(200 * time.Millisecond) {
d.cancelCurrentResponse()
d.resumeListening()
d.silenceDuration += 100 * time.Millisecond // Be more patient
}
}
Next Steps
- VAD Configuration - Voice activity detection
- Interruptions - Handle barge-in
- Latency Optimization - Reduce response time