Problem
The Voice Activity Detection (VAD) system is clipping speech too aggressively, resulting in significant accuracy degradation during dictation sessions.
Symptoms
- Leading edge clipping: First words of utterances are frequently cut off
- Aggressive end detection: Segments end very quickly, not capturing complete thoughts
- Compounding effect: In a single conversational turn with 10 segments, if 8 segments have their first word clipped, the transcription is missing 8 words total
- Critical impact: This severely degrades transcription accuracy
Root Cause
Current VAD tuning parameters are too aggressive for natural speech patterns. The Silero VAD v5 configuration needs adjustment to:
- Better capture the start of speech (leading edge)
- Allow more time before declaring end-of-speech
- Provide more tolerance for brief pauses within continuous speech
Current VAD Parameters
From src/vad/VADConfigs.ts:
High Sensitivity (most commonly used for dictation):
positiveSpeechThreshold: 0.35,
negativeSpeechThreshold: 0.2,
redemptionFrames: 12,
minSpeechFrames: 2,
preSpeechPadFrames: 3,
Balanced (default):
positiveSpeechThreshold: 0.4,
negativeSpeechThreshold: 0.25,
redemptionFrames: 10,
minSpeechFrames: 3,
preSpeechPadFrames: 2,
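The frame counts above are easier to judge as durations. The following is a minimal sketch of that conversion, assuming 16 kHz audio and 512-sample frames (32 ms per frame). That frame size is typical for Silero VAD v5 but is not stated in this issue, so adjust the constants if the project uses something else. `VadTuning`, `framesToMs`, and `describeTiming` are hypothetical names for illustration and are not taken from src/vad/VADConfigs.ts.

```typescript
// ASSUMPTION: 16 kHz input and 512 samples per frame. Not confirmed by the issue.
const SAMPLE_RATE_HZ = 16_000;
const FRAME_SAMPLES = 512;
// Integer arithmetic keeps the result exact: 512 * 1000 / 16000 = 32 ms.
const FRAME_MS = (FRAME_SAMPLES * 1000) / SAMPLE_RATE_HZ;

// Hypothetical shape mirroring the five parameters listed in the issue.
interface VadTuning {
  positiveSpeechThreshold: number;
  negativeSpeechThreshold: number;
  redemptionFrames: number;
  minSpeechFrames: number;
  preSpeechPadFrames: number;
}

const highSensitivity: VadTuning = {
  positiveSpeechThreshold: 0.35,
  negativeSpeechThreshold: 0.2,
  redemptionFrames: 12,
  minSpeechFrames: 2,
  preSpeechPadFrames: 3,
};

const balanced: VadTuning = {
  positiveSpeechThreshold: 0.4,
  negativeSpeechThreshold: 0.25,
  redemptionFrames: 10,
  minSpeechFrames: 3,
  preSpeechPadFrames: 2,
};

// Convert a frame count to milliseconds under the assumed frame size.
function framesToMs(frames: number): number {
  return frames * FRAME_MS;
}

// Summarize the time-based parameters of a preset as durations.
function describeTiming(t: VadTuning) {
  return {
    preSpeechPadMs: framesToMs(t.preSpeechPadFrames),
    redemptionMs: framesToMs(t.redemptionFrames),
    minSpeechMs: framesToMs(t.minSpeechFrames),
  };
}

console.log("highSensitivity", describeTiming(highSensitivity));
console.log("balanced", describeTiming(balanced));
```

Under these assumptions, High Sensitivity keeps 96 ms of audio before detected speech and waits 384 ms before ending a segment; Balanced keeps 64 ms and waits 320 ms.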
Recommended Investigation
- Increase preSpeechPadFrames: Currently 2-3 frames. Try 5-8 frames to capture more leading speech
- Increase redemptionFrames: Currently 10-12 frames. Try 15-20 frames for longer tail capture and tolerance for brief pauses
- Adjust negativeSpeechThreshold: Currently 0.2-0.25. Try 0.15-0.2 to avoid premature segment ends
- Test with real audio: Use actual conversation recordings to validate parameter changes
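One way to explore the suggested ranges is to generate a small grid of candidate configurations from an existing preset and evaluate each against the same recordings. The sketch below starts from the Balanced values and uses the endpoints of each suggested range. `VadTuning` and `buildCandidates` are hypothetical names for illustration; how candidates are fed into the VAD depends on the project's code and is not shown.

```typescript
// Hypothetical shape mirroring the five parameters listed in the issue.
interface VadTuning {
  positiveSpeechThreshold: number;
  negativeSpeechThreshold: number;
  redemptionFrames: number;
  minSpeechFrames: number;
  preSpeechPadFrames: number;
}

// Balanced preset as quoted in the issue.
const balancedBase: VadTuning = {
  positiveSpeechThreshold: 0.4,
  negativeSpeechThreshold: 0.25,
  redemptionFrames: 10,
  minSpeechFrames: 3,
  preSpeechPadFrames: 2,
};

// Build every combination of the three parameters under investigation,
// leaving the remaining parameters at their base values.
function buildCandidates(
  base: VadTuning,
  padFrames: number[],
  redemptionFrames: number[],
  negativeThresholds: number[],
): VadTuning[] {
  const candidates: VadTuning[] = [];
  for (const pad of padFrames) {
    for (const redemption of redemptionFrames) {
      for (const negative of negativeThresholds) {
        // Skip combinations where the end threshold is not below the
        // start threshold, since the two are meant to form a hysteresis band.
        if (negative >= base.positiveSpeechThreshold) continue;
        candidates.push({
          ...base,
          preSpeechPadFrames: pad,
          redemptionFrames: redemption,
          negativeSpeechThreshold: negative,
        });
      }
    }
  }
  return candidates;
}

// Endpoints of the ranges suggested in the issue: 2 x 2 x 2 = 8 candidates.
const candidates = buildCandidates(balancedBase, [5, 8], [15, 20], [0.15, 0.2]);
console.log(`${candidates.length} candidate configurations`);
```

Sweeping only the range endpoints keeps the number of test runs small; intermediate values can be added once the direction of improvement is clear.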
Testing Approach
- Enable keepSegments config to persist captured audio segments
- Record test dictation sessions with known content
- Compare captured WAV files against expected speech
- Measure:
  - Number of segments per utterance
  - Percentage of segments with leading/trailing clipping
  - Word accuracy before/after parameter tuning
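The three measurements above can be computed from manually reviewed segments and a reference transcript. The sketch below is one possible approach: `SegmentReview` is a hypothetical record produced by a person listening to each saved WAV file, and word accuracy is approximated as 1 minus word error rate, computed with a word-level edit distance. None of these names come from the project.

```typescript
// Hypothetical manual review record for one captured segment.
interface SegmentReview {
  utteranceId: string;
  leadingClipped: boolean;
  trailingClipped: boolean;
}

// Percentage of segments with leading or trailing clipping.
function clippingRate(segments: SegmentReview[]): number {
  if (segments.length === 0) return 0;
  const clipped = segments.filter((s) => s.leadingClipped || s.trailingClipped).length;
  return (clipped * 100) / segments.length;
}

// Average number of segments per distinct utterance.
function segmentsPerUtterance(segments: SegmentReview[]): number {
  const utterances = new Set(segments.map((s) => s.utteranceId));
  return utterances.size === 0 ? 0 : segments.length / utterances.size;
}

function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
}

// Word error rate: word-level Levenshtein distance divided by reference length.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const cur: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      cur[j] = Math.min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = cur;
  }
  return prev[hyp.length] / ref.length;
}

// Example mirroring the issue: one turn, 10 segments, 8 with a clipped first word.
const reviewed: SegmentReview[] = Array.from({ length: 10 }, (_, i) => ({
  utteranceId: "turn-1",
  leadingClipped: i < 8,
  trailingClipped: false,
}));

console.log("clipping %:", clippingRate(reviewed));
console.log("segments per utterance:", segmentsPerUtterance(reviewed));
console.log("WER:", wordErrorRate("please send the report today", "send the report today"));
```

Running the same script on sessions recorded before and after a parameter change gives directly comparable numbers for each candidate configuration.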
Related Code
Priority
Critical - This issue directly impacts core transcription accuracy and user experience.
Discovered during testing of dual-phase transcription feature (#256)