Add streaming Silero VAD runner for real-time speech detection #18507
Merged
seyeong-han merged 4 commits into pytorch:main on Mar 31, 2026
Add a new `silero_vad_stream_runner` CLI that reads 16kHz mono float32 PCM from stdin and outputs per-frame speech probabilities via a simple line protocol (`PROB <time> <probability>`). This enables real-time VAD as a subprocess for apps like the Voxtral Realtime macOS dictation app.
Changes:
- Add `reset_stream()` and `process_frame()` to `SileroVadRunner` for stateful frame-by-frame inference with persistent LSTM state
- Add `stream_main.cpp` as the streaming CLI entry point
- Update `CMakeLists.txt` to build both `silero_vad_runner` (offline) and `silero_vad_stream_runner` (streaming) targets
- Remove unnecessary `extension_llm_runner` dependency that caused build conflicts with sentencepiece headers
- Update Makefile `silero-vad-cpu` target to build both runners with `-DEXECUTORCH_BUILD_EXTENSION_LLM_RUNNER=OFF`
- Update README with streaming usage and architecture docs
Authored with assistance from Claude.
Made-with: Cursor
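The stateful, frame-by-frame interface named in this commit message can be illustrated with a toy stand-in. The sketch below is not the actual `SileroVadRunner`: the class `ToyStreamingVad`, its energy-based recurrence (a placeholder for the LSTM), the `float` return type, and the `detect()` signature are all assumptions made for illustration. Only the method names `reset_stream()` and `process_frame()`, the 512-sample frame size, and the idea that `detect()` is built on `process_frame()` come from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy class; it only mirrors the shape of the streaming API.
class ToyStreamingVad {
 public:
  static constexpr std::size_t kFrameSamples = 512;

  // Forget everything carried over from earlier frames.
  void reset_stream() { state_ = 0.0f; }

  // Process one frame and return a score in [0, 1).
  // The carried state is a running energy average, not a real LSTM state.
  float process_frame(const float* audio_data, std::size_t num_samples) {
    assert(num_samples == kFrameSamples);
    float energy = 0.0f;
    for (std::size_t i = 0; i < num_samples; ++i) {
      energy += audio_data[i] * audio_data[i];
    }
    energy /= static_cast<float>(num_samples);
    state_ = 0.5f * state_ + 0.5f * energy;  // state persists across calls
    return state_ / (state_ + 0.01f);
  }

  // Offline path: reset, then reuse process_frame() for every full frame.
  std::vector<float> detect(const std::vector<float>& audio) {
    reset_stream();
    std::vector<float> probs;
    for (std::size_t off = 0; off + kFrameSamples <= audio.size();
         off += kFrameSamples) {
      probs.push_back(process_frame(audio.data() + off, kFrameSamples));
    }
    return probs;
  }

 private:
  float state_ = 0.0f;
};
```

Because the offline loop is only a wrapper around the per-frame call, feeding the same frames one at a time after `reset_stream()` should yield the same sequence of scores as `detect()`.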
mergennachin approved these changes on Mar 26, 2026
Adjust `stream_main.cpp` to match the formatter output so the remaining lintrunner failure is resolved without changing behavior. Authored with assistance from Claude. Made-with: Cursor
Rewrite the silero VAD runner link-library lists to match cmake-format so the remaining lintrunner failure is cleared without changing build behavior. Authored with assistance from Claude. Made-with: Cursor
Summary
Add a streaming CLI entry point (`silero_vad_stream_runner`) for the Silero VAD model that enables real-time, frame-by-frame voice activity detection from stdin. This powers the "hey torch" wake-up feature in the Voxtral Realtime macOS app.
Changes
New: `silero_vad_stream_runner`
A CLI that reads 16kHz mono float32 PCM from stdin and outputs per-frame speech probabilities via a line protocol (`PROB <time> <probability>`).
This enables any app to run Silero VAD as a subprocess: pipe audio in, parse probabilities out. The Voxtral macOS app uses this for hands-free wake-up detection.
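On the input side, a pipe delivers bytes in arbitrary chunk sizes, so something has to regroup them into fixed-size frames before inference. The following is a minimal sketch of that regrouping, assuming native-endian float32 samples and the 512-sample frame size this PR uses for `process_frame()`. The `FrameAssembler` class and its methods are hypothetical names, not code from `stream_main.cpp`.

```cpp
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

constexpr std::size_t kFrameSamples = 512;

// Hypothetical helper: buffers raw bytes and hands back complete frames.
class FrameAssembler {
 public:
  // Append `len` raw bytes and return every frame that is now complete.
  // Leftover bytes stay buffered until a later call completes the frame.
  std::vector<std::vector<float>> push(const char* data, std::size_t len) {
    pending_.insert(pending_.end(), data, data + len);
    const std::size_t frame_bytes = kFrameSamples * sizeof(float);
    std::vector<std::vector<float>> frames;
    std::size_t offset = 0;
    while (pending_.size() - offset >= frame_bytes) {
      std::vector<float> frame(kFrameSamples);
      // memcpy avoids alignment and aliasing problems with raw byte buffers.
      std::memcpy(frame.data(), pending_.data() + offset, frame_bytes);
      frames.push_back(std::move(frame));
      offset += frame_bytes;
    }
    pending_.erase(pending_.begin(),
                   pending_.begin() + static_cast<std::ptrdiff_t>(offset));
    return frames;
  }

  // Number of buffered bytes that do not yet form a full frame.
  std::size_t pending_bytes() const { return pending_.size(); }

 private:
  std::vector<char> pending_;
};
```

A caller would typically loop on a blocking read from stdin, pass each chunk to `push()`, and run inference once per returned frame.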
New: Streaming API on `SileroVadRunner`
- `reset_stream()`: re-initialize LSTM state and context buffers
- `process_frame(audio_data, num_samples)`: process a single 512-sample chunk, return speech probability, carry LSTM state forward
The existing `detect()` method now uses `process_frame()` internally, so offline and streaming paths share the same inference code.
Build changes
- `CMakeLists.txt`: add `silero_vad_stream_runner` target alongside `silero_vad_runner`
- Remove `extension_llm_runner` link dependency that caused `string_view` ambiguity with sentencepiece headers
- `Makefile` `silero-vad-cpu` target: build both runners, configure with `-DEXECUTORCH_BUILD_EXTENSION_LLM_RUNNER=OFF`
- `README.md`: document streaming usage, architecture, and line protocol
Usage
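On the consuming side, a parent process only needs to split each output line into its fields. Here is a minimal sketch of such a parser, assuming lines of the form `PROB <time> <probability>` as given in the commit message. The `ProbLine` struct, the `parse_prob_line` function, the choice to reject probabilities outside [0, 1], and the choice to reject trailing tokens are all assumptions for illustration; the unit of `<time>` is not stated in this PR, so the field is left unitless.

```cpp
#include <sstream>
#include <string>

// Hypothetical container for one parsed protocol line.
struct ProbLine {
  double time = 0.0;         // value of <time>; unit not specified in the PR
  double probability = 0.0;  // value of <probability>
};

// Returns true and fills `out` only for a well-formed PROB line.
// Any other line (different tag, missing or extra fields) returns false.
bool parse_prob_line(const std::string& line, ProbLine* out) {
  std::istringstream in(line);
  std::string tag;
  double time = 0.0;
  double probability = 0.0;
  if (!(in >> tag) || tag != "PROB") return false;
  if (!(in >> time >> probability)) return false;
  std::string extra;
  if (in >> extra) return false;  // unexpected trailing token
  if (probability < 0.0 || probability > 1.0) return false;
  out->time = time;
  out->probability = probability;
  return true;
}
```

Returning `false` for unrecognized lines lets the caller skip anything that is not a probability report without aborting the stream.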
Test plan
- `make silero-vad-cpu` builds both `silero_vad_runner` and `silero_vad_stream_runner`
Authored with assistance from Claude.
Made with Cursor