How to extract speech segments from audio using Silero VAD and ONNX

Voice Activity Detection (VAD) finds the spans of raw audio that contain speech, so transcription or analysis can skip everything else, saving time and improving accuracy. This walkthrough pairs Silero VAD’s lightweight ONNX model with ONNX Runtime to detect speech faster than real time on a CPU alone. In the test reported here, inference over a 14.171-second conversation took just 0.028 seconds, which makes the approach practical for uses ranging from call center analytics to podcast editing.

Why isolate speech segments before processing

Long audio files often contain extended periods of silence, background noise, or non-speech audio. These segments add unnecessary processing load and can dilute the effectiveness of downstream tasks like transcription or sentiment analysis. By isolating only the parts where speech occurs, tools can focus computational resources where they matter most.

For example, a 15-minute customer service call might contain only 7 minutes of actual conversation. Extracting those segments first reduces storage needs, speeds up transcription engines, and improves the signal-to-noise ratio for speech recognition models. This preprocessing step is especially valuable when working with batch pipelines or real-time systems constrained by latency.

Setting up the Silero VAD pipeline with ONNX Runtime

The process begins with an MP3 file of a two-person conversation. The audio is first decoded into a 16 kHz, mono waveform using FFmpeg, which matches Silero VAD’s input requirements. The waveform is then divided into fixed-size chunks of 512 samples (32 milliseconds at 16 kHz), and the chunks are passed to the model one at a time, in order.
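The decoding and chunking step can be sketched as follows. This is a minimal illustration, not the lab's actual script: the function names are hypothetical, and zero-padding the final partial chunk is an assumption (an implementation could equally drop it).

```python
from typing import Iterator, List, Sequence

SAMPLE_RATE = 16_000   # Hz, as stated in the article
CHUNK_SAMPLES = 512    # samples per chunk, as stated in the article


def ffmpeg_decode_command(src: str) -> List[str]:
    """Build an FFmpeg command that writes 16 kHz mono 16-bit PCM to stdout."""
    return [
        "ffmpeg", "-nostdin", "-loglevel", "error",
        "-i", src,
        "-ac", "1",               # downmix to mono
        "-ar", str(SAMPLE_RATE),  # resample to 16 kHz
        "-f", "s16le",            # raw signed 16-bit little-endian samples
        "-",                      # write to stdout
    ]


def iter_chunks(samples: Sequence[float], size: int = CHUNK_SAMPLES) -> Iterator[List[float]]:
    """Yield fixed-size chunks; the last one is zero-padded (an assumption)."""
    for start in range(0, len(samples), size):
        chunk = list(samples[start:start + size])
        if len(chunk) < size:
            chunk.extend([0.0] * (size - len(chunk)))
        yield chunk


chunk_ms = 1000 * CHUNK_SAMPLES / SAMPLE_RATE
chunks = list(iter_chunks([0.0] * 1300))
print(chunk_ms, len(chunks), len(chunks[-1]))  # 32.0 3 512
```

The command list would typically be handed to subprocess.run with captured stdout, and the resulting bytes converted to floating-point samples before chunking.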

Key configuration settings control how segments are detected:

Speech threshold: 0.5 — the probability level required to start a speech segment
Negative threshold: 0.35 — the lower level needed to end a segment
Minimum silence duration: 100 ms — prevents premature endings due to brief pauses
Minimum speech duration: 250 ms — filters out ultra-short noise or artifacts
Speech padding: 30 ms — adds small buffers at segment boundaries for smoother extraction
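How these five settings interact is easier to see in code. The sketch below is a simplified illustration of two-threshold (hysteresis) segmentation over per-chunk probabilities. It is not Silero VAD's own post-processing code, and the function name and return format are assumptions.

```python
from typing import List, Optional, Tuple

SAMPLE_RATE = 16_000
CHUNK_SAMPLES = 512


def segment_speech(
    probs: List[float],
    threshold: float = 0.5,
    neg_threshold: float = 0.35,
    min_silence_ms: int = 100,
    min_speech_ms: int = 250,
    speech_pad_ms: int = 30,
) -> List[Tuple[int, int]]:
    """Turn per-chunk speech probabilities into (start, end) sample ranges."""
    min_silence = SAMPLE_RATE * min_silence_ms // 1000
    min_speech = SAMPLE_RATE * min_speech_ms // 1000
    pad = SAMPLE_RATE * speech_pad_ms // 1000
    total = len(probs) * CHUNK_SAMPLES

    raw: List[Tuple[int, int]] = []
    start: Optional[int] = None       # sample where the open segment began
    silence_at: Optional[int] = None  # sample where the candidate silence began
    for i, p in enumerate(probs):
        pos = i * CHUNK_SAMPLES
        if p >= threshold:
            silence_at = None          # speech resumed, cancel the pending end
            if start is None:
                start = pos
        elif p < neg_threshold and start is not None:
            if silence_at is None:
                silence_at = pos
            # Close the segment once the silence has lasted long enough.
            if pos + CHUNK_SAMPLES - silence_at >= min_silence:
                raw.append((start, silence_at))
                start, silence_at = None, None
        # Probabilities between the two thresholds change nothing.
    if start is not None:
        raw.append((start, total))     # audio ended while speech was open

    # Drop segments that are too short, then pad the survivors.
    return [
        (max(0, s - pad), min(total, e + pad))
        for s, e in raw
        if e - s >= min_speech
    ]


probs = [0.1] * 5 + [0.9] * 10 + [0.4] * 2 + [0.9] * 5 + [0.1] * 6 + [0.9] * 3 + [0.1] * 6
print(segment_speech(probs))  # [(2080, 11744)]
```

In this example the two chunks at 0.4 do not end the first segment because they stay above the negative threshold, and the three-chunk burst near the end (96 ms) is discarded for being shorter than the 250 ms minimum.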

The model maintains internal state and carries context from one chunk to the next, so it can detect speech accurately even when an utterance straddles a chunk boundary. This streaming approach is essential for real-world applications where audio arrives continuously.

Running the detection and exporting segments

The full pipeline is reproducible using a lightweight setup. Prerequisites include mise for environment management, uv for dependency resolution, FFmpeg for audio decoding, and curl for file transfers. The process is automated through a set of commands that clone the lab repository, download test assets, and execute the detector:

git clone --depth 1 --filter=blob:none --sparse 
cd labs
git sparse-checkout set .gitignore .mise/tasks Makefile mise.toml 2026/07/03/silero-vad
make download-test-assets
mise -C 2026/07/03/silero-vad run

On first execution, the script automatically downloads the silero_vad.onnx model from the official repository. uv then resolves Python dependencies and launches the inference engine using ONNX Runtime’s CPU execution provider—no GPU required.

Each detected speech interval is saved as a separate WAV file in the output/ directory. The files are encoded as 16-bit PCM, 16 kHz, mono, matching the decoded waveform that was fed to the model. Files left over from earlier runs are deleted before each run so that stale segments do not mix with new ones.
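A minimal sketch of the export step, using only Python's standard wave module, might look like this. The function name and the segment_NNN.wav naming scheme are hypothetical, and the lab's script may organize its output differently.

```python
import struct
import tempfile
import wave
from pathlib import Path
from typing import List, Sequence, Tuple

SAMPLE_RATE = 16_000


def export_segments(
    samples: Sequence[int],
    segments: Sequence[Tuple[int, int]],
    out_dir: Path,
) -> List[Path]:
    """Write each (start, end) sample range as a 16-bit PCM, 16 kHz, mono WAV."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for old in out_dir.glob("segment_*.wav"):  # clear files from earlier runs
        old.unlink()

    paths = []
    for n, (start, end) in enumerate(segments, start=1):
        pcm = samples[start:end]
        path = out_dir / f"segment_{n:03d}.wav"  # hypothetical naming scheme
        with wave.open(str(path), "wb") as wav:
            wav.setnchannels(1)                  # mono
            wav.setsampwidth(2)                  # 2 bytes = 16-bit samples
            wav.setframerate(SAMPLE_RATE)
            wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
        paths.append(path)
    return paths


# Usage: one second of silence, exported as two segments.
with tempfile.TemporaryDirectory() as tmp:
    written = export_segments([0] * 16_000, [(0, 8_000), (8_000, 12_000)], Path(tmp) / "output")
    with wave.open(str(written[0]), "rb") as wav:
        info = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate(), wav.getnframes())
print(len(written), info)  # 2 (1, 2, 16000, 8000)
```

The samples are assumed to be integers already scaled to the signed 16-bit range; floating-point audio would need to be converted first.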

Results show high accuracy with minimal latency

In a test on a Mac Studio with an Apple M4 Max chip, the detector identified 12 speech segments in a 14.171-second conversation. Total processing time for inference alone was just 0.028 seconds, yielding a real-time factor of about 0.002x. In other words, the model processed audio roughly 500 times faster than real time.

The extracted segments contained approximately 11.917 seconds of speech, covering 84.1% of the input. Natural pauses aligned closely with sentence boundaries in Japanese, and even short responses like “yeah” and “thanks” were preserved. Only one utterance was split in two, at a brief pause between two words, which suggests the model responds to actual gaps in speech rather than to isolated noise.
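The real-time factor and coverage figures follow directly from the three measured durations. The short calculation below reproduces them; the variable names are illustrative.

```python
audio_s = 14.171      # length of the input conversation, in seconds
inference_s = 0.028   # time spent in model inference, in seconds
speech_s = 11.917     # total length of the extracted segments, in seconds

rtf = inference_s / audio_s       # real-time factor: lower is faster
speedup = audio_s / inference_s   # how many times faster than real time
coverage = speech_s / audio_s     # share of the input kept as speech

print(f"RTF {rtf:.3f}x, {speedup:.0f}x faster than real time, coverage {coverage:.1%}")
# RTF 0.002x, 506x faster than real time, coverage 84.1%
```

The exact speedup works out to about 506x; the round figure of 500 comes from inverting the rounded real-time factor of 0.002.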

The environment used Python 3.12.11, ONNX Runtime 1.27.0, and the CPU-only execution provider. No warm-up runs were performed, so actual performance may vary slightly in production scenarios.

As voice AI models grow more capable, preprocessing audio to isolate speech becomes a critical efficiency step. Combining Silero VAD with ONNX Runtime delivers fast, CPU-friendly speech segmentation that scales from local scripts to cloud pipelines—without sacrificing accuracy or adding hardware complexity.
