How a driverless Windows app captures system audio for real-time translation

A core challenge in building audio-processing tools for Windows is capturing what the user actually hears without installing drivers, virtual audio cables, or bots. A new open-source app called Voxis solves this by tapping into the system’s post-mix output directly, processing it in real time, and playing back translations—all without interrupting the original audio stream.

Developed in Python, Voxis demonstrates how to capture system audio at 16 kHz mono using Windows’ built-in APIs, then stream it to a translation model while avoiding common pitfalls like feedback loops and latency spikes. The project also highlights the importance of real-time safety, especially when bridging audio capture with computationally heavy tasks like speech recognition and translation.

The core constraints: no drivers, no feedback, no lag

Voxis was designed to meet three strict requirements from day one:

Driverless operation. If a tool requires a driver install or system reboot, it fails the zero-setup test. Users shouldn’t need admin rights or technical know-how to get started.

No self-capture. When the app plays translated audio back into the system, it must not re-capture its own output. Otherwise, the system would translate its own translations—a feedback loop that breaks the entire process.

Real-time safety. The audio capture loop cannot stall. If downstream processing slows down, the system must gracefully drop old packets rather than causing buffer overflows or glitches.

These constraints shaped every line of Voxis’s audio capture engine, which relies entirely on APIs introduced in Windows 10 version 2004.

Capturing system audio with ApplicationLoopback

Windows 10 version 2004 introduced the ApplicationLoopback API, a feature that allows developers to activate an IAudioClient in loopback mode—but with a crucial twist: it can exclude the current process from the captured mix. This is exactly what Voxis needs to avoid capturing its own output.

To use this API, you don’t go through the standard IMMDeviceEnumerator route. Instead, you activate the loopback client by name using ActivateAudioInterfaceAsync, passing a PROPVARIANT that contains a BLOB with loopback parameters:

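import ctypes
import os
from ctypes import byref, c_void_p, sizeof

# AUDIOCLIENT_ACTIVATION_PARAMS and PROPVARIANT are assumed to be ctypes
# Structures mirroring the Windows SDK definitions, declared elsewhere.
my_pid = os.getpid()  # the process to exclude from the capture
params = AUDIOCLIENT_ACTIVATION_PARAMS()
params.ActivationType = AUDIOCLIENT_ACTIVATION_TYPE_PROCESS_LOOPBACK
params.u.ProcessLoopbackParams.TargetProcessId = my_pid
params.u.ProcessLoopbackParams.ProcessLoopbackMode = \
    PROCESS_LOOPBACK_MODE_EXCLUDE_TARGET_PROCESS_TREE

pv = PROPVARIANT()
pv.vt = VT_BLOB
pv.blob.cbSize = sizeof(params)
# The blob only points at params, so params must outlive the activation call
pv.blob.pBlobData = ctypes.cast(byref(params), c_void_p)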

The device name is the special string VAD\Process_Loopback. The activation is asynchronous, so the app waits for a completion handler to fire once the client is ready.
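Because activation finishes on a callback, the calling thread needs a way to block until the client arrives. The following is a minimal sketch of that handoff using a threading.Event. The class name ActivationWaiter and its methods are hypothetical and not taken from Voxis; in real code the COM completion handler would read the activation result and forward it to on_completed. Here a timer thread stands in for the callback so the sketch runs anywhere.

```python
import threading


class ActivationWaiter:
    """Hypothetical helper: passes an activation result from a callback thread to the caller."""

    def __init__(self):
        self._done = threading.Event()
        self._hresult = None
        self._client = None

    def on_completed(self, hresult, client):
        # Called on the callback thread (in real code, from the COM
        # completion handler after it has read the activation result).
        self._hresult = hresult
        self._client = client
        self._done.set()

    def wait(self, timeout=5.0):
        # Block the caller until the callback fires, or fail loudly on timeout.
        if not self._done.wait(timeout):
            raise TimeoutError("audio interface activation did not complete")
        if self._hresult != 0:  # 0 is S_OK
            raise OSError(
                f"activation failed, HRESULT=0x{self._hresult & 0xFFFFFFFF:08X}"
            )
        return self._client


# Simulate the asynchronous callback with a timer thread.
waiter = ActivationWaiter()
threading.Timer(0.05, waiter.on_completed, args=(0, "fake-audio-client")).start()
client = waiter.wait(timeout=2.0)
```

Raising on a non-zero HRESULT keeps a failed activation from surfacing later as a confusing None client.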

The IAgileObject trap: why COM objects must declare apartment agnosticism

During development, one unexpected error cost an entire afternoon: ActivateAudioInterfaceAsync returned E_ILLEGAL_METHOD_CALL with no explanation. The root cause was subtle. The completion handler implemented only IActivateAudioInterfaceCompletionHandler, but WASAPI also requires the handler to implement IAgileObject, a marker interface with no methods that signals the object can be called from any apartment.

The fix was simple once identified. Adding IAgileObject to the COM interface list resolved the issue:

class _Handler(COMObject):
    _com_interfaces_ = [
        IActivateAudioInterfaceCompletionHandler,
        IAgileObject
    ]

IAgileObject doesn’t add any methods; it’s purely a promise that the object is apartment-agnostic. Without it, WASAPI refuses to proceed—even though the error message never mentions it.

Requesting the exact audio format to avoid resampling overhead

Another key optimization was eliminating unnecessary resampling. Voxis initializes the loopback client with the specific format the translation model expects: 16 kHz, mono, 16-bit PCM. This matches the input requirements of most speech-to-text and translation services, so no conversion is needed in the hot path:

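wfx = WAVEFORMATEX()
wfx.wFormatTag = WAVE_FORMAT_PCM
wfx.nChannels = 1
wfx.nSamplesPerSec = 16000
wfx.wBitsPerSample = 16
# Derived fields must agree with the three values above
wfx.nBlockAlign = wfx.nChannels * wfx.wBitsPerSample // 8
wfx.nAvgBytesPerSec = wfx.nSamplesPerSec * wfx.nBlockAlign
wfx.cbSize = 0  # plain PCM carries no extra format bytes

client.Initialize(
    AUDCLNT_SHAREMODE_SHARED,
    AUDCLNT_STREAMFLAGS_LOOPBACK,
    2_000_000,  # 200 ms buffer in 100-ns units
    0,          # periodicity, 0 in shared mode
    byref(wfx),
    None        # no audio session GUID
)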

Requesting the target format up front keeps resampling and downmixing out of the Python hot path, which saves CPU time there and trims latency. Both matter for real-time applications.
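The numbers behind this format are easy to check by hand. The short calculation below is a sketch and not code from Voxis. It derives the per-frame and per-second byte counts (the values WAVEFORMATEX calls nBlockAlign and nAvgBytesPerSec) and converts the 2,000,000-unit buffer duration into milliseconds, frames, and bytes. The variable names are illustrative.

```python
SAMPLE_RATE = 16_000               # Hz
CHANNELS = 1                       # mono
BITS_PER_SAMPLE = 16
BUFFER_DURATION_100NS = 2_000_000  # value passed to Initialize

# One frame holds one sample per channel.
bytes_per_frame = CHANNELS * BITS_PER_SAMPLE // 8    # nBlockAlign
bytes_per_second = SAMPLE_RATE * bytes_per_frame     # nAvgBytesPerSec

# 10,000 units of 100 ns make one millisecond.
buffer_ms = BUFFER_DURATION_100NS // 10_000
buffer_frames = SAMPLE_RATE * BUFFER_DURATION_100NS // 10_000_000
buffer_bytes = buffer_frames * bytes_per_frame

print(bytes_per_frame, bytes_per_second, buffer_ms, buffer_frames, buffer_bytes)
```

At this format the whole 200 ms buffer is only a few kilobytes, so copying packets out of it is cheap. The capture thread's real risk is being blocked, not the volume of data.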

Splitting capture and processing with a bounded queue

The toughest part of real-time audio capture isn’t reading the data—it’s ensuring the capture thread never stalls. If downstream processing (like voice activity detection or translation) slows down, the Windows Audio Session API (WASAPI) ring buffer could overflow, causing glitches or crashes.

Voxis solves this by running capture and processing on separate threads, connected by a bounded queue:

Capture thread:
Calls GetNextPacketSize and GetBuffer, copies the data into a NumPy array, then calls ReleaseBuffer
Appends the packet to a collections.deque with a fixed maximum length
Never blocks or runs heavy processing

Processor thread:
Drains the queue and handles VAD, translation, and playback
Can afford to be slower because it only processes what’s available
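The division of labour between the two threads can be sketched without any audio hardware. In the sketch below a list of fake packets stands in for the WASAPI read loop and plain bytes stand in for the NumPy arrays. The function names capture_loop and processor_loop are hypothetical, and the processor only records packets where the real one would run VAD, translation, and playback.

```python
import collections
import threading
import time

queue = collections.deque(maxlen=64)
stop = threading.Event()
processed = []


def capture_loop(packet_source):
    # Stand-in for the WASAPI read loop: copy each packet and enqueue it.
    # append() never blocks; a full deque discards its oldest item instead.
    for packet in packet_source:
        queue.append(bytes(packet))


def processor_loop():
    # Drain whatever is available; heavy work belongs on this thread.
    while not stop.is_set() or queue:
        try:
            packet = queue.popleft()
        except IndexError:
            time.sleep(0.001)  # queue is empty, yield briefly
            continue
        processed.append(packet)


# Each fake packet is 320 bytes: 10 ms of 16 kHz mono 16-bit PCM.
fake_packets = [bytes([i]) * 320 for i in range(10)]
capture = threading.Thread(target=capture_loop, args=(fake_packets,))
processor = threading.Thread(target=processor_loop)
processor.start()
capture.start()
capture.join()
stop.set()       # tell the processor to exit once the queue is empty
processor.join()
```

The only shared state is the deque, whose append and popleft operations are thread-safe in CPython, so this sketch needs no explicit lock.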

The queue is defined with maxlen=64, which caps how much unprocessed audio can pile up:

self._queue = collections.deque(maxlen=64)  # ~one buffer's worth of packets

If the processor lags, the oldest packets are dropped, which bounds latency instead of blocking the capture thread. This single design choice keeps stalls in the processor, such as a slow translation call or a network delay, from backing up into the WASAPI buffer.
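One consequence of this behaviour is that a deque with maxlen discards data silently. The sketch below assumes you want visibility into how often that happens. The wrapper class DropCountingQueue is hypothetical and not part of Voxis. It counts evictions and shows which packets survive when the processor stalls completely.

```python
import collections


class DropCountingQueue:
    """Hypothetical wrapper: a bounded deque that counts discarded packets."""

    def __init__(self, maxlen):
        self._items = collections.deque(maxlen=maxlen)
        self.dropped = 0

    def put(self, packet):
        # A full deque evicts its oldest entry on append, so count that first.
        if len(self._items) == self._items.maxlen:
            self.dropped += 1
        self._items.append(packet)

    def drain(self):
        # Hand everything queued so far to the caller, oldest first.
        items = []
        while self._items:
            items.append(self._items.popleft())
        return items


q = DropCountingQueue(maxlen=64)
for seq in range(100):   # the processor is stalled, so nothing drains
    q.put(seq)
remaining = q.drain()
print(q.dropped, remaining[0], remaining[-1])
```

With a concurrent consumer, the length check and the append are not one atomic step, so the counter can slightly overestimate drops. For a health metric that is usually acceptable.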

Ducking system audio without touching the audio stream

When Voxis plays a translation, it needs to lower the volume of the original audio to avoid overlapping voices. Instead of mixing the audio internally (which would require handling playback, latency, and device routing), the app uses Windows’ session-volume API (ISimpleAudioVolume via the pycaw library).

This API lets Voxis adjust the volume of the specific audio session associated with the app playing the original content. The adjustment is temporary: when the translation stops, the volume returns to normal. This approach has three key benefits:

The original audio continues playing through its normal path
No added latency or routing complexity
No need to manage playback or device selection

This technique is especially useful for applications like games or video calls, where preserving the original audio experience is critical.
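The save, lower, and restore pattern behind ducking fits naturally into a context manager. The sketch below is an illustration and not Voxis's code. The helper name ducked and the FakeSessionVolume class are hypothetical, and the fake exists only so the example runs on any platform. On Windows the volume object would be an ISimpleAudioVolume, which pycaw is assumed to expose through an audio session's SimpleAudioVolume attribute. Only its GetMasterVolume and SetMasterVolume methods are used.

```python
from contextlib import contextmanager


@contextmanager
def ducked(volume, level=0.2):
    """Temporarily lower a session's volume and restore it afterwards."""
    original = volume.GetMasterVolume()
    # Never raise the volume if the session is already quieter than `level`.
    volume.SetMasterVolume(min(level, original), None)
    try:
        yield
    finally:
        # Restore even if playback of the translation raises.
        volume.SetMasterVolume(original, None)


class FakeSessionVolume:
    """Stand-in for an ISimpleAudioVolume object (illustration only)."""

    def __init__(self, level):
        self.level = level

    def GetMasterVolume(self):
        return self.level

    def SetMasterVolume(self, level, event_context):
        self.level = level


session_volume = FakeSessionVolume(0.8)
with ducked(session_volume, level=0.2):
    during = session_volume.level   # the translation would play here
after = session_volume.level
```

Putting the restore in a finally block means a failed playback cannot leave the user's original audio stuck at the lowered volume.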

The road ahead: expanding beyond loopback capture

The driverless path relies solely on the ApplicationLoopback API, but the project also includes a secondary path for users who install a virtual audio cable. In that mode, Voxis can apply advanced techniques like mid/side center suppression to isolate dialogue and improve translation accuracy.
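The article does not describe how Voxis implements this step, so the sketch below shows only the standard mid/side decomposition that such techniques build on. The function names are illustrative. Content panned to the center, which often includes dialogue, is identical in both channels and lands entirely in the mid signal. The side signal holds whatever differs between left and right.

```python
def mid_side(left, right):
    """Split stereo samples into mid (common) and side (difference) signals."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def to_left_right(mid, side):
    """Inverse transform: rebuild the stereo pair from mid and side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


# First sample is center-panned; the other two are out of phase.
left = [1.0, 0.5, -0.25]
right = [1.0, -0.5, 0.25]
mid, side = mid_side(left, right)
rebuilt_left, rebuilt_right = to_left_right(mid, side)
```

Because the transform is invertible, either component can be attenuated and the pair converted back to left and right.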

Looking forward, the team behind Voxis is exploring ways to integrate noise suppression, echo cancellation, and multi-channel audio support—all while maintaining the driverless promise. The goal remains clear: deliver seamless real-time translation without disrupting the user’s workflow or requiring technical setup.

