OBLAIDISH NEWS
Voxis captures Windows system audio driverlessly


Voxis ships a Windows-only audio capture engine that reads the post-mix output at 16 kHz and streams it to a Gemini translation model without installing any virtual audio driver [DevTo].

The engine needs no virtual audio cable [DevTo]. Instead, the app activates the ApplicationLoopback API introduced in Windows 10 version 2004, creating an IAudioClient in process-loopback mode that excludes the caller’s own process tree, so Voxis’s own playback stays out of the capture [Microsoft Docs].

The activation call, ActivateAudioInterfaceAsync, takes the magic device name VAD\Process_Loopback and a PROPVARIANT of type VT_BLOB wrapping an AUDIOCLIENT_ACTIVATION_PARAMS structure set to PROCESS_LOOPBACK_MODE_EXCLUDE_TARGET_PROCESS_TREE. The COM completion handler must implement both IActivateAudioInterfaceCompletionHandler and the marker interface IAgileObject; omitting the latter triggers E_ILLEGAL_METHOD_CALL.
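The activation parameters can be sketched in Python with ctypes. This is a minimal illustration, not Voxis's actual code: the constant values are taken from the Windows SDK header audioclientactivationparams.h, the SDK's single-member union is flattened into a plain nested structure, and the helper name build_activation_blob is hypothetical. The Windows-only step of wrapping the bytes in a PROPVARIANT and calling ActivateAudioInterfaceAsync is left out.

```python
import ctypes
import os

# Values as defined in the Windows SDK header audioclientactivationparams.h.
AUDIOCLIENT_ACTIVATION_TYPE_PROCESS_LOOPBACK = 1
PROCESS_LOOPBACK_MODE_EXCLUDE_TARGET_PROCESS_TREE = 1
VIRTUAL_AUDIO_DEVICE_PROCESS_LOOPBACK = "VAD\\Process_Loopback"


class AUDIOCLIENT_PROCESS_LOOPBACK_PARAMS(ctypes.Structure):
    _fields_ = [
        ("TargetProcessId", ctypes.c_uint32),      # DWORD
        ("ProcessLoopbackMode", ctypes.c_uint32),  # 32-bit enum
    ]


class AUDIOCLIENT_ACTIVATION_PARAMS(ctypes.Structure):
    # The SDK wraps the second member in a single-member union;
    # a plain nested struct has the same memory layout.
    _fields_ = [
        ("ActivationType", ctypes.c_uint32),       # 32-bit enum
        ("ProcessLoopbackParams", AUDIOCLIENT_PROCESS_LOOPBACK_PARAMS),
    ]


def build_activation_blob(pid: int) -> bytes:
    """Return the raw bytes that the VT_BLOB PROPVARIANT would point at.

    Hypothetical helper: capture everything except `pid` and its children.
    """
    params = AUDIOCLIENT_ACTIVATION_PARAMS()
    params.ActivationType = AUDIOCLIENT_ACTIVATION_TYPE_PROCESS_LOOPBACK
    params.ProcessLoopbackParams.TargetProcessId = pid
    params.ProcessLoopbackParams.ProcessLoopbackMode = (
        PROCESS_LOOPBACK_MODE_EXCLUDE_TARGET_PROCESS_TREE
    )
    return bytes(params)


# Exclude the current process tree, as the article describes.
blob = build_activation_blob(os.getpid())
```

The blob is three 32-bit fields, so its size is what the PROPVARIANT's cbSize member would carry.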

Voxis requests a 16 kHz, mono, 16-bit PCM format directly, matching the Gemini model’s input, so Voxis’s own capture path contains no resampling step [DevTo]. The client is initialized with a 200 ms buffer. Capture and processing run on separate threads linked by a bounded collections.deque(maxlen=64). The capture thread only calls GetBuffer, copies the bytes into a NumPy array, and calls ReleaseBuffer; it never blocks on downstream VAD or network code.
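The format and buffer figures above imply some simple arithmetic, sketched here with the standard library only. The WAVEFORMATEX field order and the 100-nanosecond REFERENCE_TIME unit come from the Windows SDK; the variable names and the frames_to_samples helper are illustrative assumptions, and struct stands in for the NumPy copy the article mentions.

```python
import struct

SAMPLE_RATE = 16_000      # Hz, from the article
CHANNELS = 1              # mono
BITS_PER_SAMPLE = 16
BUFFER_MS = 200           # buffer length, from the article
WAVE_FORMAT_PCM = 1       # Windows SDK format tag for integer PCM

block_align = CHANNELS * BITS_PER_SAMPLE // 8   # bytes per frame
avg_bytes_per_sec = SAMPLE_RATE * block_align

# WAVEFORMATEX is a packed 18-byte structure:
# wFormatTag, nChannels, nSamplesPerSec, nAvgBytesPerSec,
# nBlockAlign, wBitsPerSample, cbSize
wave_format = struct.pack(
    "<HHIIHHH",
    WAVE_FORMAT_PCM, CHANNELS, SAMPLE_RATE,
    avg_bytes_per_sec, block_align, BITS_PER_SAMPLE, 0,
)

# IAudioClient::Initialize takes the duration as REFERENCE_TIME,
# which counts 100-nanosecond units (10,000 per millisecond).
hns_buffer_duration = BUFFER_MS * 10_000
frames_per_buffer = SAMPLE_RATE * BUFFER_MS // 1000
bytes_per_buffer = frames_per_buffer * block_align


def frames_to_samples(raw: bytes) -> list:
    """Decode little-endian 16-bit PCM bytes into Python ints.

    With NumPy this would be np.frombuffer(raw, dtype="<i2").copy().
    """
    count = len(raw) // block_align
    return list(struct.unpack(f"<{count}h", raw[: count * block_align]))
```

At these settings a full 200 ms buffer holds 3,200 frames, or 6,400 bytes, and the stream runs at 32,000 bytes per second.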

When the translation plays back, Voxis ducks the source application’s audio session via the Windows session-volume API (ISimpleAudioVolume through pycaw) instead of mixing, keeping the original path untouched and avoiding self-feedback [Microsoft Docs].
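The ducking step can be sketched as a context manager that lowers a session's volume and always restores it. This is an assumption-laden illustration rather than Voxis's code: ducked and FakeVolume are hypothetical names, the 0.25 factor is arbitrary, and the only real API relied on is the GetMasterVolume/SetMasterVolume pair of ISimpleAudioVolume. The fake object lets the sketch run without Windows.

```python
from contextlib import contextmanager


@contextmanager
def ducked(volume, factor=0.25):
    """Temporarily scale a session's master volume, then restore it.

    `volume` is any object exposing ISimpleAudioVolume's
    GetMasterVolume() and SetMasterVolume(level, event_context).
    """
    original = volume.GetMasterVolume()
    volume.SetMasterVolume(original * factor, None)
    try:
        yield
    finally:
        # Restore even if playback of the translation raises.
        volume.SetMasterVolume(original, None)


class FakeVolume:
    """Stand-in for a real session volume so the sketch runs anywhere."""

    def __init__(self, level=1.0):
        self.level = level

    def GetMasterVolume(self):
        return self.level

    def SetMasterVolume(self, level, event_context):
        self.level = level


source = FakeVolume(0.8)
with ducked(source, factor=0.25):
    during = source.GetMasterVolume()   # lowered while translation plays
after = source.GetMasterVolume()        # back to the original level

# On Windows with pycaw installed, the real object would come from
# something like the following (untested here, process name hypothetical):
#   from pycaw.pycaw import AudioUtilities
#   for session in AudioUtilities.GetAllSessions():
#       if session.Process and session.Process.name() == "player.exe":
#           with ducked(session.SimpleAudioVolume):
#               ...  # play the translated audio
```

Restoring in a finally clause matters because a failed playback would otherwise leave the source application quiet.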

The implementation shows that Python, with COM interop via comtypes, can keep up with the real-time deadlines of WASAPI loopback, challenging the assumption that only native C++ code can achieve low-latency capture on Windows [DevTo]. Because the capture thread never waits on the queue, a slow downstream stage cannot stall it and overflow the WASAPI ring buffer; once the deque is full, the oldest chunks are discarded instead. Other developers can adopt the same pattern for live transcription, noise suppression, or game audio analytics.
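The bounded-queue handoff can be sketched as follows. The ChunkQueue class and its dropped counter are hypothetical additions for illustration; only collections.deque(maxlen=64) comes from the article. The 6,400-byte chunks assume one full 200 ms buffer of 16 kHz mono 16-bit audio per push, which may not match the packet sizes Voxis actually sees.

```python
from collections import deque


class ChunkQueue:
    """Non-blocking handoff between a capture thread and a consumer.

    deque.append and deque.popleft are thread-safe in CPython, and a
    deque with maxlen silently discards the oldest item when full.
    """

    def __init__(self, maxlen=64):
        self._chunks = deque(maxlen=maxlen)
        self.dropped = 0  # approximate under concurrency, exact here

    def push(self, chunk: bytes) -> None:
        # Called by the capture thread: never blocks, never raises.
        if len(self._chunks) == self._chunks.maxlen:
            self.dropped += 1
        self._chunks.append(chunk)

    def pop(self):
        # Called by the processing thread: returns None when empty.
        try:
            return self._chunks.popleft()
        except IndexError:
            return None

    def __len__(self):
        return len(self._chunks)


# Simulate a stalled consumer: 100 chunks arrive before anything is read.
queue = ChunkQueue(maxlen=64)
for seq in range(100):
    queue.push(seq.to_bytes(2, "little") * 3200)  # 6,400 bytes each

oldest = queue.pop()  # chunk 36: chunks 0..35 were discarded
```

Dropping the oldest audio trades completeness for bounded latency, which suits live translation where stale speech is worth less than fresh speech.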

By exposing the hidden IAgileObject requirement and proving that a pure-Python stack can sustain a 200 ms buffer without glitches, the project lowers the barrier for anyone building real-time audio processors on Windows [DevTo].
