Optimizing Gemma 4 for offline Windows speech recognition in .NET apps

A new Windows desktop application is demonstrating how small yet powerful AI models can enable fully offline speech recognition without compromising performance. Built with .NET 10 and Avalonia UI, Parlotype lets users activate voice-to-text via a global hotkey, with all processing handled locally—no audio ever leaves the machine. The app recently integrated Google's Gemma 4, released in April 2026, as an alternative to the existing Whisper.net pipeline, giving users a choice between engines while maintaining consistent audio capture and text injection workflows.

Why Offline Speech Recognition Matters for Desktop Apps

Most voice-to-text solutions depend on cloud services, sending audio to remote servers for transcription. While convenient, this approach raises privacy concerns and requires reliable internet connectivity. Parlotype takes a different route by performing all speech recognition locally, ensuring complete data privacy regardless of the underlying model. The developer initially relied on Whisper.net for its robust performance on clean English dictation, but noticed performance drops with conversational or noisy audio.

Google's Gemma 4 introduced a conformer audio encoder that achieves 4.17% word error rate (WER) on LibriSpeech-test-clean benchmarks—performance comparable to much larger Whisper variants. For typical desktop dictation scenarios, where users speak clearly into focused text fields, Gemma 4's architecture offers a compelling alternative. The app now lets users toggle between Whisper and Gemma 4 engines in settings, with all audio processing handled through local pipelines including WASAPI capture, Silero voice activity detection (VAD), and direct text injection.

Choosing the Right Runtime: Why llama-server Won

Selecting an inference runtime for Gemma 4 required balancing several constraints: local processing, Windows desktop compatibility, single-installer deployment, cross-vendor GPU support, and avoiding Python dependencies. The developer evaluated multiple options before settling on llama-server, the HTTP server from llama.cpp.

Several alternatives fell short:

onnxruntime-genai lacked support for Gemma 4's architecture, particularly its per-layer embeddings and variable head dimensions (tracking issue #2062)
Python sidecar solutions introduced unnecessary dependencies like Python runtime and CUDA, complicating the installer for non-technical users
LLamaSharp's P/Invoke bindings required recompilation when switching between Vulkan and CUDA builds
Ollama didn't support Gemma audio integration at the time of evaluation
Lemonade was limited to AMD GPUs only

llama-server provided a unified solution with pre-built Vulkan and CUDA binaries for Windows, a stable OpenAI-compatible HTTP API at /v1/chat/completions with audio input support, and a release cadence that could be managed through in-app updates. This approach satisfied all deployment requirements while maintaining flexibility for future updates.

Benchmarking Five Gemma 4 Variants: Trade-offs Revealed

The Gemma 4 GGUF repository offers five quantization variants, each with distinct performance characteristics. The developer conducted comprehensive benchmarking against Whisper variants (Small, Medium, LargeV3Turbo) using 50 samples from LibriSpeech's test-other dataset—the more challenging English split. All tests ran on the same machine with CUDA acceleration and consistent warm-up methodology.

The results revealed surprising trade-offs between accuracy, speed, and model size:

Rank   Engine          Model                WER %   CER %   RTF   Load time (s)
1      Whisper (CUDA)  LargeV3Turbo         11.48   4.97    0.055  1.31
2      Whisper (CUDA)  Medium               12.18   5.41    0.073  1.28
3      Whisper (CUDA)  Small                13.10   5.87    0.034  0.71
4      Gemma 4         E2B-it-BF16           13.15   4.95    0.038  6.70
5      Gemma 4         E4B-it-Q4_K_M         13.82   5.80    0.038  6.73
6      Gemma 4         E4B-it-BF16           14.20   5.40    0.038  6.72
7      Gemma 4         E4B-it-Q8_0           14.39   5.79    0.044  9.25
8      Gemma 4         E2B-it-Q8_0           19.22   8.95    0.315  6.74

Key insights emerged from the data:

E2B-it-BF16 achieved the lowest character error rate (CER) at 4.95%, slightly outperforming Whisper LargeV3Turbo (4.97%)
The smaller E4B-it-Q4_K_M variant delivered competitive accuracy with minimal runtime overhead
E2B-it-Q8_0 performed poorly despite its larger size, suggesting potential quantization issues
Real-time factor (RTF) measurements showed Gemma 4 variants generally processed audio faster than Whisper LargeV3Turbo
Model load times varied significantly, with BF16 variants requiring substantially longer to initialize

The developer ultimately selected E4B-it-Q4_K_M as the default variant due to its optimal balance of accuracy, speed, and disk footprint (~5.9 GiB). This decision reflects the practical needs of typical desktop users who prioritize responsiveness and storage efficiency.

Looking Ahead: Local AI Meets Desktop Productivity

The successful integration of Gemma 4 into Parlotype demonstrates that sophisticated AI capabilities can thrive in offline environments without sacrificing performance. By giving users control over their speech recognition engine while maintaining strict data privacy, this approach addresses key concerns in modern desktop applications.

Future enhancements may include additional model variants, improved hotword detection, and expanded language support. As local AI models continue to evolve, developers building privacy-focused desktop applications will have more powerful tools at their disposal—without compromising speed or usability.

AI summary

Windows .NET masaüstü uygulamalarında yerel ses tanımada performans, disk alanı ve hız arasındaki dengeyi kurmak için Gemma 4'ün beş varyantını karşılaştırın.

Optimizing Gemma 4 for offline Windows speech recognition in .NET apps

Why Offline Speech Recognition Matters for Desktop Apps

Choosing the Right Runtime: Why llama-server Won

Benchmarking Five Gemma 4 Variants: Trade-offs Revealed

Looking Ahead: Local AI Meets Desktop Productivity

Comments

Debugging AI agents: why the root cause often lies upstream

How fixing Webpack's doc-kit led to contributions in Node.js core

How to prevent race conditions in Next.js optimistic UI updates