A new Windows desktop application is demonstrating how small yet powerful AI models can enable fully offline speech recognition without compromising performance. Built with .NET 10 and Avalonia UI, Parlotype lets users activate voice-to-text via a global hotkey, with all processing handled locally—no audio ever leaves the machine. The app recently integrated Google's Gemma 4, released in April 2026, as an alternative to the existing Whisper.net pipeline, giving users a choice between engines while maintaining consistent audio capture and text injection workflows.
Why Offline Speech Recognition Matters for Desktop Apps
Most voice-to-text solutions depend on cloud services, sending audio to remote servers for transcription. While convenient, this approach raises privacy concerns and requires reliable internet connectivity. Parlotype takes a different route by performing all speech recognition locally, ensuring complete data privacy regardless of the underlying model. The developer initially relied on Whisper.net for its robust performance on clean English dictation, but noticed performance drops with conversational or noisy audio.
Google's Gemma 4 introduced a conformer audio encoder that achieves 4.17% word error rate (WER) on LibriSpeech-test-clean benchmarks—performance comparable to much larger Whisper variants. For typical desktop dictation scenarios, where users speak clearly into focused text fields, Gemma 4's architecture offers a compelling alternative. The app now lets users toggle between Whisper and Gemma 4 engines in settings, with all audio processing handled through local pipelines including WASAPI capture, Silero voice activity detection (VAD), and direct text injection.
Choosing the Right Runtime: Why llama-server Won
Selecting an inference runtime for Gemma 4 required balancing several constraints: local processing, Windows desktop compatibility, single-installer deployment, cross-vendor GPU support, and avoiding Python dependencies. The developer evaluated multiple options before settling on llama-server, the HTTP server from llama.cpp.
Several alternatives fell short:
onnxruntime-genailacked support for Gemma 4's architecture, particularly its per-layer embeddings and variable head dimensions (tracking issue #2062)- Python sidecar solutions introduced unnecessary dependencies like Python runtime and CUDA, complicating the installer for non-technical users
- LLamaSharp's P/Invoke bindings required recompilation when switching between Vulkan and CUDA builds
- Ollama didn't support Gemma audio integration at the time of evaluation
- Lemonade was limited to AMD GPUs only
llama-server provided a unified solution with pre-built Vulkan and CUDA binaries for Windows, a stable OpenAI-compatible HTTP API at /v1/chat/completions with audio input support, and a release cadence that could be managed through in-app updates. This approach satisfied all deployment requirements while maintaining flexibility for future updates.
Benchmarking Five Gemma 4 Variants: Trade-offs Revealed
The Gemma 4 GGUF repository offers five quantization variants, each with distinct performance characteristics. The developer conducted comprehensive benchmarking against Whisper variants (Small, Medium, LargeV3Turbo) using 50 samples from LibriSpeech's test-other dataset—the more challenging English split. All tests ran on the same machine with CUDA acceleration and consistent warm-up methodology.
The results revealed surprising trade-offs between accuracy, speed, and model size:
Rank Engine Model WER % CER % RTF Load time (s)
1 Whisper (CUDA) LargeV3Turbo 11.48 4.97 0.055 1.31
2 Whisper (CUDA) Medium 12.18 5.41 0.073 1.28
3 Whisper (CUDA) Small 13.10 5.87 0.034 0.71
4 Gemma 4 E2B-it-BF16 13.15 4.95 0.038 6.70
5 Gemma 4 E4B-it-Q4_K_M 13.82 5.80 0.038 6.73
6 Gemma 4 E4B-it-BF16 14.20 5.40 0.038 6.72
7 Gemma 4 E4B-it-Q8_0 14.39 5.79 0.044 9.25
8 Gemma 4 E2B-it-Q8_0 19.22 8.95 0.315 6.74Key insights emerged from the data:
E2B-it-BF16achieved the lowest character error rate (CER) at 4.95%, slightly outperforming Whisper LargeV3Turbo (4.97%)- The smaller
E4B-it-Q4_K_Mvariant delivered competitive accuracy with minimal runtime overhead E2B-it-Q8_0performed poorly despite its larger size, suggesting potential quantization issues- Real-time factor (RTF) measurements showed Gemma 4 variants generally processed audio faster than Whisper LargeV3Turbo
- Model load times varied significantly, with BF16 variants requiring substantially longer to initialize
The developer ultimately selected E4B-it-Q4_K_M as the default variant due to its optimal balance of accuracy, speed, and disk footprint (~5.9 GiB). This decision reflects the practical needs of typical desktop users who prioritize responsiveness and storage efficiency.
Looking Ahead: Local AI Meets Desktop Productivity
The successful integration of Gemma 4 into Parlotype demonstrates that sophisticated AI capabilities can thrive in offline environments without sacrificing performance. By giving users control over their speech recognition engine while maintaining strict data privacy, this approach addresses key concerns in modern desktop applications.
Future enhancements may include additional model variants, improved hotword detection, and expanded language support. As local AI models continue to evolve, developers building privacy-focused desktop applications will have more powerful tools at their disposal—without compromising speed or usability.
AI summary
Windows .NET masaüstü uygulamalarında yerel ses tanımada performans, disk alanı ve hız arasındaki dengeyi kurmak için Gemma 4'ün beş varyantını karşılaştırın.