Run Gemma 4 12B Locally on Windows with WSL2 and llama.cpp

Setting up large language models like Gemma 4 12B on a local machine can feel overwhelming, especially when your primary operating system is Windows. With the right tools, however, you can run the model efficiently using Windows Subsystem for Linux 2 (WSL2) and the lightweight inference engine llama.cpp. This step-by-step guide walks you through every stage—from preparing your WSL2 environment to launching the model with or without GPU acceleration.

Preparing Your WSL2 Environment for Local AI Inference

Before diving into model setup, ensure your WSL2 environment is current and optimized. Start by updating the package list and upgrading existing packages to avoid compatibility issues later:

sudo apt update && sudo apt upgrade -y

This command refreshes all system packages, including critical dependencies for building and running AI models. Keeping your environment updated reduces the risk of installation failures and improves overall system stability during model inference.

Installing Required Dependencies and GPU Toolkit

Next, install essential build tools and libraries needed to compile llama.cpp:

sudo apt install build-essential cmake git -y

If your system includes an NVIDIA GPU, you’ll benefit from GPU-accelerated inference. To enable CUDA support, install the NVIDIA CUDA toolkit:

sudo apt install nvidia-cuda-toolkit -y

Run nvidia-smi in your terminal to confirm GPU detection. If the toolkit detects your graphics card, you’re ready to leverage CUDA for faster model performance. Note that this step may take several minutes depending on your internet connection and system specifications.

Building llama.cpp for CPU and GPU Execution

With dependencies installed, clone the llama.cpp repository and build the inference engine. The process varies slightly depending on whether you plan to use GPU acceleration or rely solely on CPU processing.

Building with CUDA Support (Recommended for GPU)

To compile llama.cpp with CUDA enabled, use the following commands:

git clone 
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DLLAMA_OPENSSL=ON
cmake --build build --config Release

These commands fetch the latest version of llama.cpp, configure the build for CUDA support, and compile the engine in release mode. This configuration ensures optimal performance when running the model on systems equipped with compatible NVIDIA GPUs.

Building Without GPU Acceleration

If your system lacks a supported GPU or you prefer CPU-only execution, build llama.cpp without CUDA flags:

git clone 
cd llama.cpp
cmake -B build
cmake --build build --config Release

This streamlined build process focuses on CPU compatibility and produces a binary capable of running the model, though at reduced speed compared to GPU-accelerated setups.

Downloading and Running the Gemma 4 12B Model

Once llama.cpp is built, the next step is obtaining the model weights. Gemma 4 12B is available in a quantized format optimized for local inference, hosted on Hugging Face.

Optional: Pre-Downloading the Model File

To avoid runtime delays, you can download the model file in advance:

mkdir -p models
wget -O models/gemma-4-12b-it-UD-Q4_K_XL.gguf

This command creates a dedicated models directory and downloads the quantized Gemma 4 12B model file, which is significantly smaller than the original version and ideal for local execution.

Launching Inference via CLI

To interact with the model directly from the command line, use the llama-cli binary with the following command:

./build/bin/llama-cli -hf unsloth/gemma-4-12b-it-GGUF:UD-Q4_K_XL

This launches an interactive session where you can input prompts and receive real-time responses. The model processes input efficiently, delivering coherent and contextually relevant answers.

Running a Web-Based Interface for Easier Access

For those who prefer a graphical user interface, llama.cpp includes a built-in web server. Start the server with the following command to enable browser-based interactions:

./build/bin/llama-server -hf unsloth/gemma-4-12b-it-GGUF:UD-Q4_K_XL --port 8080

After launching, open your browser and navigate to ` to access the web UI. This interface simplifies model interaction, making it accessible even to users unfamiliar with command-line tools.

The combination of WSL2 and llama.cpp unlocks the potential to run advanced AI models locally on Windows without relying on cloud services. Whether you choose CLI or web-based access, this setup delivers flexibility, performance, and full control over your AI workflows. As open-source tools continue to evolve, local inference becomes more accessible, empowering developers and enthusiasts to experiment with cutting-edge models at their own pace.

AI summary

Windows Subsystem for Linux 2 (WSL2) kullanarak yerel bilgisayarınızda Gemma-4 12B modelini nasıl kurabileceğinizi ve çalıştırabileceğinizi adım adım öğrenin. Tüm bağımlılıklar ve GPU destekli kurulum dahil.