How to Run Gemma 4 12B Locally: Ollama, llama.cpp & Transformers (Text, Image, Audio)
Read time: ~9 minutes. What you’ll learn: three ways to run Gemma 4 12B on your own machine — Ollama (fastest), llama.cpp with a 4-bit GGUF (the 16GB sweet spot), and Transformers (full multimodal: text + image + audio). Plus the sampling settings that actually matter, an OpenAI-compatible server option, and how to choose. Every command is taken verbatim from Google’s developer guide, the official Hugging Face model card, and Unsloth’s run docs — copy-paste safe.
Google released Gemma 4 12B on June 3, 2026 — and the headline for builders isn’t the parameter count, it’s that a genuinely multimodal model (text, images, and native audio) fits on a 16GB laptop under an Apache 2.0 license. For the why-it-matters story — the encoder-free architecture and how the 12B nearly matches the 26B — see the Gemma 4 12B release breakdown. This guide is the hands-on part: getting it running on hardware you already own.
1. What you’re setting up
Gemma 4 12B is a 12-billion-parameter open-weight model with three properties that change how you deploy it locally:
- Multimodal in one model: it accepts text, images, and native audio — no separate vision or audio service to stand up.
- Encoder-free: instead of bolting a CLIP-style vision encoder and a separate audio encoder onto the language model, Gemma 4 folds both directly into the transformer. That’s why a model that reads images and listens to audio still fits in 16GB. (Source: Google, 2026-06-03)
- 256K-token context for the 12B variant. (Source: Hugging Face model card, 2026-06-03)
License is Apache 2.0, so you can run it and ship it commercially with no asterisk. Two checkpoints matter:
- google/gemma-4-12B: base (pre-trained).
- google/gemma-4-12B-it: instruction-tuned. This is the one you want for chat, extraction, and following instructions.
2. Hardware you need
The full BF16 weights are ~24GB, but you almost never run those locally. The practical path is a 4-bit quantization, which is what makes the 16GB claim real:
- 16GB RAM/VRAM (laptops, most GPUs): use the dynamic 4-bit quant. Unsloth’s UD-Q4_K_XL needs roughly 7–8GB for the weights, leaving room for context and images. (Source: Unsloth docs, 2026-06-03)
- Unified memory (Apple Silicon): a 16GB M-series Mac runs the 4-bit GGUF or MLX build comfortably.
- 24GB+ GPU: you can run higher-precision quants or the full Transformers stack for the cleanest multimodal behavior.
Total memory is the weights plus the KV cache, which grows with every token in context. Images and long audio expand into many tokens, so leave headroom beyond the 7–8GB weight footprint if you’re feeding large images or 256K-context prompts.
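The weight figures above follow from simple arithmetic: parameter count times bits per weight. Here is a rough sketch. The 4.5 to 5 effective bits per weight used for the dynamic 4-bit quant is an illustrative assumption, not a number from the Unsloth docs, and the estimate ignores the KV cache and activations entirely.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 12e9  # 12 billion parameters

bf16 = weight_footprint_gb(PARAMS, 16)
# Assumption: a "4-bit" dynamic quant averages a bit over 4 bits per weight,
# because some tensors are kept at higher precision.
q4_low = weight_footprint_gb(PARAMS, 4.5)
q4_high = weight_footprint_gb(PARAMS, 5.0)

print(f"BF16 weights: ~{bf16:.0f} GB")
print(f"4-bit quant weights: ~{q4_low:.2f} to {q4_high:.2f} GB")
```

Under those assumptions the estimate lands close to the ~24GB and 7–8GB figures quoted above. Whatever remains of your 16GB is the budget for context, images, and audio.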
3. Option A — Ollama (fastest start)
If you just want it running in one command, Ollama pulls the official build:
ollama run gemma4:12b
That’s the official library tag. Other sizes exist (gemma4:e2b, gemma4:e4b, gemma4:26b, gemma4:31b) — 12b is the workstation sweet spot. (Source: Ollama library, 2026-06-03)
If you want Unsloth’s dynamic 4-bit quant specifically (the one tuned for 16GB), pull it straight from Hugging Face through Ollama:
ollama run hf.co/unsloth/gemma-4-12B-it-GGUF:UD-Q4_K_XL
(Source: Unsloth docs, 2026-06-03)
Ollama’s Gemma 4 build supports text and image input — drag an image path into the prompt or pass it via the API. (Source: Ollama library, 2026-06-03) For audio, use the Transformers path in §5, which exposes the native audio modality directly.
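To pass an image through the API rather than the interactive prompt, one option is Ollama's /api/generate endpoint, which takes base64-encoded images in an images list. This is a minimal sketch that assumes Ollama is listening on its default local address and that gemma4:12b is already pulled. The helper names build_image_request and ask_about_image are made up for this example.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_image_request(prompt: str, image_bytes: bytes, model: str = "gemma4:12b") -> dict:
    """Build the JSON body: images are sent as base64 strings."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for a single JSON object instead of a stream
    }


def ask_about_image(prompt: str, image_path: str) -> str:
    """Send one image plus a question to a running Ollama instance."""
    with open(image_path, "rb") as f:
        payload = build_image_request(prompt, f.read())
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, a call looks like ask_about_image("What is shown in this image?", "GoldenGate.png"), where the file name is whatever local image you want described.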
Native audio input is the feature most local runtimes are still catching up on. If audio is your use case, jump to §5 and don’t assume every GGUF runtime wires it up yet.
4. Option B — llama.cpp with a 4-bit GGUF (the 16GB recipe)
This is the path that maps most directly onto the “runs on 16GB” promise, and it gives you an OpenAI-compatible server you can point any client at.
4.1 One-shot CLI
export LLAMA_CACHE="unsloth/gemma-4-12B-it-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/gemma-4-12B-it-GGUF:UD-Q4_K_XL \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 64
The -hf flag pulls the GGUF from Hugging Face on first run and caches it. UD-Q4_K_XL is the dynamic 4-bit quant recommended for 16GB. (Source: Unsloth docs, 2026-06-03)
4.2 OpenAI-compatible server
./llama.cpp/llama-server \
    --model unsloth/gemma-4-12B-it-GGUF/gemma-4-12B-it-UD-Q4_K_XL.gguf \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 64
That starts a local server you can hit with the standard OpenAI chat-completions shape. If you want the deeper story on llama.cpp’s server — including built-in tool calling — see llama.cpp’s built-in tools.
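As a rough illustration of that request shape, here is a standard-library client. It assumes llama-server is on its default address (localhost, port 8080). The model value is a placeholder label, and top_k is not part of the OpenAI schema, so treat its acceptance in the request body as an assumption to verify against your llama.cpp build. The helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(user_text: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completions body with Gemma 4's sampling settings."""
    return {
        "model": "gemma-4-12B-it",  # placeholder label for a single-model server
        "messages": [{"role": "user", "content": user_text}],
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 64,  # assumption: extra sampling field outside the OpenAI schema
        "max_tokens": max_tokens,
    }


def chat(user_text: str, base_url: str = "http://localhost:8080/v1") -> str:
    """POST one user message and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_request(user_text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```

Because base_url is a parameter, the same function can be pointed at any other OpenAI-compatible endpoint by changing that one argument.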
4.3 The sampling settings are not optional
Gemma 4’s recommended decoding is temperature 1.0, top-p 0.95, top-k 64. (Source: Unsloth docs, 2026-06-03) These aren’t the defaults on most runtimes — if you leave temperature at 0.7 or skip top-k, you’ll get noticeably worse output and may blame the model. Set all three.
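One way to avoid forgetting a setting is to define the trio once and adapt it per runtime. The sketch below covers two cases: keyword arguments for Transformers' generate (which only samples when do_sample=True) and the options object used in Ollama API requests. The constant and helper names are made up for illustration.

```python
# Recommended decoding settings for Gemma 4, as quoted above.
GEMMA4_SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}


def transformers_generate_kwargs(max_new_tokens: int = 1024) -> dict:
    """Keyword arguments for model.generate() in Transformers."""
    # Without do_sample=True, generate() decodes greedily and ignores
    # temperature, top_p, and top_k.
    return {"do_sample": True, "max_new_tokens": max_new_tokens, **GEMMA4_SAMPLING}


def ollama_options() -> dict:
    """Fragment to merge into an Ollama API request body."""
    # Ollama reads sampling parameters from the "options" key.
    return {"options": dict(GEMMA4_SAMPLING)}
```

In the Transformers path this would be used as model.generate(**inputs, **transformers_generate_kwargs()). In an Ollama request it would be merged into the JSON body alongside model and prompt.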
5. Option C — Transformers (full multimodal: image + audio)
This is the path that exposes everything Gemma 4 can do, including native audio — the modality the lighter runtimes don’t all support yet.
5.1 Install
pip install -U transformers torch accelerate
For audio (and video-as-frames) you also need:
pip install -U transformers torch torchvision librosa accelerate
(Source: Hugging Face model card, 2026-06-03)
5.2 Load the model
from transformers import AutoProcessor, AutoModelForMultimodalLM
MODEL_ID = "google/gemma-4-12B-it"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto"
)
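device_map="auto" places the model on whatever hardware is available: GPU first, spilling to CPU RAM if it doesn’t fit (this relies on the accelerate package installed in §5.1). dtype="auto" loads the weights in the precision stored in the checkpoint instead of upcasting them to float32.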
5.3 Text generation
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short joke about saving RAM."},
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=False
).to(model.device)
input_len = inputs["input_ids"].shape[-1]
outputs = model.generate(**inputs, max_new_tokens=1024)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
processor.parse_response(response)
Two Gemma-4 specifics worth noting: enable_thinking=False turns off the model’s reasoning trace (set it to True when you want chain-of-thought), and processor.parse_response(...) cleans up the raw decoded string into the final answer. (Source: Hugging Face model card, 2026-06-03)
5.4 Pass an image (place it before the text)
messages = [
    {
        "role": "user", "content": [
            {"type": "image", "url": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/apps/sample-data/GoldenGate.png"},
            {"type": "text", "text": "What is shown in this image?"}
        ]
    }
]
The image goes first in the content list, the question after. Run it through the same apply_chat_template → generate → parse_response flow from §5.3.
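Since the same flow is reused for every modality, it can help to wrap it in a function. The sketch below repackages the calls from §5.3 unchanged. run_chat is a hypothetical helper name, and model and processor are the objects loaded in §5.2.

```python
def run_chat(model, processor, messages, max_new_tokens=1024, enable_thinking=False):
    """Apply the chat template, generate, and parse, mirroring the flow in section 5.3."""
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        add_generation_prompt=True,
        enable_thinking=enable_thinking,
    ).to(model.device)
    # Remember the prompt length so only newly generated tokens are decoded.
    input_len = inputs["input_ids"].shape[-1]
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
    return processor.parse_response(response)
```

With the image messages defined above, run_chat(model, processor, messages) returns whatever parse_response produces for the model's answer.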
5.5 Pass audio (place it after the text)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
            {"type": "audio", "audio": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/apps/sample-data/journal1.wav"},
        ]
    }
]
Note the ordering flips for audio: text first, audio after. This is straight from the official model card — getting the order wrong is a common reason multimodal prompts misbehave. (Source: Hugging Face model card, 2026-06-03)
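Encoding the ordering rule in a small builder keeps it from being mistyped by hand. This is a sketch with a hypothetical helper, build_user_message, which emits the same dictionary shapes as the examples in §5.4 and §5.5. Combining an image and audio in one message is not shown in the sources, so treat that case as untested.

```python
from typing import Optional


def build_user_message(
    text: str,
    image_url: Optional[str] = None,
    audio_url: Optional[str] = None,
) -> dict:
    """Build a user turn with image before text and audio after text."""
    content = []
    if image_url is not None:
        content.append({"type": "image", "url": image_url})  # image goes first
    content.append({"type": "text", "text": text})
    if audio_url is not None:
        content.append({"type": "audio", "audio": audio_url})  # audio goes last
    return {"role": "user", "content": content}
```

A messages list for transcription is then [build_user_message("Transcribe the following speech segment.", audio_url=...)], with the audio URL or path you want to process.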
The encoder-free design is what makes this work without a separate speech model: raw 16kHz audio is projected directly into the token space, so transcription, audio Q&A, and analysis all run through the same generate call as text and images.
6. Optional — LiteRT-LM as an OpenAI-compatible server
Google ships a LiteRT-LM path for on-device serving with an OpenAI-compatible API:
litert-lm import --from-huggingface-repo=litert-community/gemma-4-12B-it-litert-lm gemma-4-12B-it.litertlm gemma4-12b
litert-lm serve
litert-lm serve exposes an OpenAI-compatible endpoint, so existing OpenAI-SDK code points at it with just a base-URL change. (Source: Google developer guide, 2026-06-03)
7. Which path should you pick?
| Your situation | Use | Why |
|---|---|---|
| “Just run it now” | Ollama (gemma4:12b) | One command, text + image |
| 16GB laptop, want a server | llama.cpp + UD-Q4_K_XL | Smallest footprint, OpenAI-compatible |
| Need native audio | Transformers | Only path that exposes the audio modality cleanly |
| Apple Silicon | Ollama or MLX build | Unified memory runs 4-bit comfortably |
| Existing OpenAI-SDK app | llama.cpp server or LiteRT-LM | Drop-in base-URL swap |
8. Gotchas
- Use the -it checkpoint (google/gemma-4-12B-it) for anything instruction-following. The base google/gemma-4-12B is for fine-tuning, not chat.
- Set the sampling trio: temperature 1.0, top-p 0.95, top-k 64. Default decoding settings will make the model look worse than it is.
- Image-before-text, text-before-audio. The content ordering differs by modality (§5.4 vs §5.5).
- Audio support isn’t universal yet across GGUF runtimes — if audio is core to your use case, validate on the Transformers path before committing to a llama.cpp/Ollama setup.
- Budget memory for tokens, not just weights. 7–8GB is the weight footprint; large images and 256K-context prompts add to that.
9. Where to go next
- For the architecture and benchmark story — why the 12B nearly matches the 26B — read the Gemma 4 12B release breakdown.
- Building a local document/vision pipeline? The same Transformers toolchain powers running NuExtract 3 locally for structured JSON extraction.
- Want a local coding baseline to compare against? See running Qwen 3.6 locally for coding.
- Serving with tools? llama.cpp’s built-in tool calling covers the inference layer.
The win with Gemma 4 12B isn’t that it’s a frontier model — it isn’t. It’s that a text-image-audio model now runs on the laptop you already own, under a license that lets you ship it. Check it against your real tasks before retiring a hosted setup, but for local multimodal, this is the new baseline.
Sources
- Gemma 4 12B: The Developer Guide — Google Developers Blog, 2026-06-03
- google/gemma-4-12B-it model card — Hugging Face, 2026-06-03
- Gemma 4 — How to Run Locally — Unsloth Documentation, 2026-06-03
- Introducing Gemma 4 12B — Google (The Keyword), 2026-06-03