How to Make Gemma 4 Run up to 2x Faster Locally: Multi-Token Prediction (MTP) + QAT

Read time: ~8 minutes. What you’ll learn: the two independent levers that make local Gemma 4 faster — QAT (4-bit weights trained for it, ~72% less memory) and MTP / multi-token prediction (a self-speculative drafter that roughly doubles decode speed with no quality loss). How to turn MTP on in Ollama, mainline llama.cpp (merged June 7, 2026), and Transformers, how to stack it with a QAT GGUF, and the honest speedup numbers per hardware. Every command and figure below is from the merged llama.cpp PR, Google’s MTP docs, and Unsloth’s QAT docs — with third-party benchmarks clearly marked as such.

If you already have Gemma 4 running locally — and if not, start with how to run Gemma 4 12B locally — there are two separate ways to make it faster without touching quality. Most write-ups blur them together. They are not the same thing, they solve different problems, and you can use both at once.

This guide is the speed layer on top of the basic setup. It exists now because the bigger of the two levers, multi-token prediction, landed in mainline llama.cpp on June 7, 2026 (Source: llama.cpp PR #23398) — so for the first time you can get it without a fork.


1. The two levers, and why people keep confusing them

When someone says “I got Gemma 4 twice as fast,” they’re usually doing one of two unrelated things:

  • QAT (Quantization-Aware Training) shrinks the weights. The model is trained to survive 4-bit compression, so the 4-bit version keeps near-original quality instead of degrading like a normal post-hoc quant. This buys you less memory (and the smaller footprint is incidentally a bit faster to move around). It does not change how many tokens you generate per forward pass.
  • MTP (Multi-Token Prediction) shrinks the number of forward passes. A small drafter predicts several tokens ahead; the full model verifies them in parallel and accepts the ones it agrees with. This buys you higher decode throughput — more tokens per second — and the verification step is lossless, so output is identical to normal decoding.

The one-line version: QAT makes the model fit; MTP makes it spit. They’re orthogonal. Stack both and you get a small, fast model that still answers like the full one.

The rest of this guide does QAT first (it’s the foundation everyone should already be on), then MTP (the new part), then how to run them together.


2. Lever 1 — QAT: smaller weights, near-zero quality cost

Google ships QAT checkpoints for Gemma 4, and Unsloth packages them as GGUFs. The headline: ~72% lower memory for the 4-bit format with near-original performance, because the model was trained for 4-bit rather than squeezed into it afterward. (Source: Unsloth QAT docs)

2.1 The quant to use

Unsloth uploads these as UD-Q4_K_XL specifically. Their note is worth internalizing: “precisions higher than the uploaded UD-Q4_K_XL version degrade accuracy rather [than] improve it” — so bigger is not better here. (Source: Unsloth QAT docs)

2.2 Memory per size

This is the table that decides what you can run:

Gemma 4 QAT modelRAM/VRAM neededHF repo
E2B3 GBunsloth/gemma-4-E2B-it-qat-GGUF
E4B5 GBunsloth/gemma-4-E4B-it-qat-GGUF
12B7 GBunsloth/gemma-4-12b-it-qat-GGUF
26B-A4B (MoE)15 GBunsloth/gemma-4-26B-A4B-it-qat-GGUF
31B18 GBunsloth/gemma-4-31B-it-qat-GGUF

(Source: Unsloth QAT docs.) That’s the 7 GB for the 12B that people keep quoting — it’s the QAT number, not a regular quant.

2.3 Run a QAT GGUF

Same llama.cpp invocation as a normal GGUF, just pointed at the QAT repo, with Gemma 4’s required sampling settings:

./llama.cpp/llama-cli \
    -hf unsloth/gemma-4-12b-it-qat-GGUF:UD-Q4_K_XL \
    --temp 1.0 --top-p 0.95 --top-k 64

The sampling trio — temperature 1.0, top-p 0.95, top-k 64 — is Gemma 4’s recommended decoding and is not the default on most runtimes. Set all three or the model looks worse than it is. (Source: Unsloth QAT docs)

At this point you have the small-and-accurate base. Now make it fast.


3. Lever 2 — MTP: how multi-token prediction actually works

Normal decoding is sequential: one forward pass produces one token, then you do it again. That’s the bottleneck. MTP breaks it with a drafter.

Here’s the mechanism, from Google’s own description: a lightweight drafter proposes several tokens, and “if the target model agrees with the draft, it accepts the entire sequence in a single forward pass — and even generates an additional token of its own in the process.” Crucially, the drafters “seamlessly utilize the target model’s activations and share its KV cache, meaning they don’t have to waste time recalculating context the larger model has already figured out.” (Source: Google MTP overview)

Two consequences matter for you:

  1. It’s lossless. The big model still verifies every token, so the output is byte-for-byte what you’d get without MTP. This is not a quality/speed tradeoff.
  2. The speedup depends on the acceptance rate. When the drafter guesses right, you skip work; when it guesses wrong, you fall back to normal. Predictable text (chat, prose) accepts more; high-entropy text (dense code) accepts less.

Google’s official framework support for MTP is Transformers, MLX, vLLM, SGLang, Ollama, and AI Edge (Source: Google MTP overview). Note llama.cpp is not on Google’s list — because llama.cpp added it independently, which brings us to the three practical paths below.


4. MTP path A — Ollama (easiest, Mac-first)

If you want the least friction, Ollama added Gemma 4 MTP speculative decoding in v0.23.1, on the MLX runner first (so Apple Silicon Macs get it earliest), with broader runner support being validated. (Source: Ollama release notes / PR #15980)

Pull the model normally:

ollama pull gemma4           # default 31B dense flagship
ollama pull gemma4:e4b       # on-device 4B
ollama pull gemma4:e2b       # on-device 2B

To wire up a drafter, Ollama added an MTP workflow: import a safetensors-based Gemma 4 draft model with ollama create, point a DRAFT directive in your Modelfile at it, and optionally quantize the drafter with --quantize-draft on ollama create. (Source: Ollama PR #15980)

If you’re on an Apple Silicon Mac, this is your path — MLX got MTP first and Ollama’s unified-memory story makes the 4-bit + drafter combination painless. On Intel Macs without a usable GPU, neither lever will give you real interactive speed; MTP helps throughput but you’re still CPU-bound.


5. MTP path B — mainline llama.cpp (merged June 7, 2026)

This is the new part. MTP for Gemma 4 was merged into mainline ggml-org/llama.cpp on June 7, 2026 (PR #23398 by am17an), after the foundational MTP-head support in PR #22673. No fork required anymore. (Source: llama.cpp PR #23398)

The basic invocation, straight from the PR:

llama-server -hf am17an/Gemma4-31B-it-GGUF \
    --spec-type draft-mtp \
    --spec-draft-n-max 4

--spec-type draft-mtp turns on the MTP drafter; --spec-draft-n-max 4 caps how many tokens it drafts per step. For a multi-GPU box you’ll also want to place the drafter explicitly:

llama-server -hf am17an/Gemma4-31B-it-GGUF \
    --spec-type draft-mtp --spec-draft-n-max 4 \
    --spec-draft-device <device> -sm layer

(Source: llama.cpp PR #23398.)

Two limits from the PR itself, so you don’t waste time:

  • It works for the 31B and 26B-A4B models; E4B/E2B are not yet supported via this path.
  • A bug with quantized KV caches (Q8_0) was found and fixed during review — so update to a build that includes the merge, don’t run a stale checkout.

The author verified correctness by replicating “AIME-26 (~87%) results as advertised by the Gemma team” with MTP on — i.e. the speedup didn’t cost accuracy. (Source: llama.cpp PR #23398)


6. MTP path C — Transformers (the official reference)

If you want the canonical implementation — or you’re already on the Transformers stack for Gemma 4’s audio modality — Google documents MTP directly. You load the target model and its matching -assistant drafter, then pass the drafter into generate:

from transformers import AutoProcessor, AutoModelForCausalLM
import torch

TARGET_MODEL_ID = "google/gemma-4-E2B-it"
ASSISTANT_MODEL_ID = TARGET_MODEL_ID + "-assistant"

processor = AutoProcessor.from_pretrained(TARGET_MODEL_ID)
target_model = AutoModelForCausalLM.from_pretrained(
    TARGET_MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto",
)
assistant_model = AutoModelForCausalLM.from_pretrained(
    ASSISTANT_MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto",
)

Then generate with the assistant attached:

outputs = target_model.generate(
    **inputs,
    assistant_model=assistant_model,
    max_new_tokens=256,
    do_sample=False,
)

You can tune how aggressively it drafts:

assistant_model.generation_config.num_assistant_tokens = 4
assistant_model.generation_config.num_assistant_tokens_schedule = "heuristic"

The heuristic schedule adapts automatically — it drafts +2 more tokens when all are accepted and −1 when it sees rejections. (Source: Google MTP docs.) Note that the Transformers path documents drafters for E2B, E4B, 31B, and 26B-A4B, each needing its own -assistant variant.


7. Stacking both: a QAT GGUF and MTP

The point of separating the levers is that you combine them. On llama.cpp that means a QAT-quantized target plus the MTP drafter:

  • Target: the QAT GGUF from §2 (small, accurate, ~7 GB for 12B-class weights).
  • Drafter: the MTP draft head via --spec-type draft-mtp.

You get the memory savings of QAT and the throughput of MTP at the same time — a 4-bit model that still answers like the full one, generating multiple tokens per pass. That’s the “twice as fast on a single GPU, quality barely moved” result builders have been posting this week.

Worth flagging on sourcing: the eye-catching numbers floating around — e.g. “127 t/s from a 7GB file” (Fahd Mirza on X, user benchmark) and “2.6–2.98x lossless speedup” (the ik_llama.cpp bench harness, a fork, not mainline) — are third-party measurements on specific hardware, not official figures. Treat them as plausible upper-ish bounds, not promises.


8. What speedup to actually expect

Here’s the honest range, with each number attributed:

SourceSpeedupContext
Google (official)up to 3x, losslessbest case; 26B MoE on an RTX PRO 6000
Google (official)1.8–3x typicalacross supported hardware
Most developer hardware~1.7–2.2xthe realistic everyday number
am17an (mainline PR)>2x on densecontributor’s own system; MoE gained less
ik_llama.cpp fork bench2.6–2.98xthird-party harness, specific setup

(Sources: Google MTP overview; llama.cpp PR #23398; karany97 bench.)

The single biggest variable is acceptance rate (~80% in Google’s framing), and that depends on your workload: conversational and prose tasks accept more drafted tokens than dense code, so a chatbot will see a bigger lift than a code agent. Hardware matters too — the 3x headline needs a high-end GPU and the MoE model.


9. Benchmark it on your own hardware

Every number in §8 is someone else’s machine. The advice that closes this guide — don’t trust a t/s figure you didn’t produce — only works if you actually produce one. Here’s the minimal A/B.

The method: run the same prompt twice on the same build, once without MTP and once with it, and compare the tokens-per-second llama.cpp reports.

Baseline (no drafter):

llama-server -hf am17an/Gemma4-31B-it-GGUF
# ...send one fixed prompt, note the eval tokens/sec

With MTP:

llama-server -hf am17an/Gemma4-31B-it-GGUF \
    --spec-type draft-mtp --spec-draft-n-max 4
# ...send the *same* prompt, compare

llama.cpp prints timing at the end of each generation — the eval time / tokens-per-second line is the one to read. The ratio between the two runs is your real speedup, on your hardware, for that workload.

Make the comparison fair:

  • Same prompt, same length. Acceptance rate is workload-dependent (§8), so a chat prompt and a code prompt give different ratios. Test the kind of work you actually do.
  • Warm, not cold. The first run pays model-load and cache-warm costs. Discard it; measure the second.
  • Tune --spec-draft-n-max. It caps tokens drafted per step (§5). Higher can help on predictable text and hurt on high-entropy text, because rejected drafts are wasted work. Try 3–5 and keep what wins on your prompt.

If the MTP run isn’t faster, the drafter isn’t earning its keep on that workload — that’s a real result, not a failure. MTP is lossless either way (§3), so there’s no quality downside to leaving it on; the only question is whether your acceptance rate is high enough to net a win.


10. The catch

  • Size support is uneven by path. Mainline llama.cpp MTP covers 31B and 26B-A4B, not E4B/E2B (§5). The Transformers reference covers E2B/E4B/31B/26B-A4B (§6). If you’re on the small edge models, use Transformers/MLX, not llama.cpp, for MTP today.
  • Update your build. The Q8_0 KV-cache bug was fixed in the merge — a checkout from before June 7 either won’t have MTP or may hit it (§5).
  • QAT ≠ MTP. Running a QAT GGUF alone gives you memory savings, not the throughput jump. If you only did §2, you haven’t turned on the fast part.
  • Speed is workload-dependent. Don’t quote yourself the 3x number for a code agent; budget for ~1.7–2.2x on typical hardware (§8).
  • Lossless means lossless. If MTP changes your outputs, something is misconfigured (wrong drafter, mismatched build) — correct MTP is verified by the full model and shouldn’t alter results.

11. Where to go next

The takeaway: these are two free, stackable wins. QAT is the floor everyone should be on; MTP is the new ceiling, and as of June 7 you no longer need a fork to reach it on llama.cpp. Turn on both, measure on your workload, and don’t trust any single t/s number you didn’t produce yourself.

Sources