Run Qwen3.6-35B-A3B Locally for Coding: llama.cpp, Quants & VRAM
Read time: ~8 minutes. What you’ll learn: why Qwen3.6-35B-A3B is the model the local-LLM crowd won’t stop talking about, exactly which quantization to run on your hardware, the copy-paste
llama.cppcommands, Mac (M1/M2/M3) notes, and how to point a coding agent at it. All commands verified against Unsloth’s official GGUF docs; all benchmarks flagged as self-reported.
For two years the deal with local code models was: you either run something small and dumb, or you rent a frontier model in the cloud. Qwen3.6-35B-A3B is the release that’s making r/LocalLLaMA reconsider — it posts near-frontier coding numbers while activating only 3B parameters per token, which means it actually fits and runs fast on hardware you already own.
This is a hands-on guide to running it locally for coding. A note on sourcing up front: the benchmark numbers below are Qwen’s own self-reported figures (verify on your repo before trusting them), and every command is taken from Unsloth’s official GGUF docs — copy-paste safe.
1. What it actually is
Qwen3.6-35B-A3B shipped April 16, 2026 under the Apache 2.0 license (fully open weights). The name encodes the trick:
- 35B total parameters, but only ~3B active per token (A3B). It’s a sparse Mixture-of-Experts — 256 experts, 8 routed + 1 shared active. You pay 35B in disk/RAM but 3B in compute, so inference is fast.
- Architecture mixes Gated DeltaNet (linear attention) + Gated Attention + sparse MoE — the linear-attention part is what keeps long context cheap.
- 256K context native, extendable to ~1M via YaRN.
- Multimodal — handles text and images.
The coding numbers (Qwen’s self-reported, per the official blog):
| Benchmark | Qwen3.6-35B-A3B |
|---|---|
| SWE-bench Verified | 73.4% |
| SWE-bench Pro | 49.5% |
| Terminal-Bench 2.0 | 51.5% |
| AIME 2026 | 92.6% |
For context on the “small active params” claim: Qwen reports it beating Gemma 4-31B (a dense model that fires all 31B every step) on SWE-bench, 73.4% vs 52.0%. Treat all of these as “competitive, verify on your stack” — not gospel. The point isn’t the leaderboard; it’s that a model this capable runs on consumer hardware at all.
2. Pick your quant (the VRAM math)
This is the decision that determines whether it runs on your machine. Memory needed by quantization (from Unsloth’s GGUF release):
| Quant | Memory | Runs on |
|---|---|---|
| 3-bit | 17 GB | 16–24GB GPU, 24GB Mac |
| 4-bit (UD-Q4_K_XL) | 23 GB | 24GB GPU (4090), 32GB Mac — recommended balance |
| 6-bit | 30 GB | 32GB+ GPU, 36GB Mac |
| 8-bit | 38 GB | 48GB GPU, 48GB Mac |
| BF16 (full) | 70 GB | 80GB GPU / 96GB+ Mac |
The sweet spot is 4-bit UD-Q4_K_XL at ~23GB — it fits a single RTX 4090 (24GB) or a 32GB Apple Silicon machine, with minimal quality loss. If you’re tight, 3-bit at 17GB squeezes onto a 24GB Mac with room for context. Don’t reach for BF16 unless you have a datacenter card; the quality delta over Q4 isn’t worth 3× the memory for local coding.
Remember the MoE math: it’s a 35B model on disk, but only 3B activate per token. So tokens/sec is closer to a 3B model than a 35B one — that’s the whole reason it’s viable locally.
3. Run it with llama.cpp
Build llama.cpp (Linux/CUDA shown; for Mac see §4):
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp
Download the recommended quant (and the multimodal projector if you want image input):
pip install huggingface_hub hf_transfer
hf download unsloth/Qwen3.6-35B-A3B-GGUF \
--local-dir unsloth/Qwen3.6-35B-A3B-GGUF \
--include "*UD-Q4_K_XL*" --include "*mmproj-F16*"
Run a server with an OpenAI-compatible endpoint (this is what your coding tools will talk to):
./llama.cpp/llama-server \
--model unsloth/Qwen3.6-35B-A3B-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 \
--ctx-size 16384 --port 8001
The endpoint is now at http://localhost:8001/v1.
Coding-specific settings: Qwen recommends
temperature=0.6, presence_penalty=0.0for code (vs 1.0 for open-ended chat). Lower temp = more deterministic edits.
4. On a Mac (M1/M2/M3)
Apple Silicon is genuinely good for this model because unified memory means the GPU can address all your RAM. Two routes:
- llama.cpp: build with
-DGGML_CUDA=OFF(Metal is the default backend). Samellama-servercommand as above. - MLX (often faster on Mac): Unsloth ships MLX quants in 3/4/6/8-bit. A 32GB M-series machine comfortably runs the 4-bit; a 24GB one runs 3-bit.
The r/LocalLLaMA reports of “battery-powered coding on an M1 Max” are real — the 3B active-param count is what makes that possible.
5. Wire it into a coding agent
Because llama-server exposes an OpenAI-compatible /v1 endpoint, any tool that lets you set a custom base URL can use it. Point your editor/agent at http://localhost:8001/v1 with any dummy API key, and select the served model. That covers most local-first coding setups.
Two speed/quality levers worth knowing:
- MTP (multi-token prediction / speculative decoding) gives a reported 1.4–2.2× speedup: add
--spec-type draft-mtp --spec-draft-n-max 2to the server command. - Thinking vs non-thinking: the model defaults to a reasoning mode. For fast, terse code edits you can disable it with
--chat-template-kwargs '{"enable_thinking":false}'.
If you want the model to act as an agent (read/write files, run shell), pair it with a harness — for instance, llama-server’s own built-in tools turn any local GGUF into a code-editing agent with --tools all. That combination — Qwen3.6 weights + llama.cpp tool execution — is a fully local coding agent with no cloud dependency.
6. Gotchas (field-reported)
- Avoid CUDA 13.2 — Unsloth flags it as causing gibberish output. Use a different CUDA build.
- Multimodal needs the
mmproj-F16.gguffile passed via--mmproj; without it you get text-only (which is fine for pure coding). - Don’t over-extend context. 256K is native; pushing toward 1M via YaRN costs memory and can degrade quality. For coding, 16K–32K is plenty and keeps you fast.
- Output length: Qwen recommends up to 32,768 output tokens for reasoning tasks — set
--n-predictaccordingly if you see truncation.
7. Local vs cloud — when each wins
Running Qwen3.6 locally makes sense when: you want zero per-token cost on high-volume work, your code can’t leave your machine (privacy/compliance), or you want offline/airplane coding. The honest tradeoff: a 4-bit local model won’t match a frontier cloud model on the hardest agentic tasks, and you’re spending VRAM you could use elsewhere.
If you’d rather call Alibaba’s frontier tier in the cloud instead of self-hosting, that’s the closed flagship — see our Qwen3.7-Max guide for that side of the lineup. Qwen3.6-35B-A3B (this post) is the open-weights, run-it-yourself option; Qwen3.7-Max is the hosted frontier one. Different tools for different jobs.
The takeaway
Qwen3.6-35B-A3B matters less for any single benchmark number and more for what it represents: a near-frontier coding model that fits in 23GB and runs at 3B-active speed. That’s the line crossing from “local models are a toy” to “I can actually code against this on my own hardware.” Pick the 4-bit quant, run it through llama.cpp, point your agent at the local endpoint, and you have a coding setup that costs nothing per token and never leaves your machine.
For the broader local-coding stack, see llama-server’s built-in agent tools, and for getting more out of any coding agent, using AI to write better code more slowly.
Sources
- Qwen3.6-35B-A3B on Hugging Face — official weights and model card
- Qwen official blog — specs and (self-reported) benchmarks
- Unsloth GGUF docs — local run commands, quant sizes, settings (commands verified here)
- Specs and benchmarks verified 2026-05-31; benchmark figures are Qwen’s own — verify on your workload.