How to Run Kimi K2.7 Code Locally (and Whether You Actually Should)

Read time: ~7 minutes. What you’ll learn: what Kimi K2.7 Code is, the honest hardware reality of a 1-trillion-parameter model (spoiler: the smallest usable quant needs ~339 GB of combined memory), and three real ways to run or use it — llama.cpp + Unsloth GGUF for big-memory local rigs, the official vLLM / SGLang servers for datacenter GPUs, and the Moonshot API or GitHub Copilot path that’s the right answer for almost everyone — plus quant selection, realistic throughput, and when local actually makes sense.

Sourcing note: model specs, license, benchmarks, and the vLLM/SGLang commands are from the official moonshotai/Kimi-K2.7-Code model card. GGUF quant sizes, the llama.cpp command, and hardware guidance are from the Unsloth Kimi K2.7 Code docs — Unsloth’s GGUF is a third-party (but widely used and Moonshot-referenced) distribution, flagged as such. Benchmark numbers are Moonshot’s own. Links at the bottom.

Kimi K2.7 Code trended the day it went generally available in GitHub Copilot, and the first question every builder asks about an open-weight model is “can I run it myself?” For K2.7 the honest answer is: yes, but almost certainly not on the machine you’re reading this on. This is a 1-trillion-parameter model. Let’s be clear-eyed about what that means, then cover the three routes that actually work.


1. What Kimi K2.7 Code actually is

Kimi K2.7 Code is Moonshot AI’s agentic coding model — the coding-specialized member of the Kimi K2 line. The architecture (from the official model card):

  • 1T total parameters, 32B active — a Mixture-of-Experts model with 384 experts, 8 selected per token, 1 shared expert, 61 layers (one dense), and MLA attention.
  • 256K context window.
  • Thinking-only, with vision support.
  • License: Modified MIT — open weights, commercially usable (read the exact terms before shipping).

Versus the previous K2.6, Moonshot reports it “strengthens performance on real-world coding tasks and agentic workflows” while using ~30% fewer thinking tokens — i.e. it reaches answers with less internal reasoning overhead, which matters for both latency and cost.

The “32B active” part is what makes a 1T model tractable at all: only 32B parameters fire per token. But MoE reduces compute, not memory — you still have to hold (nearly) all 1T parameters in memory to serve the model. That’s the crux of the hardware problem below.


2. The hardware reality check (read this before downloading 339 GB)

Here’s the honest math. Even heavily quantized, K2.7 Code is enormous. From Unsloth’s GGUF builds:

QuantDisk sizeRealistic RAM+VRAM needed
UD-Q2_K_XL (dynamic 2-bit)339 GB~325–350 GB
UD-Q4_K_XL (4-bit)584 GB~600 GB
UD-Q8_K_XL (lossless)595 GB~605 GB
Full precision605 GB605 GB+

The rule of thumb: your combined RAM + VRAM should roughly equal the quant file size. So the smallest usable build wants ~339 GB of memory. That’s not a gaming PC or a MacBook — it’s a high-RAM DDR5 CPU build (e.g. 384–512 GB system RAM) or a multi-GPU server, optionally with system RAM offload.

And even when it fits, throughput at the low end is modest: on a big-memory CPU/offload rig you should expect single-digit tokens per second at 2-bit, with some quality loss from the aggressive quantization. To get the headline >100 tokens/s, Unsloth’s own reference is B200 GPUs — datacenter hardware.

Bottom line: if you have a workstation with 350 GB+ of memory, §3 is for you. If you have datacenter GPUs, §4. If you have a normal machine — even a very good one — skip to §5; you’ll get a better experience through the API or Copilot, and this guide will tell you so honestly rather than pretend a 1T model fits in 24 GB of VRAM.


3. Method 1 — llama.cpp + Unsloth GGUF (big-memory local)

If you have the RAM, the local route is Unsloth’s GGUF quants run through llama.cpp. Note that the Unsloth GGUF is a third-party distribution (widely used and referenced by Moonshot, but not Moonshot’s own upload) — call it out if provenance matters to you.

The recommended command, quoted from the Unsloth docs (note the --mmproj file — that’s the vision projector, since K2.7 is multimodal):

./llama.cpp/llama-cli \
    --model unsloth/Kimi-K2.7-Code-GGUF/UD-Q2_K_XL/Kimi-K2.7-Code-UD-Q2_K_XL-00001-of-00008.gguf \
    --mmproj unsloth/Kimi-K2.7-Code-GGUF/mmproj-F16.gguf \
    --temp 1.0 \
    --top-p 0.95

A few things worth knowing:

  • The model ships as a multi-part GGUF (00001-of-00008 here) — you download all shards; llama.cpp loads the set from the first file.
  • Recommended sampling: --temp 1.0, --top-p 0.95. Use these; K2.7 is a thinking model and behaves best at its intended settings.
  • Suggested context: 98,304 tokens, up to 262,144 (256K). Don’t max the context on a memory-constrained rig — it adds to the memory you already can’t spare.
  • Unsloth Studio and llama.cpp are the primary routes. There’s no first-class Ollama one-liner for this model the way there is for small models; the builds target llama.cpp / LM Studio / Jan for people who have the hardware.

If you’re standing up a serving endpoint rather than a CLI, use llama-server from the same GGUF and hit it as an OpenAI-compatible API.


4. Method 2 — Official vLLM / SGLang (datacenter GPUs)

If you have the GPU memory to run the weights properly (multi-GPU, ideally without extreme quantization), Moonshot documents two official serving paths on the model card. These pull the official weights (moonshotai/Kimi-K2.7-Code), not a repackage:

# vLLM
pip install vllm
vllm serve "moonshotai/Kimi-K2.7-Code"
# SGLang
pip install sglang
python3 -m sglang.launch_server --model-path "moonshotai/Kimi-K2.7-Code"

Both expose an OpenAI-compatible endpoint you can point a coding agent at. This is the route if you’re a team self-hosting for data-control or throughput reasons and you have the hardware budget — a 1T MoE at usable precision is a serious GPU footprint, so plan capacity before you commit.


5. Method 3 — The pragmatic path: Moonshot API or GitHub Copilot

For the overwhelming majority of builders, this is the right answer, and there’s no shame in it — you don’t self-host a 1T model to try it.

  • Moonshot API. K2.7 Code is available at platform.moonshot.ai with OpenAI- and Anthropic-compatible endpoints — so you can point most existing tooling at it by swapping the base URL and model name, no client rewrite.
  • GitHub Copilot. K2.7 Code went generally available in GitHub Copilot — pick it from the model selector in your editor and you’re using it in seconds, with none of the memory math above. (If you’re weighing Copilot’s usage-based costs, see our GitHub Copilot billing guide.)

The honest trade: local gives you data control and no per-token bill; API/Copilot gives you full-quality inference, zero hardware outlay, and you’re running in minutes. Unless you specifically need on-prem or you already own the hardware, start with the API — validate that K2.7 is right for your workload — then decide whether self-hosting is worth 350+ GB of memory.


6. What you get: benchmarks

Moonshot’s own reported numbers for K2.7 Code vs the prior K2.6 (agentic-coding oriented):

BenchmarkK2.7 CodeK2.6
Kimi Code Bench v262.050.9
Program Bench53.648.3
MCP Atlas76.069.4

These are the project’s own results (independent reproduction will follow), but the direction is consistent: a solid generational jump on coding and tool-use/agent benchmarks, achieved while spending ~30% fewer thinking tokens. For agentic coding — long tool-using loops — the MCP Atlas gain and the token-efficiency improvement are the numbers that translate most directly into real-world cost and latency.


7. When local actually makes sense

Be honest with yourself about which bucket you’re in:

  • Self-host locally (Method 1/2) if you already have 350 GB+ of memory or datacenter GPUs, you need on-prem data control, or you’re running enough volume that per-token API cost dominates a hardware amortization. For those, the weights being Modified MIT and openly available is exactly the point.
  • Use the API / Copilot (Method 3) if you’re evaluating, iterating, or running normal volumes on normal hardware — which is almost everyone. You get full-quality K2.7 without the memory bill.

There’s no wrong answer, only a mismatch between your hardware and your route. The failure mode is downloading 339 GB, discovering it swaps to disk at 1 token/second, and concluding the model is bad — when really you just picked the wrong path for your machine.

The takeaway

Kimi K2.7 Code is a genuinely strong 1T-parameter (32B active), Modified-MIT, 256K-context agentic coding model — and running it locally is real but heavy: the smallest usable quant is ~339 GB of combined memory, via Unsloth’s GGUF + llama.cpp, or the official vLLM / SGLang servers on datacenter GPUs. For everyone else — the vast majority — the honest best path is the Moonshot API (OpenAI/Anthropic-compatible) or GitHub Copilot, where K2.7 just went GA. Match the route to your hardware, start with the API to validate the model, and only take on 350 GB of memory once you’re sure it’s worth it.

For genuinely laptop-friendly local coding models, see Ornith-1.0 locally (9B) and Qwen 3.6 for local coding.

Sources

  • moonshotai/Kimi-K2.7-Code — Hugging Face model card — 1T total / 32B active MoE (384 experts, 8/token, 1 shared, 61 layers, MLA), 256K context, Modified MIT license, vLLM/SGLang official serving commands, benchmarks (Kimi Code Bench v2 62.0, Program Bench 53.6, MCP Atlas 76.0), ~30% fewer thinking tokens vs K2.6, platform.moonshot.ai API (OpenAI/Anthropic-compatible)
  • Kimi K2.7 Code — Unsloth docs (Run Locally) — GGUF quant sizes (Q2 339 GB / Q4 584 GB / Q8 595 GB / full 605 GB), RAM+VRAM guidance (~325–350 GB at 2-bit), llama.cpp command with --mmproj, --temp 1.0 --top-p 0.95, suggested context 98,304 (up to 262,144), >100 tok/s on B200s. Third-party GGUF distribution.
  • Kimi K2.7 is now available in GitHub Copilot — GitHub Changelog — general availability in Copilot
  • Benchmark numbers are Moonshot’s own reported results, pending independent reproduction. Hardware figures are Unsloth’s guidance. Verified July 3, 2026 — confirm current details on the official model card and Unsloth docs before relying on them.