Tools · · 2 min read

llama-server now ships with built-in agent tools — `--tools all` makes any local GGUF model a code-editing agent

llama.cpp's server now has built-in execution tools (read_file, write_file, edit_file, exec_shell_command, grep_search, apply_diff). With `--tools all`, any local model becomes a Cursor/Claude-Code-style code agent — no MCP server, no Python wrapper. The catch: it's direct execution on the host, with a 'do not enable in untrusted environments' warning.


A Reddit thread hit r/LocalLLaMA on May 23, 2026 with a finding that surprised even the local-LLM regulars: llama-server now has built-in native tool execution. Not chat-template flags, not jinja transforms — actual execution of exec_shell_command, edit_file, write_file, read_file, grep_search, file_glob_search, apply_diff, and get_datetime directly from the server process.

The flag is one line:

llama-server -m model.gguf --tools all

The server’s README documents the full list and the activation flag (--tools name1,name2,... to enable individual tools, or all for everything). Tools are exposed via an internal /tools REST endpoint that the Web UI calls. The README is blunt about scope: “do not enable in untrusted environments.”

How we got here (chronology matters)

This isn’t a single PR — it’s the second wave of an agent push that started two months ago:

  • March 6, 2026 — PR #18655 (by allozaur) merged full MCP client support into llama-server: 15,285 lines across 147 files. That gave local models an agentic loop that could call external MCP servers — the same protocol Claude Code, Cursor, and Continue use.
  • Now (May 2026) — the --tools family adds built-in tools that execute on the llama-server host itself, no external MCP server required.
  • Still open: issue #21126 (filed March 28, 2026) asks for a --tool-executor flag to wrap execution in a sandbox (Firejail, Podman, Sandboxie). Today, tools run as whoever started the server. No --no-execute-as-root guardrail yet.

What this means if you’re a builder

1. Local agents just got a default backend. Until this, “I want a local Claude Code” required: llama-server for inference + a separate MCP file/shell server + a desktop client to wire them together. That’s three moving parts. With --tools all, the second part is built in, and the agent loop runs against the same OpenAI-compatible API your existing harness already speaks. For anyone running Qwen3.7-Max, Llama 3.x, GPT-OSS, or DeepSeek V3.x locally, this collapses the agent stack to one binary.

2. Don’t expose port 8080 to the internet — full stop. The “do not enable in untrusted environments” warning is doing a lot of work. Today’s llama-server with --tools all is functionally equivalent to giving any HTTP client a shell on the box. No auth, no chroot, no namespaces. If you tunnel it for remote use, terminate TLS + add an auth proxy in front of it; do not bind 0.0.0.0.

3. The “I want a sandbox” story is unfinished. Issue #21126 explicitly proposes a wrapper command (think firejail --net=none -- read_file ...) so each tool call gets containerised. That’s not merged yet. Until it is, the production-safe pattern is: keep --tools all strictly for local dev loops, and for shared infra still use the MCP-server-with-sandbox pattern from the March release.

4. Test it on a coding model first. exec_shell_command is only useful if your model emits well-formed tool calls. Qwen3.x-Coder, Llama 3.3 70B Instruct, and the newer DeepSeek-Coder variants are the realistic short-list. A general-chat model will hallucinate tool calls into the prose and the loop will stall. The MCP-era benchmark for “this model is actually agent-ready” is pass-rate on multi-step file edits — and for local builds the practical bar is “can it run pytest, read the failure, edit the source, re-run, and stop”.

The bigger picture

llama.cpp is now legitimately a full local agent runtime, not just an inference engine. Combined with MCP support from March and the new tool-call execution layer, a Mac Studio or a single H100 can host an entire agent stack — Cursor-Composer-style behavior, but on your own hardware, with your own GGUF weights, and your own filesystem.

For builders comparing local vs cloud agent loops on cost, latency, and privacy, this is the moment where local stops being a research toy and starts being a deployable alternative. The unresolved security model is the only thing that should stop a production team from shipping it tomorrow.

Sources

Source: llama.cpp README