Research · 2 min read

Anthropic documented exactly how it sandboxes Claude — and open-sourced the tool

On May 25, 2026, Anthropic published “How we contain Claude across products,” a detailed engineering breakdown of the sandboxes behind claude.ai, Claude Code, and Claude Cowork, along with srt, an Apache-2.0 sandboxing CLI that any agent builder can use today. Here are the mechanisms and what to copy.


On May 25, 2026, Anthropic published “How we contain Claude across products,” a rare, mechanism-level breakdown of the sandboxes it runs behind every Claude surface. On May 30, Simon Willison flagged it as unusually well documented, which pushed it onto Hacker News. For anyone shipping their own coding agent, it doubles as a free threat-model checklist.

Key facts:

  • The post covers three products with three different isolation layers. (Source: Anthropic Engineering, 2026-05-25)
  • claude.ai runs Claude in gVisor containers with seccomp filtering on isolated infrastructure.
  • Claude Code uses OS-level sandboxes: Seatbelt (sandbox-exec) on macOS and bubblewrap on Linux.
  • Claude Cowork runs a full virtual machine via the platform hypervisor: Apple's Virtualization framework on macOS, the Host Compute Service (HCS) on Windows.
  • Anthropic open-sourced the underlying tool as srt (Sandbox Runtime), under Apache 2.0.

Anthropic’s framing is that the model is probabilistic and will eventually be talked into something, so the job is to “cap the blast radius” with deterministic walls around it. The diagram below is the whole thesis: untrusted inputs and outbound actions both get audited at hard, non-probabilistic boundaries, while the model sits inside an environment ceiling it cannot exceed.

Figure: Anthropic's containment model, with an external audit on inputs, a hard-ceiling environment around the probabilistic model, and an external audit on outbound actions (source: Anthropic Engineering).

Three deterministic controls do the heavy lifting. Egress is an allowlist of outbound destinations plus a defensive proxy that inspects API calls. Credentials never enter the sandbox — they stay in the host keychain, and the VM only receives scoped session tokens. Filesystem access is constrained by mount modes (read-only, read-write, read-write-no-delete), with symlinks resolved before path validation.
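The filesystem control can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic's implementation: the `Mode` enum, the `ALLOWED_OPS` table, and the `is_allowed` function are hypothetical names, and the rule that the most specific mount wins is an assumption. The one detail taken from the post is the ordering: resolve symlinks first, then validate the path.

```python
import os
from enum import Enum


class Mode(Enum):
    READ_ONLY = "read-only"
    READ_WRITE = "read-write"
    READ_WRITE_NO_DELETE = "read-write-no-delete"


# Operations each mount mode permits (hypothetical policy table).
ALLOWED_OPS = {
    Mode.READ_ONLY: {"read"},
    Mode.READ_WRITE: {"read", "write", "delete"},
    Mode.READ_WRITE_NO_DELETE: {"read", "write"},
}


def is_allowed(path, op, mounts):
    """Return True if `op` on `path` is permitted by the mounts.

    `mounts` maps a mount root directory to a Mode. Paths outside
    every mount are denied by default.
    """
    # Resolve symlinks and ".." BEFORE validating, so a link that lives
    # inside a mount cannot be used to reach a file outside it.
    real = os.path.realpath(path)
    best_root, best_mode = None, None
    for root, mode in mounts.items():
        real_root = os.path.realpath(root)
        try:
            inside = os.path.commonpath([real, real_root]) == real_root
        except ValueError:  # e.g. paths on different drives
            inside = False
        # Assumption: the most specific (longest) matching root wins.
        if inside and (best_root is None or len(real_root) > len(best_root)):
            best_root, best_mode = real_root, mode
    if best_mode is None:
        return False
    return op in ALLOWED_OPS[best_mode]
```

With a workspace mounted as `Mode.READ_WRITE_NO_DELETE`, a write to a file inside it passes, a delete is refused, and a read through a symlink that points outside the workspace is refused because the check runs on the resolved path.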

The Cowork architecture shows where the VM boundary actually sits. The shipped design (“host-loop”) keeps the agent loop on the host and pushes only code execution into the VM, trading some isolation for usability.

Figure: Full-VM mode versus the shipped host-loop mode for Claude Cowork (source: Anthropic Engineering).

The disclosed gaps are the useful part

Anthropic does not pretend this is solved. Two numbers stand out. Telemetry showed users approved roughly 93% of permission prompts, so a human in the loop is barely a control. And in an internal phishing test, 24 of 25 exfiltration attempts succeeded before the boundaries were tightened. Anthropic also disclosed two specific holes. Hooks in .claude/settings.json could run code before trust was established. And in an “approved-domain” attack, a malicious file used an attacker’s API key to upload data to api.anthropic.com, a destination the allowlist permitted; the patch has the egress proxy verify the token.

Figure: The "approved domain" exfiltration, in which a malicious file uploads data to api.anthropic.com using an attacker's key, and its fix, an egress proxy that checks the token (source: Anthropic Engineering).
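The approved-domain fix shows why a host allowlist alone is not enough: the proxy also has to check whose credential is on the request. The sketch below is a rough Python illustration under stated assumptions. `ALLOWED_HOSTS`, `KEY_CHECKED_HOSTS`, and `egress_allowed` are hypothetical names, the non-Anthropic allowlist entry is a placeholder, and reading the credential from an `x-api-key` header is an assumption about how such a proxy might identify the token, not a description of Anthropic's proxy.

```python
import hmac
from urllib.parse import urlsplit

# Hypothetical allowlist; a real deployment would load this from policy.
ALLOWED_HOSTS = {"api.anthropic.com", "pypi.org"}
# Hosts where the presented credential must match the session's own.
KEY_CHECKED_HOSTS = {"api.anthropic.com"}


def egress_allowed(url, headers, session_key):
    """Decide whether an outbound request may leave the sandbox."""
    host = (urlsplit(url).hostname or "").lower()
    # Step 1: destination allowlist, default deny.
    if host not in ALLOWED_HOSTS:
        return False
    # Step 2: on an approved domain, the token must be the session's own,
    # so an attacker-supplied key cannot turn that domain into a drop box.
    if host in KEY_CHECKED_HOSTS:
        presented = next(
            (v for k, v in headers.items() if k.lower() == "x-api-key"),
            None,
        )
        if presented is None:
            return False
        return hmac.compare_digest(presented.encode(), session_key.encode())
    return True
```

A request to api.anthropic.com carrying the session's own key passes, while the same request carrying any other key is dropped even though the destination is on the allowlist.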

What this means if you’re building agents

The takeaways are concrete. First, don’t rely on approval prompts — at a 93% approval rate they are theater; gate actions deterministically instead. Second, Anthropic’s advice is to “treat project-open, config-load, and localhost listeners the way you’d treat any inbound request from the internet” — your own agent’s config loader is an attack surface. Third, you can stop hand-rolling this: srt (anthropic-experimental/sandbox-runtime) enforces filesystem and network limits on arbitrary processes, MCP servers, and bash commands, with no container required. It’s TypeScript, Apache-2.0, and runs on macOS (sandbox-exec) and Linux (bubblewrap); Windows isn’t supported yet.
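The first two takeaways can be combined into one deterministic gate. The Python sketch below is illustrative only: the action names, the `ALWAYS_ALLOWED` and `TRUSTED_ONLY` sets, and the `decide` and `runnable_hooks` functions are all hypothetical, and which actions belong in which set is a policy choice made up for this example. The point is the shape: default deny, no prompt in the decision path, and project-supplied hooks treated as untrusted input until the workspace is explicitly trusted.

```python
# Hypothetical policy sets; the membership here is an example, not advice.
ALWAYS_ALLOWED = {"read_file", "list_dir"}
TRUSTED_ONLY = {"write_file", "run_hook"}


def decide(action_kind, workspace_trusted):
    """Return "allow" or "deny" from fixed rules, never from a prompt."""
    if action_kind in ALWAYS_ALLOWED:
        return "allow"
    if action_kind in TRUSTED_ONLY and workspace_trusted:
        return "allow"
    # Anything not explicitly listed is denied.
    return "deny"


def runnable_hooks(project_config, workspace_trusted):
    """Return the hooks from a project config that may run right now.

    The config comes from the project directory, so it is handled like
    any other inbound request: nothing runs before trust is established.
    """
    if decide("run_hook", workspace_trusted) != "allow":
        return []
    return list(project_config.get("hooks", []))
```

Called on a freshly opened, untrusted project, `runnable_hooks` returns an empty list regardless of what the config file contains.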

This is the defensive mirror image of the Microsoft Copilot Cowork exfiltration we covered last week — same failure shape, opposite vantage point. If you run Claude Code as a daily driver, the sandboxing controls in our Claude Code guide are the consumer-facing version of exactly these boundaries, and the deeper pattern is the one in our note on constraint decay in coding agents: the agent rarely gets jailbroken, it just does what an attacker politely asked through a channel it was allowed to use.

Sources

Anthropic Engineering, “How we contain Claude across products,” May 25, 2026.