Use AI to write better code, more slowly — the multi-agent code review workflow that beats one-shot generation

Last updated: May 27, 2026. Read time: 9 minutes. What you’ll learn: Nolan Lawson’s 1,162-upvote HN essay inverts the default builder assumption about AI coding. His workflow runs 4-7 sub-agents per PR for review, takes 20+ minutes, and ships fewer lines than naive AI coding. We unpack the workflow, why the EURECOM Constraint Decay paper says it works, why DeepSeek’s V4 Pro cache math makes it newly affordable, and the 5-step recipe builders can copy this week.

The default narrative about AI coding right now is speed: ship faster, write 500-line PRs in an hour, replace junior engineers with agents. Nolan Lawson — Seattle-based programmer at Socket, behind Read the Tea Leaves and the worker-timers / Pinafore projects — published an essay on May 25 that flipped the framing:

“A lot of people seem convinced that the point of AI coding is to write low-quality code as fast as possible… you can use them just as effectively to write high-quality code more slowly.”

It hit 1,162↑ on HN in 36 hours. The reason that number matters: Nolan isn’t an AI skeptic — he’s an experienced shipping engineer endorsing a specific workflow that uses AI more, not less. This article walks through the workflow, lines it up with the published evidence on where AI coding actually fails, and gives you a 5-step recipe you can copy.


1. The inversion in plain terms

Most people doing AI coding today:

  1. Open Cursor / Claude Code / Aider
  2. Ask the model to write a feature
  3. Skim the diff
  4. Ship the PR
  5. Move to the next ticket

Nolan’s bet: steps 3 and 4 are where the value gets destroyed. A 500-line PR that the human “barely understands” — his exact phrase — accumulates bugs, hidden assumptions, and review debt. The model is fast at generating; humans are slow at truly reviewing, and when humans skip the review the bugs compound.

His alternative loop replaces “skim and ship” with multi-agent adversarial review, where you spend the time savings from AI-generated code on running more AI to find problems in it.


2. Nolan’s actual workflow

Reconstructed from his post + the comment thread (heckj, Graham Wheeler, Ollie’s RLS gap incident):

The tools he names

  • Claude Opus 4.7 (xhigh thinking mode) — primary reasoning model
  • Codex GPT-5.5 (high thinking mode) — second-opinion generator
  • Cursor Bugbot — automated bug scanning
  • Matt Pocock’s /grill-me skill — agent that quizzes you until you “understand the entire PR front-to-back”
  • Markdown + Mermaid diagrams — generated by the agents to externalize their understanding

The flow (paraphrased from his post)

  1. Generate the PR the normal way (agent writes code, human reviews surface)
  2. Spawn N sub-agents to find bugs, each with a different lens — security, perf, race conditions, edge cases, data integrity, etc. heckj suggests “5-7 different lenses” in parallel, “wiping the context between sweeps.”
  3. Each sub-agent classifies findings as critical / high / medium / low
  4. Aggregate the findings and triage:
    • Fix all criticals + highs (with the human guiding the agent)
    • Skip mediums / lows where “the juice isn’t worth the squeeze”
    • Abandon the PR if there are so many criticals that the approach itself is misguided
  5. Use /grill-me until you can answer questions about every line — if the agent’s questions stump you, you don’t understand the PR yet
  6. Then ship

Nolan’s blunt summary of the time cost: “I’m happy to wait 20 minutes for a better review!“


3. Why this works — the Constraint Decay evidence

If Nolan’s workflow sounds like “running 5× the agents to be 5× less productive,” it would be hard to defend on velocity metrics alone. The published evidence says velocity isn’t the right metric for production code anymore.

Recall from our Constraint Decay walkthrough — the EURECOM benchmark on 8 models × 8 frameworks:

  • Capable models lose 30 percentage points on assertion pass rate from “unconstrained generation” to “with architecture + DB + ORM constraints”
  • The best L3 configuration (OpenHands + MiniMax-M2.5) only hits 8.3% pass@1 — meaning every assertion passes
  • 45% of failures trace to data-layer defects (incorrect query logic + ORM runtime errors) the surface review misses

What Nolan’s loop does is exactly the bridge from “A% high” (the model gets most assertions right) to “pass@1 high” (every assertion passes). Single-shot review catches the obvious bugs and ships the subtle data-layer ones. Multi-agent review with different lenses catches the long-tail data-layer bugs that one model — even Opus 4.7 — misses on a single pass.

A concrete example from Nolan’s comment thread: Ollie mentioned shipping a Next.js + Supabase PR where the agent flagged a row-level-security (RLS) gap his human review had missed. Without that lens-specific agent, “I would have shipped that.” RLS gaps are exactly the “data-layer correctness” failure mode the Constraint Decay paper categorizes as the leading root cause.


4. Why it’s newly affordable — the DeepSeek + Reasonix math

The obvious objection to “run 5-7 sub-agents per PR” used to be token cost. On Claude Opus 4.7 list pricing, a thorough multi-agent review can burn $1-3 per PR easily. Over a sprint, that adds up.

This is where the DeepSeek V4 Pro permanent discount + Reasonix-style cache-engineered scaffolds change the math.

From the Reasonix benchmark we covered:

  • 435M input tokens / day at 99.82% cache hit rate
  • Actual cost: ~$12 vs ~$61 without cache
  • The system prompt + project context + tool definitions stay stable across calls; only the user query differs

If you architect Nolan’s multi-agent review loop around prefix-cache stability — same system prompt for each lens, same project context, only the lens-specific instruction varies — every additional sub-agent costs essentially the prompt tail in fresh tokens, not the entire context.

Net effect: running 5 sub-agents instead of 1 costs roughly 1.1× the tokens, not 5×. At DeepSeek V4 Pro rates, that’s the difference between “$0.10 per PR” and “$0.11 per PR.” The cost objection evaporates.

The implication isn’t “use DeepSeek for everything” — Opus 4.7 still beats DeepSeek on the hardest reasoning. The implication is the per-lens sub-agent runs can be DeepSeek (cheap + fast + cache-friendly), while the synthesis / final review is Opus 4.7 (premium but only one call). A two-tier setup.


5. The 5-step recipe you can copy this week

Pulled together from Nolan’s flow + Constraint Decay’s failure taxonomy + DeepSeek/Reasonix cost math:

Step 1 — Generate the PR with your best generative model

Use whatever you normally use. Cursor + Claude Opus 4.7, Reasonix + V4 Pro, whatever. Don’t change this step. The fast-generate side stays as-is.

Step 2 — Spawn 5 sub-agents, one per lens

Each gets the same system prompt (cache-friendly) and same PR diff context. Only the lens-specific instruction differs:

LensOne-line instruction
Security”Find auth, injection, RLS, and credential-exposure bugs in this diff. Classify each as critical/high/medium/low.”
Data-layer”Find SQL/ORM correctness, transaction, race, and migration bugs. Classify.”
Edge cases”Find off-by-one, null/undefined, empty-collection, boundary, and timeout bugs. Classify.”
Performance”Find N+1 queries, sync-in-async, memory leak, and quadratic-loop bugs. Classify.”
API contract”Find input-validation, schema-drift, status-code-misuse, and idempotency bugs. Classify.”

Run them in parallel. With prefix-cache, this is ~5 cents on V4 Pro.

Step 3 — Triage with a strict policy

  • Fix every critical and high. No skipping.
  • Skip mediums and lows unless they’re cheap to fix.
  • Abandon the PR if there are 3+ criticals — the approach itself is wrong.

Step 4 — /grill-me until you understand

Run a “quiz me” agent over the now-fixed PR. If it asks something you can’t answer, you can’t ship yet. Go read the code or have the agent explain it until you can.

Step 5 — Ship + monitor

Now you ship. Logs / Sentry / Datadog will tell you if the multi-agent review missed something. Feed that miss back into a new lens for next time.


6. The hard limit — domain knowledge still matters

Commenter Ashton Antony raised the most important warning in Nolan’s thread:

“this approach still requires enough domain knowledge to triage… a junior who can’t distinguish a real race condition from a theoretical one is still going to get overwhelmed.”

This is the catch. The workflow is labor-intensive on the human side, not labor-saving. You’re not saving review time — you’re concentrating your judgment on the highest-value calls. If you don’t have the domain knowledge to make those calls (is this race condition a real production risk, or just a theoretical one?), the sub-agents will flood you with criticals you can’t triage.

Implication for builders:

  • Senior engineers: this workflow probably 2× your effective output (better code, similar speed)
  • Mid-level engineers: probably 1.3-1.5×, depending on triage skill
  • Junior engineers without senior review: net negative — you’ll fix the wrong things, leave the right ones unfixed, and lose time

This is consistent with what we covered in the Microsoft cancels Claude Code post: even at organizations with massive AI budgets, the human-AI interaction model is the bottleneck, not the AI capability.


7. How this slots into the 2026 builder stack

For the cost-engineered version of this workflow, your stack probably looks like:

  • Local on-device pre-review: Cactus + Gemma 4 2B (free, instant, catches dumb obvious bugs before sending to cloud)
  • Per-lens sub-agents: DeepSeek V4 Pro through a Reasonix-style cache-engineered scaffold (~$0.01 per lens per PR)
  • Final synthesis review: Claude Opus 4.7 xhigh OR Codex GPT-5.5 high (single premium call)
  • Local desktop orchestration: Harbor v0.4.19 to wire it all up in one command

For comparison-shopping the agent-tier models that go in the “per-lens sub-agent” slot, see our Gemini 3.5 Flash vs Claude Haiku 4.5 deep dive.


The bigger picture

The Nolan post is the cleanest argument I’ve seen this year for why “AI coding” doesn’t mean “AI replaces the careful programmer.” It means “AI lets the careful programmer be more careful, more often.” The multi-agent review pattern is the operational shape of that idea — and the Constraint Decay benchmark + Reasonix cache math are the empirical and economic conditions that make it shippable in 2026.

If you’ve been measuring AI coding productivity in “lines per hour” or “PRs per week,” this workflow will look like a regression. If you measure in “bugs that reached production” or “PRs that needed urgent hotfixes,” it should be the clearest win in your tooling stack this year.


Sources