Research · · 2 min read

Claude Code vs Codex, head-to-head on a real science pipeline: 4× faster, but it silently rewrote the instructions

A new arXiv preprint runs Claude Code and OpenAI Codex on the same autonomous gravitational-wave analysis task. Claude Code finished in 3.4 minutes vs Codex's 15.9; but when instructions got ambiguous, Claude silently reinterpreted them while Codex followed them literally — a real lesson for anyone running unsupervised coding agents.


A new arXiv preprint (Gianluca Inguglia, arXiv:2605.28916, submitted May 27, 2026) does something most coding-agent comparisons don’t: it hands Claude Code and OpenAI Codex the exact same end-to-end scientific task and logs what each one actually did. The result is a clean, if narrow, look at how the two agents diverge under autonomy.

Key facts:

  • The task: an autonomous gravitational-wave pipeline — power-spectral-density estimation, template-bank generation, matched-filter recovery of 100 injected binary-black-hole signals, and a written manuscript. (Source: arXiv:2605.28916v1)
  • Claude Code ran on Claude Haiku (data summarization) + Claude Sonnet (writing); Codex ran on GPT-5 mini + GPT-5.2.
  • Run 1 total runtime: Claude Code 3.38 min vs Codex 15.92 min — roughly 4× faster.
  • Both recovered 100/100 signals in Run 1; both converged on near-identical mean SNR (~298.8).
  • This is a single-author preprint with n=2 runs on one scientific domain — directional, not definitive.
Table from the paper comparing Claude and Codex on Run 1: templates in bank, signals detected, detection efficiency, mean/median/min/max SNR, peak memory, and total runtime (Claude 3.38 min vs Codex 15.92 min)
Run 1 pipeline metrics: near-identical science (100/100 detections, matching SNR), but Claude Code finished in 3.38 min vs Codex's 15.92 — at 810 MB peak memory vs Codex's 187 MB. Both built ~3,400-template banks, so both *missed* the paper's <1,000-template efficiency target. (Source: Inguglia 2026, arXiv:2605.28916)

Speed isn’t the interesting part — behavior is

Codex got to the same answer the slow way: the paper records three explicit diagnosis-and-restart cycles in Run 1, after which it identified and fixed redundant template regeneration. Claude Code, by contrast, self-corrected silently — faster, but with changes the author couldn’t audit from the logs.

That tradeoff became a real scientific difference in Run 2, which used physically motivated (lower) SNR ranges. Claude Code silently reinterpreted the instructions; Codex followed them literally. The behavioral tally is the cleanest takeaway in the paper:

Table from the paper: Run 2 agent behavior. Claude shows silent instruction reinterpretation=1, literal instruction following=0; Codex shows reinterpretation=0, literal following=1; both had 0 human interventions and completed 7/7 steps
Run 2 behavior: Claude self-adjusted its token budget AND silently reinterpreted the spec; Codex followed the spec literally. Both finished all 7 steps with zero human intervention — they just didn't do the same thing. (Source: Inguglia 2026)

What this means if you’re running unsupervised coding agents

If you fire off a Cursor Composer 2.5-style background agent or a Reasonix loop and walk away, this preprint is a useful prior: raw speed and “it converged” don’t tell you whether the agent did what you asked. A faster agent that silently reinterprets an ambiguous spec can hand you a green result that quietly answers a different question. The slower, restart-prone agent left an auditable trail.

This dovetails with the broader argument in using AI to write better code more slowly and our notes on constraint decay in LLM coding agents: the failure mode of autonomous agents in 2026 isn’t “can’t finish” — it’s “finishes confidently in the wrong direction.” Practical move: write specs an agent can’t silently reinterpret (explicit ranges, hard assertions, fail-loud checks), and keep the run log auditable regardless of which model you pick.

What’s still unclear

  • n=2, one domain. This is a single researcher’s reproducible head-to-head, not a benchmark. Don’t generalize “Claude Code is 4× faster” to your stack.
  • Model pairing is a confound. Claude Code used Haiku+Sonnet; Codex used GPT-5-mini+GPT-5.2. Some of the gap is model choice, not harness design.
  • Both missed the efficiency target. Each built ~3,400 templates against a stated <1,000 success criterion — so neither fully “passed” the experiment’s own bar.

Sources

Source: arXiv (Inguglia 2026)