Constraint Decay — why AI coding agents nail prototypes but break on production backends (and the 8-framework benchmark that proves it)

Last updated: May 25, 2026. Read time: 9 minutes. What you’ll learn: Why your AI coding agent feels great on a Flask demo and falls apart on a Django service, the specific 30-point performance drop measured across 8 frontier models, the framework-by-framework ranking that decides “agent-friendly vs agent-hostile,” what causes roughly 45% of all logic failures (hint: it’s not the framework), and 5 builder takeaways for choosing model + scaffold + framework so you don’t lose 40% of your agent’s effective IQ.

If you’ve used Cursor, Claude Code, or any agentic coding tool past the demo stage, you have probably noticed something the marketing doesn’t talk about: the more constraints you stack on the task, the worse the agent gets — and the drop is not gradual. It’s a cliff. A new paper from EURECOM (Dente, Satriani, Papotti, arXiv:2605.06445, May 2026) finally measures the cliff with real numbers. They call the phenomenon constraint decay, and it explains a lot.

This article walks through what the paper actually measured (8 models × 2 agent scaffolds × 8 web frameworks × 4 constraint levels, ~5B tokens of evaluation), the headline numbers that matter to builders, and five takeaways you can apply to your own agent workflow today.


1. The pattern you’ve already noticed

Picture two scenarios with the same Cursor / Claude Code / OpenHands setup, same model, same API contract:

Scenario A — “Build me a Conduit blog API.” The agent reads the OpenAPI spec, picks Flask, throws everything in app.py, wires up a SQLite file, returns 95% green on the test suite. You feel like AI just shipped a backend.

Scenario B — same spec, but add three lines: “Use Clean Architecture (separate routes / services / repositories / models). Use PostgreSQL. Use SQLAlchemy as the ORM.” The agent still produces code that compiles. The server still starts. But behavioral tests now pass at 20%. Endpoints return wrong data; ORM queries crash on edge cases; the auth middleware silently swallows tokens.

The paper formalizes this as the L0 → L3 gradient — and shows it holds across every capable model they tested.


2. What the paper actually measured

The authors built a clean experimental setup specifically to isolate “did adding constraints break the agent?” from “is the agent just bad at backends?”

Fixed across all conditions:

  • One OpenAPI 3.0 spec (the RealWorld Conduit API — 19 CRUD endpoints across articles / comments / users / profiles / tags)
  • One behavioral test suite (32 HTTP requests, 291 assertions, stateful sequence)
  • Isolated Docker containers per task (no state leakage)
  • 3 trials per (agent, model, task) combination

Systematically varied:

Level | What the agent is forced to use
L0 | Just the web framework
L1 | + Clean Architecture OR + a specific DB
L2 | + Clean Architecture + a specific DB
L3 | + Clean Architecture + DB + ORM
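
One way to picture the gradient is as a count of structural constraints stacked on top of the framework. The sketch below is a hypothetical encoding, not the paper’s harness: the `TaskConstraints` class, its field names, and the prompt wording are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskConstraints:
    """Hypothetical encoding of one task's structural constraints."""

    framework: str                      # always fixed, even at L0
    architecture: Optional[str] = None  # e.g. "Clean Architecture"
    database: Optional[str] = None      # e.g. "PostgreSQL"
    orm: Optional[str] = None           # e.g. "SQLAlchemy"

    @property
    def level(self) -> int:
        """Number of constraints added on top of the framework (0 to 3)."""
        extras = (self.architecture, self.database, self.orm)
        return sum(value is not None for value in extras)

    def to_prompt(self) -> str:
        """Render the constraints as instruction lines for the agent."""
        lines = [f"Implement the API using {self.framework}."]
        if self.architecture:
            lines.append(f"Follow {self.architecture}.")
        if self.database:
            lines.append(f"Use {self.database} as the database.")
        if self.orm:
            lines.append(f"Use {self.orm} as the ORM.")
        return "\n".join(lines)


prototype = TaskConstraints("Flask")
production = TaskConstraints("Flask", "Clean Architecture", "PostgreSQL", "SQLAlchemy")
print(prototype.level, production.level)
print(production.to_prompt())
```

The point of the encoding is that the spec and test suite never change between `prototype` and `production`; only the instruction lines do.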

Models tested (paired with two agent scaffolds — minimal Mini-SWE-Agent and full-featured OpenHands):

  • Small open code specialists: Devstral-Small (24B), Qwen3-Coder-Next (80B)
  • Large open instruct: Qwen3-235B-A22B
  • Large open agentic SOTA: MiniMax-M2.5, Kimi-K2.5
  • Closed frontier: GPT-5-mini, GPT-5.2

Frameworks: Python (Flask, FastAPI, Django, aiohttp) + Node (Express, Fastify, Hono, Koa) — 80 greenfield tasks total, plus 20 feature-implementation tasks on real RealWorld repos.

Total evaluation budget: roughly 5 billion tokens. This is one of the most expensive agent benchmarks published to date, and it’s designed specifically to expose the gap between “rapid prototype” and “production backend.”


3. The 30-point cliff (with names)

Here is the core table from the paper, simplified to assertion pass rate (A%):

Agent | Model | L0 (none) | L1 | L2 | L3 (full) | Drop
Mini-SWE | GPT-5-mini | 51.7 | 46.8 | 27.1 | 23.7 | −28.0
OpenHands | GPT-5-mini | 65.8 | 63.2 | 56.7 | 52.2 | −13.6
Mini-SWE | Qwen3-Coder-Next | 86.4 | 66.4 | 52.6 | 46.1 | −40.2
OpenHands | Qwen3-Coder-Next | 73.0 | 51.7 | 42.7 | 27.6 | −45.5
Mini-SWE | Qwen3-235B-A22B | 29.6 | 10.7 | 9.1 | 2.3 | −27.3
OpenHands | Qwen3-235B-A22B | 26.2 | 17.7 | 3.1 | 0.8 | −25.4
Mini-SWE | MiniMax-M2.5 | 88.6 | 92.5 | 66.8 | 58.3 | −30.3
OpenHands | MiniMax-M2.5 | 95.6 | 97.0 | 87.3 | 78.6 | −17.0
Mini-SWE | Kimi-K2.5 | 85.4 | 70.9 | 62.9 | 53.7 | −31.7
Mini-SWE | GPT-5.2 | 78.2 | 49.3 | 27.1 | 48.0 | −30.2

Three things jump out:

  1. The drop is universal. Every capable model loses ground. The average across capable configurations (those above 50% at L0) is −30 points absolute, ~40% relative.
  2. The “best in class” matters. OpenHands + MiniMax-M2.5 holds the line best, only losing 17 points and finishing L3 at 78.6% assertion pass rate. Everyone else falls further.
  3. pass@1 is brutal. Assertion pass rate (A%) reports the fraction of test assertions passed; pass@1 requires every assertion to pass, and even the best L3 configuration scores only 8.3% on it. That’s why the paper insists on A% as the primary metric: under pass@1, a single failed assertion zeros out the whole run and hides how much of the backend actually works. It is also why pass@1 still matters to you: a single failed assertion is a bug you’d ship to production.
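
The averages in point 1 can be checked against the table above. This is a minimal sketch, assuming “capable” means an L0 score above 50% and that the relative figure is the mean of per-configuration relative drops; the paper may aggregate differently. The names `RESULTS` and `decay_summary` are illustrative.

```python
# (agent, model): (L0 A%, L3 A%), copied from the table above
RESULTS = {
    ("Mini-SWE", "GPT-5-mini"): (51.7, 23.7),
    ("OpenHands", "GPT-5-mini"): (65.8, 52.2),
    ("Mini-SWE", "Qwen3-Coder-Next"): (86.4, 46.1),
    ("OpenHands", "Qwen3-Coder-Next"): (73.0, 27.6),
    ("Mini-SWE", "Qwen3-235B-A22B"): (29.6, 2.3),
    ("OpenHands", "Qwen3-235B-A22B"): (26.2, 0.8),
    ("Mini-SWE", "MiniMax-M2.5"): (88.6, 58.3),
    ("OpenHands", "MiniMax-M2.5"): (95.6, 78.6),
    ("Mini-SWE", "Kimi-K2.5"): (85.4, 53.7),
    ("Mini-SWE", "GPT-5.2"): (78.2, 48.0),
}


def decay_summary(results, capable_threshold=50.0):
    """Return (count, mean absolute drop, mean relative drop) for capable configs."""
    capable = [(l0, l3) for l0, l3 in results.values() if l0 > capable_threshold]
    abs_drops = [l0 - l3 for l0, l3 in capable]
    rel_drops = [(l0 - l3) / l0 for l0, l3 in capable]
    n = len(capable)
    return n, sum(abs_drops) / n, sum(rel_drops) / n


n, mean_abs, mean_rel = decay_summary(RESULTS)
print(f"{n} capable configs: {mean_abs:.1f} points lost on average, {mean_rel:.0%} relative")
```

With the table’s numbers this works out to 8 capable configurations, about 29.6 points lost on average, and about 39% relative, which matches the rounded “−30 points, ~40%” figures.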

For builders: the gap between A% and pass@1 is the gap between “the agent built most of it” and “the agent built something you can deploy without manual cleanup.” On constrained backends, that gap is enormous.
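
To make the two metrics concrete, here is a small sketch. It assumes pass@1 is estimated as the share of runs with zero failed assertions; the function names and the three sample runs are hypothetical, and only the 291-assertion suite size comes from the benchmark description.

```python
def assertion_pass_rate(passed: int, total: int) -> float:
    """A% for one run: percentage of assertions that passed."""
    return 100.0 * passed / total


def summarize(runs):
    """Return (mean A%, pass@1 %) over a list of (passed, total) runs.

    pass@1 is estimated here as the percentage of runs where every
    assertion passed, which is an assumption about the aggregation.
    """
    mean_a = sum(assertion_pass_rate(p, t) for p, t in runs) / len(runs)
    fully_green = sum(1 for p, t in runs if p == t)
    return mean_a, 100.0 * fully_green / len(runs)


# Hypothetical trials against a 291-assertion suite
runs = [(291, 291), (280, 291), (250, 291)]
mean_a, pass_at_1 = summarize(runs)
print(f"A% = {mean_a:.1f}, pass@1 = {pass_at_1:.1f}")
```

In this made-up example the agent passes about 94% of assertions on average, yet only one run in three is deployable as-is.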


4. Framework choice is half the battle

Across all models and constraint levels, framework alone explains a massive performance swing. The paper aggregates results by framework:

Framework | Avg A% | Tier
Express (Node) | 51.4 | top
Koa (Node) | 50.7 | top
Flask (Python) | 49.3 | top
aiohttp (Python) | 38.4 | mid
Fastify (Node) | 31.7 | mid
Django (Python) | 25.4 | bottom
FastAPI (Python) | 24.2 | bottom
Hono (Node) | 18.5 | bottom

The pattern is clean: frameworks with minimal, explicit APIs (Express, Koa, Flask) are agent-friendly. Frameworks with heavy conventions and implicit configuration (Django’s auto-discovery, FastAPI’s type-hint-driven validation, Hono’s edge-runtime adapters) crush agents.

The authors’ interpretation: agents are good at generating code but bad at inferring what the framework wants. When the framework is a stack of explicit imports + decorators + return values, the agent has a clear target. When the framework is a stack of “this works if you put files in the right directory and name them right,” the agent has nothing to anchor against.

Practical consequence: if you’re picking a framework for an agent-heavy project today, the EURECOM data says Flask / Express / Koa beat Django / FastAPI / Hono by roughly 2× on agent task completion. That’s not a marginal preference — it’s the difference between an agent that ships and an agent that doesn’t.
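
The “roughly 2×” figure follows from the tier averages in the table above. A quick sketch of the arithmetic, with the dictionary and tier names chosen for illustration:

```python
# Average A% per framework, copied from the table above
FRAMEWORK_AVG = {
    "Express": 51.4, "Koa": 50.7, "Flask": 49.3,
    "aiohttp": 38.4, "Fastify": 31.7,
    "Django": 25.4, "FastAPI": 24.2, "Hono": 18.5,
}
TOP = ("Express", "Koa", "Flask")
BOTTOM = ("Django", "FastAPI", "Hono")


def tier_mean(names):
    """Unweighted mean of the per-framework averages in one tier."""
    return sum(FRAMEWORK_AVG[name] for name in names) / len(names)


ratio = tier_mean(TOP) / tier_mean(BOTTOM)
print(f"top {tier_mean(TOP):.1f} vs bottom {tier_mean(BOTTOM):.1f}: {ratio:.1f}x")
```

The top tier averages about 50.5% against 22.7% for the bottom tier, a ratio of about 2.2.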


5. The data layer is where it dies

The paper also classifies failures. Of all failed runs:

  • Logic errors: ~71% (the server starts, routes register, but behavior is wrong)
  • Server startup failure: 12–21%
  • Incomplete implementation: ~5%
  • Everything else: < 10%

Drilling into the logic errors, the root cause distribution for Qwen3-Coder-Next:

Root cause | % of logic errors
Incorrect query logic (wrong joins, filters, dialect issues) | 25.5%
DB/ORM runtime error (ORM API misuse) | 21.2%
Auth misconfiguration (token / header parsing) | 22.6%
Business logic defect | 11.7%
Framework idiosyncrasy | 9.5%
State propagation failure | 9.5%

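Data-layer defects alone (incorrect query logic + ORM runtime errors) account for 25.5% + 21.2% = 46.7% of this model’s logic errors, or roughly 45% as a round figure. Note that this is a share of logic errors, not of all failed runs. It is consistent with the marginal-effect analysis the paper does on each constraint individually: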

Constraint added | Avg Δ on A%
PostgreSQL | −19.3 pp
SQLite | −14.3 pp
Clean Architecture | −9.1 pp
SQLAlchemy | −1.5 pp
Sequelize | −0.6 pp

Specifying a database engine is by far the most punishing constraint. An ORM adds little on top of “use this database,” because the underlying difficulty is the data layer itself, not the framing around it. When you don’t force an ORM, the agent fails at raw SQL. When you do, it fails at ORM API calls. Either way, it fails at talking to the database.


6. Which scaffold + model combinations actually held up

If you read nothing else in this article, read this section.

The configurations that finished L3 at 50% A% or better:

Configuration | L3 A%
OpenHands + MiniMax-M2.5 | 78.6%
Mini-SWE + MiniMax-M2.5 | 58.3%
Mini-SWE + Kimi-K2.5 | 53.7%
OpenHands + GPT-5-mini | 52.2%

The configurations that finished L3 below 30% A% (i.e., you should probably not deploy them onto a constrained backend):

Configuration | L3 A%
Mini-SWE + GPT-5-mini | 23.7%
OpenHands + Qwen3-Coder-Next | 27.6%
Mini-SWE + Qwen3-235B-A22B | 2.3%
OpenHands + Qwen3-235B-A22B | 0.8%

A few patterns from the table:

  • Scaffold matters more than model size for some pairings. GPT-5-mini under Mini-SWE-Agent trails GPT-5-mini under OpenHands at L3 by 28.5 points (23.7% vs 52.2%). Same model, same task, different agent harness.
  • MiniMax-M2.5 + OpenHands is the standout combination. OpenHands is the scaffold with built-in tools (file editing, code search, task tracking), which the paper credits with helping the model handle constraints.
  • General-purpose instruct models (Qwen3-235B-A22B) fall apart on backend constraint work. Even at L0 they’re below 30%. This is a code-specialist domain — don’t use general models here.

If you’re building on DeepSeek specifically (which we covered in the V4 Pro tutorial and the Reasonix breakdown), DeepSeek wasn’t tested in this paper. The bet you’re making is that DeepSeek V4 Pro plus a cache-aware scaffold like Reasonix sits closer to the MiniMax-M2.5 + OpenHands tier than the Mini-SWE + GPT-5-mini tier. There’s no public benchmark on that yet — Reasonix’s 99.82% cache-hit benchmark proves cost efficiency, not constraint adherence.


7. Five builder takeaways

What this paper actually changes about how you should pick model + framework + agent for backend work:

Takeaway 1 — Match scaffold complexity to model size

A minimal scaffold like Mini-SWE-Agent (~100 lines, bash only) gets out of the way of large frontier models. A full-featured scaffold like OpenHands (file edit, code search, task tracking) gives smaller models the structure they need to handle multi-file work.

The data: GPT-5-mini gains 28.5 points at L3 going from Mini-SWE to OpenHands (23.7% → 52.2%). Qwen3-Coder-Next loses 18.5 points making the same move (46.1% → 27.6%). Bigger isn’t always better; the right scaffold for the model matters more.
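
The scaffold effect can be read straight off the Section 3 table for every model that was run under both harnesses. A minimal sketch, with `L3_PASS` and `scaffold_delta` as illustrative names:

```python
# L3 assertion pass rates for models tested under both scaffolds (Section 3 table)
L3_PASS = {
    "GPT-5-mini": {"Mini-SWE": 23.7, "OpenHands": 52.2},
    "Qwen3-Coder-Next": {"Mini-SWE": 46.1, "OpenHands": 27.6},
    "Qwen3-235B-A22B": {"Mini-SWE": 2.3, "OpenHands": 0.8},
    "MiniMax-M2.5": {"Mini-SWE": 58.3, "OpenHands": 78.6},
}


def scaffold_delta(model: str) -> float:
    """Points gained (positive) or lost (negative) moving Mini-SWE -> OpenHands."""
    scores = L3_PASS[model]
    return round(scores["OpenHands"] - scores["Mini-SWE"], 1)


for model in L3_PASS:
    print(f"{model}: {scaffold_delta(model):+.1f}")
```

Two of the four models gain more than 20 points under OpenHands and two lose ground, which is why the pairing is worth measuring per model rather than assuming.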

Takeaway 2 — Pick framework first, then model

The framework ranking (Section 4) is more decisive than any model choice within a tier. If you have to ship a backend with an agent in the loop, pick Flask / Express / Koa unless you have a hard requirement otherwise. Avoid Django and FastAPI when the agent is doing the heavy lifting — pay the framework “tax” later if you can’t avoid it, but don’t ask the agent to pay it.

Takeaway 3 — Treat the data layer as the failure mode, not the framework

If 45% of your failures are going to be query logic / ORM runtime issues, your scaffold should put extra weight on the database step. Practical implications:

  • Pre-create the schema; don’t ask the agent to design it from scratch.
  • Pre-write at least one example query / migration the agent can mimic.
  • Run a smoke test that just creates and reads back one record before letting the agent declare the task done.
  • Pin the ORM and database driver versions explicitly so the model isn’t guessing API shapes from training-data drift.
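
The first and third bullets can be sketched together. The example below is an assumption-laden illustration, not something from the paper: the `articles` schema and the `smoke_test` helper are hypothetical, and it uses Python’s built-in `sqlite3` with an in-memory database so it runs anywhere. For PostgreSQL you would swap in your own driver and connection.

```python
import sqlite3

# Hypothetical pre-created schema, handed to the agent instead of asking it to design one
SCHEMA = """
CREATE TABLE articles (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    slug TEXT NOT NULL UNIQUE,
    title TEXT NOT NULL,
    body TEXT NOT NULL
);
"""


def smoke_test(conn: sqlite3.Connection) -> bool:
    """Insert one record, read it back, then roll back so no data is left behind."""
    expected = ("hello-world", "Hello", "First post")
    cur = conn.execute(
        "INSERT INTO articles (slug, title, body) VALUES (?, ?, ?)", expected
    )
    row = conn.execute(
        "SELECT slug, title, body FROM articles WHERE id = ?", (cur.lastrowid,)
    ).fetchone()
    conn.rollback()  # undo the insert; the schema itself is untouched
    return row == expected


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
print("data layer OK" if smoke_test(conn) else "data layer BROKEN")
```

This only exercises the database round trip. A fuller gate would send the same create-then-read sequence through the agent’s HTTP endpoints before the task is allowed to report success.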

Takeaway 4 — Use pass@1 as your acceptance criterion, not A%

The paper measures A% because it’s a less noisy benchmark metric. But production code is asymmetric: a single wrong endpoint can break every client that depends on it. If you’re evaluating an agent for production work, only count “everything green” runs. By that standard, even the best L3 configuration in this paper succeeds 8.3% of the time, so plan for human review on every agent-generated backend.

Takeaway 5 — Constraint decay applies to YOUR agent loop too

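This paper tested Mini-SWE-Agent and OpenHands. If you’ve built a custom agent loop using Cursor, Claude Code, Aider, or your own scaffold, the qualitative result likely transfers: stacking constraints onto a prompt degrades performance. The decline is not strictly monotonic (in the Section 3 table, MiniMax-M2.5 improves slightly from L0 to L1, and GPT-5.2 rebounds from L2 to L3), but every capable configuration finishes L3 well below its L0 score. The amount of degradation will differ, but the direction is reliable.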

For more on this in the Anthropic ecosystem specifically, see what Microsoft canceling their internal Claude Code licenses tells us about how even the most-funded coding agent setups still face the underlying constraint-decay problem at scale.


8. What the paper doesn’t claim

For honesty’s sake, here is what constraint decay does not prove:

  • It’s not a verdict on any single tool. Cursor, Claude Code, and Aider weren’t directly tested. The phenomenon is reported on Mini-SWE-Agent and OpenHands, with one specific OpenAPI spec.
  • The Conduit API is “easy” by design. It’s a 19-endpoint blog API the authors deliberately chose so the model has prior exposure during pre-training. Real backends with novel domain logic will likely show worse decay, not better.
  • Mitigation is open research. The paper suggests retrieval-augmented framework documentation, constraint-oriented planning, and pre-training on convention-heavy codebases — but doesn’t demonstrate that any of those actually fix the problem. The decay finding is what’s robust; the cures are still hypotheses.
  • No evaluation of DeepSeek V4 Pro / V4 Flash, Claude 4.x, or Gemini 3.x. The closed-model evaluation stops at GPT-5.2. For comparable reasoning across the current closed-model frontier, see our Gemini 3.5 Flash vs Claude Haiku 4.5 deep dive.

The bigger picture

Constraint decay is the empirical name for something every builder using AI agents has felt. The value of the paper isn’t surprise; it’s the measurement. You can now point at “30 points lost on average, roughly 45% of logic failures are data-layer” instead of vibes when you push back on “AI will just write our backend.” It also gives you actionable guidance: choose Flask over Django when an agent is in the loop, pre-create the schema, and accept that scaffold + model interactions are non-trivial.

Where this goes next: every commercial coding agent (Cursor, Claude Code, Aider, the Reasonix-style DeepSeek-native ones) is going to start publishing their own constraint-decay numbers. When they do, this paper is the methodology to compare against.


Sources