AI coding agents are great at building new things and surprisingly fast at making the codebase unworkable for themselves. Catching it early requires measurement that doesn't change between runs.
A while back I was on a project building a .NET and React platform that was moving fast. Rapid prototyping, phase shifts every few weeks, cleanup deferred because the next idea always mattered more. AI agents were writing most of the code, which was the only way we kept up.
For a while it worked beautifully. Then it stopped.
UI changes started landing in the wrong files. The backend drifted out of step with what the frontend was assuming. The agents had built something fast, and the same agents were now getting confused by what they had built. Asking for a small change increasingly meant fixing three other things first.
The conditions weren’t kind to begin with — rapid prototyping rarely is, and most agent projects don’t fall apart this visibly. But the extreme case is just a louder version of the quiet one. Even modest complexity slows an agent down. It also slows humans down. Complex code is harder for everyone to work in. Find and fix the hotspots early, and everyone moves through the code faster.
The underlying observation has stayed with me, though, because this isn’t only a legacy-systems problem anymore.
Spaghetti is no longer just a legacy story
When people talk about why AI agents struggle on a codebase, they reach for the usual suspects: old systems, missing documentation, complex spaghetti code. All true. All also problems we’ve had since long before the agents arrived.
What’s new is that AI-built codebases drift into the same shape. Sometimes within weeks. Duplicated logic spread across silos because nobody — human or agent — remembered the other place it lived. Functions that grew from twenty lines to two hundred without anyone deciding to let them. Dead code piling up from iterative rewrites, three versions of the same thing in the tree because the agent moved on and nobody removed the old ones. Test coverage that started thin and only got thinner.
The agent that helped build the mess has no special advantage when it comes back to clean it up.
Agent hygiene helps
A lot of this is preventable. Clear agent instructions, well-scoped skills, good CLAUDE.md files, deliberate architectural guardrails — a skilled operator with the right setup ships noticeably less chaos than someone who just points Claude at a repo and hopes. That matters, and I’d rather start there than anywhere else.
But even good practice doesn’t take the drift rate to zero. And it doesn’t give you a way to check whether the guardrails are actually holding, short of opening every file and reading. Hope isn’t a measurement strategy.
We don’t write the code anymore
There’s a quieter version of this problem that I think matters more than the technical one.
When you wrote the code yourself, you trusted it because you remembered writing it. You knew which bits were load-bearing because you’d put them there. You earned that trust by being inside the work.
We don’t write most of it anymore. The agents do. As an experienced developer reviewing what an agent has produced, I can tell you whether the shape looks right — but I can’t tell you, just by reading, whether the duplication is now a problem, whether one function has quietly become a complexity hotspot, whether the naming is consistent with how the rest of the codebase has evolved over the last hundred commits.
Reading isn’t enough. You need a measurement layer that wasn’t part of the code being reviewed.
Deterministic and probabilistic, not one or the other
The instinct in our industry is to throw an LLM at the problem. “Just have Claude grade the codebase.” It works, partly. LLMs are good at judgment, at noticing when something feels off.
But ask the same LLM the same question on Tuesday and Thursday and you’ll get two different answers. Useful for triage, useless as a baseline you can track over time.
What you actually want is both. Deterministic analysis for ground truth — AST parsing returns the same number for the same repo, every time, no matter who’s asking. Probabilistic analysis for judgment on top of those numbers, where LLMs are at their strongest. The dichotomy of “rule-based tools versus LLMs” is a false one. Production systems need both, doing what each is good at.
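To make that split concrete, here’s a minimal sketch in Python. The function names and the rough branch-counting are mine, not any particular tool’s: the deterministic layer walks the AST and returns the same numbers on every run, and the probabilistic layer is nothing more than a prompt built from those numbers, handed to whatever model you use for judgment.

```python
import ast
import textwrap

def deterministic_metrics(source: str) -> dict:
    """Walk the AST and return reproducible numbers: same source in,
    same numbers out, no matter who's asking or when."""
    tree = ast.parse(source)
    functions = [n for n in ast.walk(tree)
                 if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    # Rough stand-in for cyclomatic complexity: count branch points.
    branches = [n for n in ast.walk(tree)
                if isinstance(n, (ast.If, ast.For, ast.While,
                                  ast.ExceptHandler, ast.BoolOp))]
    return {
        "function_count": len(functions),
        "branch_points": len(branches),
        "longest_function_lines": max(
            (n.end_lineno - n.lineno + 1 for n in functions), default=0),
    }

def judgment_prompt(metrics: dict) -> str:
    """The probabilistic layer: the model never measures, it only
    interprets numbers the deterministic layer produced."""
    return textwrap.dedent(f"""\
        Measured facts about this module: {metrics}.
        Which of these, if any, would slow an AI coding agent down here,
        and what would you fix first?""")

if __name__ == "__main__":
    source = open(__file__).read()          # measure this very file
    facts = deterministic_metrics(source)
    print(facts)                            # identical on every run
    print(judgment_prompt(facts))           # the variance lives downstream
```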
What measurable actually looks like
This isn’t hand-wavy. The interesting metrics for AI-readiness reduce to numbers that are reproducible and easy to defend.
Median cyclomatic complexity below 5: healthy. Above 30: the agent is going to have a bad time. Code duplication above 3% is where you start fixing the bug in one place and missing it in three others. Function length averaging under 20 lines means each one fits in context cleanly; averaging over 200 means they don’t. Type annotation coverage above 90% turns what the agent guesses at into what the agent knows. A test-to-code ratio under 0.05 means there’s effectively no feedback loop, and you can’t trust an agent to verify its own work without one.
None of these numbers are arbitrary. They are the dials that, turned the wrong way, predictably degrade what an agent can do for you.
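As a sketch of how those dials turn into something you can check on every run — the metric names and the shape of the input are placeholders, the cut-offs are the ones above:

```python
# Cut-offs from the paragraph above; metric names and the input dict are
# illustrative - feed in whatever your deterministic layer actually emits.
THRESHOLDS = {
    "median_cyclomatic_complexity": ("max", 5),     # healthy below 5
    "duplication_pct":              ("max", 3.0),   # trouble starts above 3%
    "avg_function_length":          ("max", 20),    # fits in context cleanly
    "type_annotation_coverage":     ("min", 0.90),  # guessing becomes knowing
    "test_to_code_ratio":           ("min", 0.05),  # below this, no feedback loop
}

def readiness_flags(metrics: dict) -> dict:
    """True for every dial sitting in its healthy range."""
    flags = {}
    for name, (direction, limit) in THRESHOLDS.items():
        value = metrics[name]
        flags[name] = value <= limit if direction == "max" else value >= limit
    return flags

print(readiness_flags({
    "median_cyclomatic_complexity": 4,
    "duplication_pct": 6.2,
    "avg_function_length": 14,
    "type_annotation_coverage": 0.72,
    "test_to_code_ratio": 0.03,
}))
# complexity and function length pass; duplication, typing and tests get flagged
```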
Security is the same story with a heavier penalty
Agents reproduce what they see. If your codebase contains hardcoded secrets, SQL injection patterns, or vulnerable dependencies, the agent will cheerfully reproduce more of them. The security posture of the codebase becomes the security posture of the agent’s output.
There’s a particular catch with dependencies that nobody talks about enough. LLMs were trained on years of public code, including library versions that have since had CVEs filed against them. The model doesn’t know what it doesn’t know — it’ll suggest a package version that was current in 2023 and look you in the eye while doing it. A deterministic CVE check against a current vulnerability database catches that. The LLM can’t.
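Here’s roughly what that deterministic check looks like, sketched against the public osv.dev query API. The package and version are placeholders for whatever the agent just pinned:

```python
import json
import urllib.request

def known_vulns(name: str, version: str, ecosystem: str = "PyPI") -> list:
    """Ask the osv.dev database whether this exact pinned version has
    known advisories. The database is current; the model's training data
    is not."""
    query = json.dumps({
        "version": version,
        "package": {"name": name, "ecosystem": ecosystem},
    }).encode()
    req = urllib.request.Request(
        "https://api.osv.dev/v1/query",
        data=query,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("vulns", [])

# Placeholder package and version: check whatever just landed in the lockfile.
for vuln in known_vulns("requests", "2.25.1"):
    print(vuln["id"], vuln.get("summary", ""))
```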
The same pattern holds for SBOM generation, increasingly required under the EU Cyber Resilience Act for software sold into the EU market. That’s a parser job, not an LLM job.
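The parser-job point is easy to see in miniature. This toy sketch assumes nothing fancier than pinned name==version lines and emits only a CycloneDX-shaped skeleton, not a complete SBOM:

```python
import json

def sbom_from_pins(requirements_text: str) -> dict:
    """Turn pinned name==version lines into a minimal CycloneDX-shaped
    document. Pure parsing: no judgment involved, reproducible every time."""
    components = []
    for line in requirements_text.splitlines():
        line = line.split("#")[0].strip()       # drop comments and blanks
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        components.append({
            "type": "library",
            "name": name,
            "version": version,
            "purl": f"pkg:pypi/{name}@{version}",
        })
    return {"bomFormat": "CycloneDX", "specVersion": "1.5",
            "components": components}

print(json.dumps(sbom_from_pins("requests==2.31.0\nflask==3.0.0\n"), indent=2))
```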
What we do with this
We’ve built our own tooling along these lines and run it on every client codebase we work on. It produces a readiness score across structural complexity, organisation, documentation, testing, security, and how cleanly the code chunks into AI context windows. Often it’s the first thing we do on a new engagement — it gives us and the client a shared map before we touch anything. A separate post about the product is coming.
The principle is the part that doesn’t depend on which tool you use. If you’re letting agents work on a codebase you haven’t measured, you’re hoping they’ll do well — you’re not managing whether they will. Trust used to come from writing the code yourself. Now it has to come from somewhere else.
Worth knowing where.