Engineering | June 29, 2026 | 7 min read | Agimon editorial team

After Six Months of Vibe Coding, Nobody Can Explain Your Repo Anymore

The code keeps compiling. What breaks first is the shared map of what lives where and why, and vibe coding erodes that map long before it touches the syntax.

A contractor joins your team in month seven. Their first task is small: add a service. They ask the obvious question, the one any careful engineer asks before touching an unfamiliar repo. Where does this go?

Three of your engineers answer. They give three different answers. None of them is wrong. Each one is reasoning honestly from the part of the codebase they know best, and the parts they know best no longer agree with each other. The contractor picks one, ships it, and the auth check they needed turns out to already exist in two other files, implemented two other ways, for reasons nobody present can fully reconstruct.

Nothing here is a bug. The tests pass. The build is green. The product works in front of customers this afternoon. And yet something has clearly gone wrong, because a simple question, one that has a single correct answer in a healthy codebase, now produces a small committee meeting.

What broke is not the code. It is the system map: the shared understanding of what lives where and why, the thing that lets one person look at a repo and tell you its shape without disclaimers. That map eroded quietly over six months of vibe coding, and the code kept compiling the entire time it was disappearing.

This is the failure mode worth naming, because the thing teams reach for to describe it (technical debt) is exactly the wrong frame. Debt is a number you pay down. What you have lost is not a balance. It is a map.

The vibe coding failure that does not show up in tests

Start with how this gets described by the people living inside it, before any researcher names it.

On Hacker News, a developer described the texture of it precisely: a workflow of constant AI-assisted patches applied to bugs whose root cause nobody ever addresses, accumulating "massive amounts of code that they don't understand," with abstraction layers leaking across components that were supposed to stay separate. [4] Read that carefully. The complaint is not that the AI writes bad code. The complaint is that the code works well enough to ship and accumulates faster than anyone's understanding of it. Each patch is locally sensible. The sum is a repo that has outrun its own authors.

There is a name for the structural version of this, and it comes from people who measure software quality for a living. Jasper Geurts, CTO of the Software Improvement Group, defines it cleanly: "An architecture drift is that step by step, and commit by commit, you get to a situation where your architecture is different from the intentions." [2]

Sit with the mechanism in that sentence. Not one bad decision. Not a negligent engineer. Step by step, commit by commit, each move defensible on its own, arriving somewhere nobody chose. This is why your test suite is no help. A test asserts that a unit behaves as specified. It says nothing about whether the system still has a coherent shape, because shape is not a property any single unit can hold. Two correct implementations of the same idea both pass. The linter is satisfied. Coverage is up. The architecture has drifted anyway, and the only instrument that would have caught it was a person who could hold the whole repo in their head and notice the divergence. That person is exactly who you no longer have.

The reason this accumulates without ever tripping an alarm is structural, and it is tied directly to how the agents doing the writing actually work.

Context windows cannot hold a codebase

An AI coding agent operates inside a context window. Whatever you can fit into that window is what the agent reasons over; everything else in the codebase is, functionally, invisible to it during that session. This is not a flaw to be patched in the next model. It is the shape of the tool. A context window large enough to hold a real production codebase, with full fidelity, across every session, does not exist, and even when windows grow, the codebase grows with them.

So every session optimizes for local correctness. The agent solves the problem in front of it, with the context it was given, and it solves it well. Then the session ends, the window clears, and the next task arrives with no memory of the shape the last one assumed. The agent re-derives a structure from whatever it can see now. Multiply that across hundreds of sessions and you get a repo assembled from hundreds of locally reasonable decisions, none of which were made with the others in view. No agent planned the global shape. Nobody else did either.

The data on what this produces is not subtle. GitClear analyzed 211 million changed lines of code from 2021 through 2024, drawn from Google, Microsoft, Meta, and enterprise repositories. Over that window, copy-pasted code climbed from 8.3% of changed lines to 12.3%, while refactoring-associated code, the work that consolidates duplication and keeps a codebase legible, fell from 25% of changed lines to under 10%. That made 2024 the first year on record in which duplication exceeded refactoring. [1]

Figure: GitClear data, 2021 to 2024. Code duplication rose from 8.3% to 12.3% of changed lines while refactoring-associated code fell from 25% to under 10%. In 2024, duplicated code overtook refactoring in changed lines for the first time on record. Source: GitClear, 2025.

Read those two lines together and the mechanism becomes visible. Duplication is what local correctness looks like at scale: rather than find and reuse the utility that already exists somewhere out of view, the agent writes a fresh one that works. Refactoring is the opposite move, the act of stepping back and re-imposing a global shape, and it is precisely the work that a context window cannot motivate because it requires seeing the whole. The consolidating work is disappearing from the record at the same moment the duplication it would absorb is rising.
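To make "a fresh utility that already exists out of view" concrete, here is a minimal sketch of how a team might surface exact clones in a Python repo. It is an illustration, not GitClear's methodology: the file paths and function names are hypothetical, and it only catches functions whose bodies are structurally identical down to the identifiers. Real clone detectors tolerate renamed variables and small edits.

```python
import ast
import hashlib
from collections import defaultdict


def strip_docstring(body):
    """Drop a leading docstring so documentation differences do not hide a clone."""
    first = body[0] if body else None
    if (isinstance(first, ast.Expr)
            and isinstance(first.value, ast.Constant)
            and isinstance(first.value.value, str)):
        return body[1:]
    return body


def find_clones(sources):
    """Group functions with structurally identical bodies across files.

    `sources` maps a file path to that file's source text.
    """
    groups = defaultdict(list)
    for path, code in sources.items():
        for node in ast.walk(ast.parse(code)):
            if not isinstance(node, ast.FunctionDef):
                continue
            body = strip_docstring(node.body)
            if not body:
                continue  # docstring-only stubs are not meaningful clones
            # ast.dump omits line numbers by default, so position does not matter.
            dumped = "".join(ast.dump(stmt) for stmt in body)
            digest = hashlib.sha256(dumped.encode("utf-8")).hexdigest()
            groups[digest].append(f"{path}:{node.name}")
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


# Hypothetical repo contents: the same admin check written twice, out of view.
sources = {
    "api/auth.py": "def is_admin(user):\n    return user.get('role') == 'admin'\n",
    "billing/guards.py": (
        "def check_admin(user):\n"
        "    '''Admin gate.'''\n"
        "    return user.get('role') == 'admin'\n"
    ),
    "reports/util.py": "def total(xs):\n    return sum(xs)\n",
}
print(find_clones(sources))
# prints [['api/auth.py:is_admin', 'billing/guards.py:check_admin']]
```

A report like this finds the duplicates, but it cannot say which copy is the intended home. That decision still needs someone who holds the global shape.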

At the extreme, the Software Improvement Group examined a Cursor experiment that generated more than 3 million lines of Rust in a single week. They scored the result: maintainability 1.3 out of 5, which places it in the bottom 5% of all systems SIG analyzes, and architecture quality 2.1 out of 5. [2] The code was generated. It presumably ran. As a system, it scored near the floor of everything a firm that does nothing but measure software quality has ever seen.

The most coherent response to this data makes a good point, and it deserves a direct answer.

The context-engineering argument and its ceiling

That response is context engineering, and it is the most thoughtful position in this debate.

Sequoia's Inference newsletter made the case directly in a piece titled "Vibe Coding Needs Context Engineering": the problem is not vibe coding itself but the absence of structure around it. "Intuition does not scale, structure does." [5] The remedy is to externalize the system map into persistent context files (a CLAUDE.md, a .cursorrules, an architecture spec) that every session reads before it generates a line. Give the agent the global picture the context window cannot hold on its own, and its local decisions stop contradicting the whole. This is real, and teams that do it produce more coherent code than teams that do not. I am not going to pretend otherwise.

Here is the ceiling it hits. A context file is only useful while it is accurate, and keeping it accurate is documentation labor: the exact work that vibe coding was sold as eliminating. Someone has to write the map and keep writing it, every time the codebase moves. And the codebase moves faster than the documentation, by construction, because the whole point was to ship faster than you can write things down. So the file drifts from the territory. Now the agent is anchored to a map that is confidently wrong, which is worse than no map, because a wrong map is followed. The structural incoherence does not stop. It just acquires an authoritative-looking source.
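One narrow slice of that drift can be checked mechanically. The sketch below assumes a context file that names repo paths in backticks, which is a convention invented for this example and not a requirement of CLAUDE.md or .cursorrules. It reports the named paths that no longer exist on disk. It does not detect the more damaging case, where a path still exists but the description of what lives there is wrong.

```python
import re
import tempfile
from pathlib import Path

# Assumption: the context file mentions repo paths in backticks, e.g. `src/auth/`.
# Only backticked tokens containing at least one slash are treated as paths.
PATH_PATTERN = re.compile(r"`([\w./-]+/[\w./-]*)`")


def stale_references(context_text, repo_root):
    """Return paths named in the context file that no longer exist under repo_root."""
    mentioned = sorted(set(PATH_PATTERN.findall(context_text)))
    return [p for p in mentioned if not (Path(repo_root) / p).exists()]


# Demo against a throwaway directory standing in for a repo (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "src" / "auth").mkdir(parents=True)
    context = "Auth lives in `src/auth/`. Shared helpers live in `src/common/utils.py`."
    print(stale_references(context, root))
    # prints ['src/common/utils.py']
```

Run in CI, a check like this would flag the cheapest kind of staleness early. Keeping the prose of the map accurate remains the documentation labor described above.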

If context engineering is the proposed fix, it helps to understand why agents need such explicit anchoring in the first place: they execute the instructions you write rather than the intentions you hold, which is exactly why an out-of-date context file does damage instead of getting questioned. The agent does not notice the map is stale. It follows it.

The second common reassurance is testing. Strong coverage, the argument goes, catches structural drift before it compounds. It does not, reliably. Two correct implementations of the same utility, both fully tested, diverge over months while every test stays green. The suite verifies behavior, not coherence. A duplicated, slowly diverging system can be green to the last assertion and still be a repo no one can explain.
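A small illustration of how that happens, using two hypothetical session-expiry helpers written in different modules at different times. Each has its own passing tests, and neither suite touches the boundary where the two copies disagree.

```python
def is_expired_api(expires_at: int, now: int) -> bool:
    """Check written in one module: expired at or after the deadline."""
    return now >= expires_at


def is_expired_billing(expires_at: int, now: int) -> bool:
    """Check written later in another module: expired strictly after the deadline."""
    return now > expires_at


# Each module's own tests, covering the cases its author had in mind. All pass.
assert is_expired_api(expires_at=100, now=150) is True
assert is_expired_api(expires_at=100, now=50) is False
assert is_expired_billing(expires_at=100, now=150) is True
assert is_expired_billing(expires_at=100, now=50) is False

# The boundary that neither suite covers is where the two copies diverge.
disagree = is_expired_api(100, 100) != is_expired_billing(100, 100)
print(disagree)
# prints True
```

Both functions are defensible and both suites are green. Nothing in either test file knows the other implementation exists, so no assertion can fail on the inconsistency.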

All of which matters because the divergence does not stay theoretical.

What losing the system map actually costs

This is where the financial metaphor does the most damage, because it undersells the bill.

The measured costs are real enough on their own. A study of open-source developer activity before and after Copilot adoption (arXiv:2510.10165) found a 19% decline in original code productivity among senior developers, meaning the thoughtful net-new work rather than raw output. The same study found a 6.5% increase in the volume of code those experienced contributors had to review. [3] SIG, drawing on industry analysis, puts the increase in technical debt at 30 to 41% in the year following AI coding tool adoption. [2] Notice the shape of the senior-developer number: the expensive people spend less time building and more time reading code they did not write and cannot fully vouch for. That is the system map eroding, denominated in the salaries of the people who used to hold it.

But the costs a technical PM actually feels do not arrive as percentages. They arrive as three specific failures:

New features break unrelated ones. The implicit dependencies between modules used to live in someone's head. When that mental model fragments, a change in one place quietly violates an assumption in another, and you find out in production rather than in review.
Onboarding slows to a crawl. There is no longer an authoritative answer to "where does this go," so reasonable engineers give reasonable, contradictory answers, and every new hire pays the cost of reconstructing a map that no longer exists as a single thing.
The repo can no longer be handed off. Its navigability lives only as a fading consensus among the people who were present for each decision. Lose two of them and you lose a meaningful fraction of the only documentation that was ever real.

This is a different layer of a speed problem teams already know: building the wrong thing faster. That failure is about direction: shipping the wrong product quickly. This one is about structure: building whatever you build incoherently. They are distinct, they compound, and a small team moving fast with AI is exposed to both at once.

Here is the practical difference the framing makes. A team that calls this technical debt treats it as a maintenance problem: schedule a cleanup sprint, pay it down, move on. That instinct is wrong here, and it is wrong in a way that wastes the sprint. You cannot refactor your way back to a mental model that no one is holding. The files were never the asset that went missing.

A team that recognizes it as drift in the system map treats it as an architectural problem: something that requires deliberate, ongoing attention to the global shape of the repo, not a one-time payment against a balance. The interventions are not the same. One cleans functions. The other rebuilds the shared understanding of what the functions add up to, and keeps rebuilding it, because the thing that erodes it has not gone anywhere.

I am not going to hand you a checklist. The useful move, the only one available before the costs compound further, is to stop mistaking a missing map for a messy room. If you finish this with a sharper name for the thing you have been feeling since around month four, that is the whole point. You cannot fix what you are still calling by the wrong name.

If your repo has quietly become a rumor, send this to whoever still thinks the green build means the architecture is fine.

References

[1] GitClear. "AI Copilot Code Quality: 2025 Data Suggests 4x Growth in Code Clones." https://www.gitclear.com/ai_assistant_code_quality_2025_research. Accessed 2026-06-29.
[2] Software Improvement Group. "The next wave of technical debt is architectural, and AI is accelerating it." https://www.softwareimprovementgroup.com/blog/architectural-debt-ai/. Accessed 2026-06-29.
[3] Xu, Feiyang, et al. "AI-Assisted Programming Decreases the Productivity of Experienced Developers by Increasing the Technical Debt and Maintenance Burden." arXiv:2510.10165. https://arxiv.org/abs/2510.10165. Accessed 2026-06-29.
[4] Hacker News. "Vibe coding and agentic engineering are getting closer than I'd like." https://news.ycombinator.com/item?id=48037128. Accessed 2026-06-29.
[5] Sequoia Capital. "Vibe Coding Needs Context Engineering." Inference newsletter. https://inferencebysequoia.substack.com/p/vibe-coding-needs-context-engineering. Accessed 2026-06-29.