# We gave Replicas our entire backlog

What happened when we ran our real product backlog through Replicas and graded every pull request.

- Author: Saai Arora
- Published: 2026-05-16
- Category: Experiment
- Canonical: https://tryreplicas.com/blog/we-gave-replicas-our-entire-backlog

Connor and I have been thinking a lot recently about self-maintaining codebases. You might have caught on if you follow me on Twitter.

<img src="/blog/self-maintaining-codebase-tweet.png" alt="Tweet from Saai Arora describing what a self-maintaining codebase looks like" />

It started when Ramp published a piece about Ramp Sheets. They used over 1,000 AI-generated monitors, one for every 75 lines of code, to monitor production, triage alerts, and propose fixes for bugs in real time.

We could not stop talking about it for a week.

Because if you really think about it, "self-maintaining" is not one thing. It is a stack. Long-term memory so the agent knows what it has already tried. Background execution so it can work without a human watching. Triggers from where work actually originates: Linear, Slack, GitHub. And a verification loop so it does not ship slop.

We have been building most of those pieces at Replicas for the past few months without calling it "self-maintaining." Background agents in sandboxed VMs. Triggers from every surface engineers actually live in. Environments so the agent can verify locally before opening a PR. Greptile in the review loop. The pieces were there. We just had not pointed them at a real backlog and watched.

So we did.

We took the entire triage queue from our own Linear. Every open bug, every improvement, every feature request. Then we handed all of it to Replicas. Every single one. No cherry-picking, and no warming up. If the backlog is what self-maintaining has to chew through, then the backlog is what we were going to evaluate against.

## What we ran

Every ticket got at least one autonomous Replicas pass on Claude Opus 4.7. If the first pass did not land, the follow-up iterations ran on Codex GPT-5.5. Greptile sat in the loop as automated code review the whole time. When it flagged something, Replicas read the comment and tried to fix it without anyone stepping in. Only after that loop wrapped did I open the PR myself and grade it.
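The loop described above can be sketched roughly as follows. This is a minimal illustration, not the Replicas implementation: `run_ticket`, `pick_model`, the `"opus"` and `"codex"` labels, and the `write_pr` and `review` callables are all hypothetical names, and the sketch assumes that any open reviewer comment counts as "did not land."

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical labels, not real model identifiers.
FIRST_PASS_MODEL = "opus"
FOLLOW_UP_MODEL = "codex"


@dataclass
class Attempt:
    model: str
    review_comments: list[str]


@dataclass
class TicketRun:
    ticket_id: str
    attempts: list[Attempt] = field(default_factory=list)
    clean: bool = False  # True once a pass comes back with no review comments


def pick_model(pass_index: int) -> str:
    """First pass goes to one model, every follow-up pass to the other."""
    return FIRST_PASS_MODEL if pass_index == 0 else FOLLOW_UP_MODEL


def run_ticket(
    ticket_id: str,
    write_pr: Callable[[str, str, list[str]], str],
    review: Callable[[str], list[str]],
    max_passes: int = 3,
) -> TicketRun:
    """Write a PR, let the reviewer comment, and retry until clean or out of passes."""
    run = TicketRun(ticket_id)
    feedback: list[str] = []
    for pass_index in range(max_passes):
        model = pick_model(pass_index)
        pr = write_pr(ticket_id, model, feedback)  # agent sees prior comments
        feedback = review(pr)                      # automated review step
        run.attempts.append(Attempt(model, feedback))
        if not feedback:
            run.clean = True
            break
    return run


# Stub author and reviewer, for illustration only.
def fake_write_pr(ticket_id: str, model: str, feedback: list[str]) -> str:
    return f"{ticket_id}:{model}:{len(feedback)}"


def fake_review(pr: str) -> list[str]:
    # Pretend the reviewer only objects to a PR written with no prior feedback.
    return ["handle the empty state"] if pr.endswith(":0") else []


result = run_ticket("TICKET-1", fake_write_pr, fake_review)
print([a.model for a in result.attempts])  # ['opus', 'codex']
```

Human grading sits outside this function on purpose: in the experiment, the PR was only graded after the automated loop had finished.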

The bar was strict: would I merge this from an engineer on our team?

## Five buckets

- **One-shot (50%):** Mergeable as-is from the first PR.
- **Close (15%):** Needed a trivial human fix to land.
- **Iterated (23%):** Multiple rounds of feedback before mergeable.
- **Wrong (5%):** Looked right but was not, or I had to redo it.
- **Abandoned (7%):** Closed without merging.
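As a quick check on the arithmetic, the shares above can be tallied like this. The grouping into "landed" and "did not land" is an assumption made for illustration (it treats One-shot, Close, and Iterated as eventually mergeable); it is not a figure reported elsewhere in this post.

```python
# Bucket shares as listed above, in percent of graded tickets.
BUCKETS = {
    "one_shot": 50,
    "close": 15,
    "iterated": 23,
    "wrong": 5,
    "abandoned": 7,
}

# Assumed grouping: buckets that ended with a mergeable PR.
LANDED = ("one_shot", "close", "iterated")


def landed_share(buckets: dict[str, int]) -> int:
    """Percent of tickets that eventually produced a mergeable PR."""
    return sum(buckets[name] for name in LANDED)


def not_landed_share(buckets: dict[str, int]) -> int:
    """Percent of tickets in every other bucket."""
    return sum(share for name, share in buckets.items() if name not in LANDED)


assert sum(BUCKETS.values()) == 100  # the five buckets cover every ticket
print(landed_share(BUCKETS))      # 88
print(not_landed_share(BUCKETS))  # 12
```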

## Outcome


That said, these numbers come from our codebase, our tickets, and our scoring. A different team running the same experiment on their stack would get a different result. The point is that this is the actual hit rate when you run a real product team's actual backlog through a real background agent, not a set of curated SWE-bench problems.

What is more interesting than the number is which tickets landed where, and why.

## What worked

The clean one-shots were boring in the best way. UI changes with clear specs, bug fixes where the solution was all in one file, trivial wiring. The kind of work that is high-volume in any backlog and that I genuinely do not want to spend my time on.

A few moments stood out as more than "the agent followed instructions."

- **Replicas self-tested a fix without being told to.** On REP-503, fixing a batch of React lint rules, it spun up the app inside its sandbox, ran through it once, and only then opened the PR. We had not asked it to test. It just did.

- **It attached screenshots unprompted.** On REP-417, a kanban board feature, it took screenshots of the working UI and dropped them into the PR. Nobody asked.

- **It opened its own GitHub issue.** This is the one that made Connor and me sit up. On REP-372, adding OpenCode as a supported harness, Greptile reviewed the PR and flagged a structural issue broader than the ticket. Replicas read the comment, agreed, declined to merge, and opened a new GitHub issue describing the broader problem.

This is the behavior we had been trying to build toward without quite knowing what to name it. It is not "agent writes code." It is "agent participates in the engineering process." A single agent writing a single PR in isolation cannot maintain a codebase. An agent that knows when to write a PR, when to file an issue, and when to stop, can.

## What did not work

Almost every failure collapses into the same pattern. Three categories.

- **We gave a bad brief.** REP-444 was a naming inconsistency in our audit log. The ticket lumped together a naming issue with a separate request that was already done. The agent did not disambiguate. The ticket was the problem.

- **We did not tell it to verify.** REP-434, letting Claude Code call exitPlanMode and reflecting it in our UI, is the one I keep coming back to. I asked it to test, but not to verify the feature end to end. It tested the SDK in isolation, which means it never actually checked whether the feature worked from the dashboard. Three iterations later, we had a working but ugly UI.

- **We cut the loop short.** REP-353 was an admin usage dashboard. About medium complexity, fully tractable. My note while triaging: this is definitely something it would have one-shotted if we had told it to keep working until it was finished. Instead it produced a half-built version and stopped.

**The pattern underneath all of it:** the agent followed our instructions, and our instructions were the bottleneck.

**The two genuinely wrong cases, REP-512 and REP-484,** both needed deep familiarity with parts of our codebase that the agent was not grounded in. Real capability gaps. But a much smaller share of the failures than I would have guessed.

## The recovery rate matters as much as the one-shot rate

When Opus did not land a ticket on the first pass, we swapped to Codex GPT-5.5 for the iteration step. Roughly half of those iterations finished in one or two passes after the swap.

This cuts against the way most people frame the "which model is best" debate. The answer is neither, alone. Opus was the better first-pass author because it usually handles ambiguous tasks better. When it landed, it landed clean, with the kind of judgment that produced the unprompted screenshots and the auto-filed GitHub issue. Codex was the better cleanup model. When Opus left something half-built, Codex was faster at converging on a fix.

You do not get this composition by picking a winner. You get it by building a harness that routes work between models based on what stage of the loop you are in.

Single-model agents are about to look the way single-model chatbots looked two years ago. Fine, but obviously leaving capability on the table. The next layer up is orchestration.

## The model is rarely the bottleneck

Sit with the failure list for a moment: bad brief, no verification step, loop cut short, no screenshot, no tests.

The model is rarely the bottleneck. It is everything around it.

This contradicts how most of the discourse on coding agents is framed right now. The question everyone asks is "is Opus 4.7 better than GPT-5.5 at coding." It is the wrong question. The agent's raw capability matters less than the system around it. How cleanly the ticket is written. How aggressively it is verified. How long the leash is. Whether it knows when to escalate versus when to merge. And which model is doing which job at which point in the loop.

A great model in a bad harness can ship slop. Two good models in a great harness will ship an entire backlog.

## What this means for self-maintaining

After running this experiment, I think self-maintaining codebases are closer than I expected. With one important asterisk.

The 50% one-shot rate is not the rate at which a self-maintaining system could currently operate, because every one of those PRs still passed through me. The number for true self-maintenance is much closer to zero today. Nothing in our stack would have known which of those PRs to auto-merge and which to escalate. The agent's output was good. The judgment around that output was still mine.

So the question becomes: what would let us pull the human out of that judgment loop, safely?

- **Better briefs, automatically.** A pass that rewrites a Linear ticket into a fully scoped brief, with acceptance criteria the agent will be measured against.
- **Mandatory verification.** The next step is verifying by default, with the type of verification chosen by the kind of change.
- **Model routing inside the loop.** The Opus to Codex pattern we hit by accident is something the harness should do on purpose.
- **A confidence signal at PR time.** The agent should know, and tell us, when it is confident enough to auto-merge versus when it should pause. The REP-372 moment is the prototype.
- **Longer loops, not shorter.** REP-353 told us our leash is too short. Self-maintaining requires the agent to iterate against its own output without a human kicking off each round.
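To make two of these concrete, here is a minimal sketch of verification chosen by kind of change, feeding a merge-or-escalate decision. Everything in it is hypothetical: the `ChangeKind` values, the check names in `REQUIRED_CHECKS`, the self-reported `confidence` field, and the 0.9 threshold are placeholders for illustration, not anything we have shipped.

```python
from dataclasses import dataclass
from enum import Enum


class ChangeKind(Enum):
    UI = "ui"
    BACKEND = "backend"
    LINT = "lint"


# Hypothetical mapping: which checks must pass for each kind of change.
REQUIRED_CHECKS = {
    ChangeKind.UI: {"run_app", "screenshot"},
    ChangeKind.BACKEND: {"unit_tests", "end_to_end"},
    ChangeKind.LINT: {"lint", "run_app"},
}


@dataclass
class PullRequestReport:
    kind: ChangeKind
    checks_passed: set[str]
    open_review_comments: int
    confidence: float  # self-reported by the agent, 0.0 to 1.0 (assumed signal)


def decide(report: PullRequestReport, threshold: float = 0.9) -> str:
    """Return 'keep_working', 'escalate', or 'auto_merge' for a finished pass."""
    missing = REQUIRED_CHECKS[report.kind] - report.checks_passed
    if missing:
        # Verification is not optional: keep iterating until every check passes.
        return "keep_working"
    if report.open_review_comments > 0 or report.confidence < threshold:
        # Verified, but a reviewer or the agent itself is unsure: hand to a human.
        return "escalate"
    return "auto_merge"


ui_without_screenshot = PullRequestReport(ChangeKind.UI, {"run_app"}, 0, 0.95)
verified_but_flagged = PullRequestReport(
    ChangeKind.BACKEND, {"unit_tests", "end_to_end"}, 1, 0.95
)
verified_and_confident = PullRequestReport(ChangeKind.LINT, {"lint", "run_app"}, 0, 0.97)

print(decide(ui_without_screenshot))   # keep_working
print(decide(verified_but_flagged))    # escalate
print(decide(verified_and_confident))  # auto_merge
```

The ordering is the design choice worth noting: missing verification sends the agent back to work rather than to a human, so escalation is reserved for changes that are verified but still uncertain.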

We are building toward all five. None of them require a smarter model.

## Where we go from here

We will keep running this experiment as our backlog refills. I am sure that the 50% number will change. What we expect to stay constant is the pattern underneath. When our agent fails, it is almost always because we did not tell it enough, did not verify enough, did not let it run long enough, or did not put the right model on the right part of the job. Not because any one model could not do the work.

If you are thinking about handing your backlog to an agent: the question is not which agent is smartest. It is which one comes with a harness that is built for the work.