Let the Agent Merge
We still have room to speed up, but that speed is stuck behind human approval. We keep the approval step because we haven’t automated alignment yet.
I’ve been thinking about how teams build software once AI can write whole features well. We already ship more code than ever, but reviewing it feels more and more like theater. Not because people stopped caring. Because review is aimed at the wrong thing.
The old rules made sense: small PRs, a human in the loop, careful line-by-line review. They made sense because humans were the bottleneck in turning product intent into code. With good coding models, that’s less and less true. The implementation isn’t the fragile part anymore. The fragile part is whether the product intent was clear in the first place.
So I don’t think the right question for a PR is “did a human inspect this diff?” That question is becoming theater. The better question is whether the change is aligned on three axes: product alignment, architectural sanity, and release risk.
Product alignment means the change actually does what we wanted. It matches the user journey, preserves the existing behaviors that matter, handles the weird cases, and moves the system from the old state to the new one in a way we can explain.
Architectural sanity means the change fits the system. It respects the boundaries we care about, follows the tradeoffs we chose, and doesn’t make the next version harder for no reason.
Release risk means the change has enough proof around it: tests, migrations, observability, rollback paths, security and privacy checks, and known failure modes.
That is the real review. Not “a human clicked approve after scrolling through files.” The system should prove the change is aligned, and when confidence is high across those axes, the agent should approve and merge.
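As a rough sketch of that gate: assume some upstream checks can score each axis between 0 and 1 (producing those scores is the hard part and isn’t shown here). The names AxisScores and decide, and the 0.9 threshold, are illustrative assumptions, not an existing tool. The point is only that the weakest axis decides, so a strong average can’t hide one weak axis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisScores:
    """Hypothetical confidence scores in [0, 1], one per review axis."""
    product_alignment: float
    architectural_sanity: float
    release_risk: float  # confidence that the proof around the change is sufficient


def decide(scores: AxisScores, threshold: float = 0.9) -> str:
    """Return "merge" only when every axis clears the threshold.

    Otherwise name the weakest axis so a human knows where to look.
    """
    values = {
        "product_alignment": scores.product_alignment,
        "architectural_sanity": scores.architectural_sanity,
        "release_risk": scores.release_risk,
    }
    for name, value in values.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")

    # The weakest axis decides the outcome, not the average.
    weakest = min(values, key=values.get)
    if values[weakest] >= threshold:
        return "merge"
    return f"escalate: {weakest}"


if __name__ == "__main__":
    print(decide(AxisScores(0.95, 0.97, 0.92)))
    print(decide(AxisScores(0.95, 0.60, 0.92)))
```

Where the threshold sits is the risk policy, and that stays a human decision.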
A lot of this can be automated if the company has the right records. Product tickets, design notes, ADRs, UX examples, prior behavior, incidents, support threads, existing code patterns: all of it is review material. If those records are current, an agent can often review a PR better than a tired human looking at a GitHub diff. If they’re missing or stale, that’s not a reason to keep manual review forever. It’s a sign the product isn’t well communicated.
There is a popular belief, though, that some domains are too sensitive for full automation: billing, permissions, security, customer data. I don’t buy that as a general rule. The problem with billing isn’t that agents can’t write safe billing code. It’s that billing is hard to define.
Billing has too many states, exceptions, migrations, customer promises, Stripe details, internal assumptions, and old weird decisions. You can’t define the ideal billing system in one sitting and hand it to an agent. But that isn’t an argument for humans reviewing every billing diff. It’s an argument for making the product slice smaller.
Build baby-billing v1, then v2, then v3. Each version should be small enough that we can define it thoroughly. What changes? What must not change? What happens to existing customers? What happens on failure? What do we log? What do we test? What’s the rollback? What would make us stop the rollout?
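One way to make “small enough to define thoroughly” checkable is to treat those questions as required fields. This is a minimal sketch, assuming the answers live in free text; SliceDefinition and unanswered are hypothetical names, and the sample answers are invented for illustration. A slice with unanswered questions isn’t ready to hand to an agent.

```python
from dataclasses import dataclass, fields


@dataclass
class SliceDefinition:
    """One field per question above; an empty string means unanswered."""
    what_changes: str = ""
    what_must_not_change: str = ""
    existing_customers: str = ""
    on_failure: str = ""
    logging: str = ""
    tests: str = ""
    rollback: str = ""
    stop_rollout_when: str = ""


def unanswered(definition: SliceDefinition) -> list[str]:
    """Return the names of questions that still have no answer."""
    return [
        f.name
        for f in fields(definition)
        if not getattr(definition, f.name).strip()
    ]


if __name__ == "__main__":
    # Invented example content, only to show the shape of a partial definition.
    billing_v1 = SliceDefinition(
        what_changes="New customers can pay for one plan by card.",
        what_must_not_change="Existing invoices are not touched.",
        rollback="Disable the feature flag.",
    )
    gaps = unanswered(billing_v1)
    print("ready" if not gaps else f"not ready, missing: {gaps}")
```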
Once the slice is that clear, I trust the agent to implement it. The sensitivity of the domain doesn’t mean we need more humans staring at code. It means we need sharper definition, stronger invariants, and better proof.
This is the part I didn’t fully see before. For years we talked about small PRs and modular systems as ways to make code easier to reason about. That was true, but incomplete. Smallness matters because it makes product intent easier to reason about. Large products are hard not because they contain many files, but because they contain many hidden promises.
Think about this flow: we go from idea → … → code. Every arrow in the middle is an explanation of what we’re building, and each one adds assumptions, because it builds on a previous stage that doesn’t cover every possible scenario. Even the original idea is probably incomplete. My claim is that engineers used to confirm those assumptions with stakeholders before coding, and with coding agents that step is gone. Product passes specs to engineering, which passes them to a bot. The bot doesn’t push back on ambiguities; it fills them with its own assumptions and builds. That’s why we’re scared to touch billing.
So the job changes. We should spend less time babysitting code and more time communicating the product.
That means keeping the product change small, and not just the diff small. A tiny diff in a vague product area can still be dangerous. A larger diff that implements a sharply defined product slice can be safer.
It means writing more about the user and system journey: before and after states, affected users, edge cases, failure modes, invariants, and what must never happen. Then we should use AI to attack that definition before we start building. Ask what’s missing. Ask what could be misunderstood. Ask what an old customer might experience. Ask what hidden assumptions the codebase probably has.
It also means creating master artifacts for coding agents. An agent shouldn’t start from a Linear ticket and vibes. It should get the product notes, design notes, ADRs, code, old behavior, new behavior, examples, tests, and risks. The artifact’s job is to teach the agent the product delta.
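A small sketch of what checking such an artifact could look like before the agent starts, assuming each record carries a last-updated date. Record, artifact_gaps, and the 90-day staleness window are assumptions made for illustration; the section names come from the list above.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Sections the artifact is expected to contain, taken from the list above.
REQUIRED_SECTIONS = (
    "product notes", "design notes", "ADRs", "code", "old behavior",
    "new behavior", "examples", "tests", "risks",
)


@dataclass(frozen=True)
class Record:
    """Hypothetical unit of review material."""
    section: str
    content: str
    last_updated: date


def artifact_gaps(records, today, max_age=timedelta(days=90)):
    """Return (missing, stale) section names, in REQUIRED_SECTIONS order.

    A section with empty content counts as missing. The 90-day window is an
    arbitrary assumption; the right value depends on how fast the product moves.
    """
    by_section = {r.section: r for r in records if r.content.strip()}
    missing = [s for s in REQUIRED_SECTIONS if s not in by_section]
    stale = [
        s for s in REQUIRED_SECTIONS
        if s in by_section and today - by_section[s].last_updated > max_age
    ]
    return missing, stale
```

Missing or stale sections aren’t a reason to fall back to line-by-line review. They point at the part of the product that hasn’t been communicated yet.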
Then the agent should build, write product-led tests, self-review, and approve when confidence is high. Humans should stay where their judgement matters: defining the product change, reviewing the architecture, and deciding the risk policy. They shouldn’t be the final checkpoint on every diff.
This is already how I prefer to work. Most of my time goes into understanding the feature, shrinking it, and refining the definition. The agent does the coding and much of the review. The remaining bottleneck is the ritual of asking a human to approve code they aren’t really reviewing.
I don’t think the future is AI writes code and humans review code.
I think the future is humans define the product, and agents implement it and prove alignment. When the proof is strong enough, the agent should merge.
The exciting part is that this is a good problem to have. We’re already seeing a huge step-change in output, and the next question is whether we’re at capacity. My proposal isn’t to take judgement out of the process. It’s to move judgement to where it has more leverage: defining the product change clearly, shaping the architecture, and deciding what proof we need before something lands. If we get that loop right, our dev speed could see a second explosion.
