Why I Still Read the PR

The question I keep getting from teams that have leaned hard into AI tooling is some version of: do humans still need to review pull requests? Claude can write the code. CodeRabbit can review it. Graphite Diamond can flag the style drift. The reflex answer — the one I almost gave the first few times — is that yes, of course we still review, because we don't trust the bot all the way yet. That answer is wrong twice over. It underestimates what the bot is already good at, and it misses what review was actually doing.

I sat with it longer than I expected. The PR reviews that I'd point to in retrospect as the reason a codebase stayed coherent were almost never the ones where I found a bug. They were the ones where I noticed, three commits in, that someone on the team had already solved the caching problem in a way the rest of us didn't know about. Or that we now had two slightly different ways to call the same auth API. Or that the helper this PR was rebuilding had been merged last Tuesday under a different name. None of those are defect catches. They're alignment catches. And the bot doesn't make them. Or rather, it makes some of them, but the wrong half.

What 14% Actually Means

The clearest evidence I've seen on this is also one of the oldest. Bacchelli and Bird, working with Microsoft Research, hand-classified 570 review comments across 200 review threads and surveyed 873 engineers about what reviews were for. The headline: only 14% of comments concerned defects. The other 86% spread across code improvements, alternative solutions, knowledge transfer, team awareness, and shared ownership. Their own framing was that reviews are "less about defects than expected and instead provide additional benefits such as knowledge transfer, increased team awareness, and creation of alternative solutions to problems" (ICSE 2013).

Two years later, a different Microsoft team replicated the finding and put the conclusion in the title of the paper: "Code Reviews Do Not Find Bugs". Roughly 15% of comments indicate a possible defect; at least 50% concern long-term maintainability. Two separate studies, two years apart, on one of the most-reviewed codebases on earth, both landing in the same place.

Defect-catching was always the cover story. Team-shaping was always the work.

If that's true — and the data is hard to argue with — then "will AI replace code review" was always pointing at the smaller, more replaceable 14% of what review is.

Knowledge Is a Wall You Build Brick by Brick

The Microsoft data turns from interesting to decisive when you look at the second paper's other finding. Czerwonka and colleagues reported what the bug-catching frame never bothered to measure: how the usefulness of review comments changes as the reviewer becomes familiar with a part of the codebase. A reviewer with no prior exposure produced comments the author rated useful 33% of the time. By the third review of the same area, usefulness climbed to roughly 67%. Familiarity, not eyeball-count, is the limiting reagent.

Sadowski and colleagues found the same compounding curve at Google, across nine million reviews. The average number of comments an engineer's changes receive drops by more than half between the author's first and fifth year at the company: not because reviewers stop looking, but because the author has absorbed enough of the codebase and its conventions that there is less to flag (ICSE-SEIP 2018). The same paper traces the origin of code review at Google to a single engineer who introduced it specifically to "force developers to write code that other developers could understand", not to catch bugs.

This is what I mean when I say knowledge in a team is built brick by brick. Each PR you read lays one more. The auth quirk you didn't know about. The helper you forgot existed. The convention the team quietly adopted while you were heads-down on something else. Lave and Wenger called this legitimate peripheral participation — newcomers in a community of practice move from the edge to the center by watching real work get done, not by reading a handbook. Code review is the canonical software-team version of that mechanic.

A useful test: if you can describe what your team's auth layer looks like today without having read any of its recent PRs, you're probably wrong. The wall is only the bricks you actually saw laid.

Yes, the Bot Will Eat Some of This

I want to be honest about where this gets uncomfortable. AI reviewers in 2026 are not what they were in 2024. CodeRabbit's own engineering write-up describes a semantic index over functions, classes, tests, and prior PRs, plus co-change maps built from commit history. Tools like this will, in fact, sometimes flag that a new helper duplicates one shipped three weeks ago. Graphite Diamond markets itself as codebase-aware beyond the diff. The easy bricks, the "did you read the style guide" bricks and the "we already have a function for this" bricks, are increasingly being laid by software.
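To make "co-change map" concrete, here is a minimal sketch of the general idea in Python. It is not CodeRabbit's implementation, which I haven't seen. It assumes the commit history has already been reduced to a list of file paths per commit, and the function names and the sample history are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cochange_map(commits):
    """Count how often each pair of files changed in the same commit."""
    pair_counts = Counter()
    for files in commits:
        # Sort and de-duplicate so (a, b) and (b, a) land on the same key.
        for pair in combinations(sorted(set(files)), 2):
            pair_counts[pair] += 1
    return pair_counts


def likely_companions(pair_counts, path, min_count=2):
    """Files that changed alongside `path` at least `min_count` times,
    most frequent first (ties broken alphabetically)."""
    hits = []
    for (a, b), n in pair_counts.items():
        if n >= min_count and path in (a, b):
            hits.append((b if a == path else a, n))
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Hypothetical history: each entry is the list of files one commit touched.
history = [
    ["auth/session.py", "auth/tokens.py"],
    ["auth/session.py", "auth/tokens.py", "docs/auth.md"],
    ["cache/store.py"],
    ["auth/tokens.py", "cache/store.py"],
]

cochange = build_cochange_map(history)
print(likely_companions(cochange, "auth/session.py"))
# [('auth/tokens.py', 2)]
```

A reviewer, human or otherwise, could use a map like this to ask why a PR touches auth/session.py but leaves auth/tokens.py alone. That is exactly the kind of brick software can lay cheaply.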

There's a serious counter-argument from people I take seriously, and I want to engage it directly. Dave Farley has argued for years that the pull request is the wrong unit of review — late, asynchronous, and downstream of where the design decisions actually happen. Martin Fowler cites a team that spent 130,000 hours in 2020 waiting on PRs, 91% of which received no substantive comment. Kent Beck's original argument for pair programming was essentially: if review is good, do it continuously while writing — not after.

They're right that some of what we do in PR review should have happened earlier — in a design doc, an RFC, a pairing session. The alignment problem doesn't have to wait for the diff. But the diff is where the design doc meets the codebase, and that contact surface keeps surfacing things the doc didn't predict. The bot makes that contact surface cheaper to inspect. It doesn't replace the inspection.

Why I Still Read the PR

What the bot is good at is the artifact. What it's not good at, and what I'm not sure it can be good at, structurally, is the experience of having reviewed. When the bot flags that this PR duplicates a helper from three weeks ago, the helper still gets de-duplicated. But I don't update my internal model of the codebase. I don't carry away the small jolt of recognition that reading the code myself would have lodged in memory. The bug gets fixed. The team doesn't get any smarter.

That's the part I think is hard to outsource without losing something real. The alignment isn't in the comment thread. It's in the heads of the engineers who read the diff. A team that lets the bot handle 100% of review will produce a clean codebase staffed by people who don't know what's in it. Six months later the second-order effects show up — conflicting design choices in adjacent modules, three reimplementations of the same retry helper, an architectural drift nobody saw because nobody was reading.

So I've stopped framing it as "should we still review PRs?" The honest framing is: where in the team's day does alignment actually happen, and how much of it can we afford to delegate? Rouan Wilsenach's Ship / Show / Ask framing, published on Martin Fowler's site, is a saner starting point than the all-or-nothing version of the debate. Most changes can ship without review. Some can be shown after the fact. A smaller set need a human pair of eyes before they merge, and not because the bot will miss the bug, but because the team needs to lay that brick consciously.

A year of working with AI inside review loops has sharpened one belief for me: the question was never whether the machine can find a defect faster than I can. It usually can. The question is whether the team I work with still knows the codebase it owns. PR review, even in the age of bots that read it first, is still the place where that knowledge gets built — one brick at a time, in the heads of the people doing the reading.

The bug got fixed by the bot. The team got built by the engineer who read the PR anyway.