The Problem with AI Code Review

AI code review is getting good at finding the wrong things. We ran it on three real PRs and compared the flags to what our most experienced reviewer actually cared about.

AI code review has gotten genuinely competent at certain things. It will catch undefined variable references, flag obvious security anti-patterns, and notice when you're importing something you're not using. These are not the things that slow down real code reviews.

We ran an experiment over four weeks on three closed PRs from a production TypeScript service. We ran each PR through two AI review tools (names withheld, not for diplomatic reasons but because the specific tools will change faster than this analysis) and compared their flags to what our most experienced reviewer actually commented on.

What the Senior Reviewer Cared About

The senior reviewer's comments clustered around three things: data model decisions, error handling philosophy, and the surface area of new abstractions. Not syntax. Not naming conventions (beyond functional clarity). Not import order.

In PR #1 — a new webhook endpoint — the senior reviewer's key comment was that we were persisting the raw webhook payload instead of a normalized version, and that this was going to cause a schema migration in three months when the vendor changed their format. This took two sentences and saved us real work.
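The PR's actual code is not reproduced here, so the following is only a minimal sketch of what "normalized instead of raw" could look like. The payload shape, field names, and the normalizeWebhook function are all hypothetical and chosen purely for illustration.

```typescript
// Hypothetical vendor payload; these field names are illustrative assumptions,
// not the real vendor's format.
interface VendorWebhookPayload {
  evt_id: string;
  evt_type: string;
  ts: number; // assumed to be seconds since the Unix epoch
  data?: { customer_ref?: string };
}

// Internal, vendor-independent record. This is what gets persisted,
// so a change in the vendor's format does not leak into the schema.
interface NormalizedEvent {
  eventId: string;
  kind: string;
  occurredAt: string; // ISO 8601 timestamp
  customerId: string | null;
}

// Translate the vendor shape into the internal shape at the boundary.
function normalizeWebhook(raw: VendorWebhookPayload): NormalizedEvent {
  return {
    eventId: raw.evt_id,
    kind: raw.evt_type,
    occurredAt: new Date(raw.ts * 1000).toISOString(),
    customerId: raw.data?.customer_ref ?? null,
  };
}
```

With a shape like this, a vendor format change would be absorbed inside normalizeWebhook rather than forcing a migration of stored rows.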

What the AI Reviewers Found

Both AI tools flagged: missing input validation (correct, but already handled by middleware that the PR touched and the tools did not read), a variable named data that could be more specific (correct, not important), and a missing null check on an optional field (correct, worth fixing).

Neither flagged the persistence decision. Neither noticed the vendor payload structure. Neither could, because understanding why this decision was bad required knowing things that weren't in the diff.

The Pattern

This held across all three PRs. The AI tools were reliable at local correctness — things verifiable within the diff or close to it. The senior reviewer was irreplaceable at systemic correctness — things that required understanding the system the code was joining.

This is not a surprising finding. But it has a practical implication that I think is often missed: AI code review doesn't make senior reviewers less necessary, it makes them more focused. If an AI catches the null check, the senior reviewer doesn't have to. That's good. But if teams treat AI approval as a substitute for senior review, they're optimizing for local correctness and leaving systemic correctness unreviewed.

What We Actually Changed

We use AI review as a first pass now: it is automated and runs on every PR. It catches the mechanical stuff. Senior review is triggered by PRs that touch the data model, introduce a new abstraction boundary, or touch auth or payments. Everything else gets async review from a peer, which has always been faster than a senior-review gate.
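As a rough illustration of that gating rule, here is a minimal sketch. It is not our actual tooling: the path patterns, the "new-abstraction" label, and the reviewTier function are all hypothetical. In particular, a new abstraction boundary cannot be detected from file paths, so this sketch assumes the PR author declares it with a label.

```typescript
type ReviewTier = "senior" | "peer";

interface PullRequestInfo {
  changedPaths: string[];
  labels: string[];
}

// Hypothetical path patterns for data model, auth, and payments code.
// A real repository would substitute its own layout.
const SENIOR_PATH_PATTERNS: RegExp[] = [
  /^src\/models\//,
  /^migrations\//,
  /^src\/(auth|payments)\//,
];

// Hypothetical label an author applies when a PR adds a new abstraction boundary.
const NEW_ABSTRACTION_LABEL = "new-abstraction";

// Decide which human review a PR needs after the automated AI first pass.
function reviewTier(pr: PullRequestInfo): ReviewTier {
  const touchesSensitivePath = pr.changedPaths.some((path) =>
    SENIOR_PATH_PATTERNS.some((pattern) => pattern.test(path)),
  );
  const declaresNewAbstraction = pr.labels.some(
    (label) => label === NEW_ABSTRACTION_LABEL,
  );
  return touchesSensitivePath || declaresNewAbstraction ? "senior" : "peer";
}
```

The point of writing the rule down as code is that the gate becomes explicit and reviewable, instead of living in individual judgment about which PRs "feel" important.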

The result: senior reviewers spend less time on small PRs. They are more available for the ones that matter. Senior review latency went down because there are fewer queued items.

That's the actual win. Not "AI replaced code review." AI changed what code review is gated on.

One More Thing

The best AI review flag across all three PRs concerned one of our try/catch blocks. It caught Error and logged it, but the error was then swallowed: the function returned a default value as if nothing had happened. The AI tool flagged this as "potential silent failure." It was right. Our senior reviewer had not caught it.
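The flagged code itself is not shown here, but the pattern is easy to reconstruct. The sketch below uses a hypothetical config parser (RetryConfig, DEFAULT_CONFIG, and both function names are invented for illustration) to show the swallowed error, followed by one possible alternative that makes the failure visible to the caller. Rethrowing would be another reasonable fix.

```typescript
interface RetryConfig {
  maxRetries: number;
}

const DEFAULT_CONFIG: RetryConfig = { maxRetries: 3 };

// The pattern that was flagged: the error is logged, then a default is
// returned, so the caller cannot tell a fallback from a real parsed value.
function parseConfigSwallowing(json: string): RetryConfig {
  try {
    return JSON.parse(json) as RetryConfig;
  } catch (err) {
    console.error("config parse failed", err);
    return DEFAULT_CONFIG;
  }
}

// One alternative: encode success or failure in the return type so the
// caller has to decide explicitly what to do when parsing fails.
type Result<T> = { ok: true; value: T } | { ok: false; error: Error };

function parseConfig(json: string): Result<RetryConfig> {
  try {
    return { ok: true, value: JSON.parse(json) as RetryConfig };
  } catch (err) {
    // Caught values are not guaranteed to be Error instances.
    const error = err instanceof Error ? err : new Error(String(err));
    return { ok: false, error };
  }
}
```

A caller of parseConfig can still choose to fall back to DEFAULT_CONFIG, but the fallback is then a visible decision at the call site rather than a hidden one inside the parser.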

So: it's not useless. It's just not what's advertised.