AI code review

AI code review uses large language models (and, increasingly, multi-agent pipelines) to review pull requests for bugs, security issues, performance regressions, readability, and style. It runs automatically on every PR, typically in 30-90 seconds.

What 'AI code review' actually does

AI code review tools sit between a PR being opened and a human reviewer reading it. When the PR opens, the tool fetches the diff, sends it to an LLM (often with relevant context — function bodies the diff references, conventions used elsewhere in the repo, recent PR summaries), and posts a review verdict with inline comments anchored to specific lines.

The output is structured: an overall verdict (approve / request_changes / comment), a short summary explaining the decision, and inline comments mapped to (file, line). Most tools also report a confidence score so humans know when to second-guess.
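
The structure described above could be modelled roughly as follows. This is only a sketch: the type and field names (`ReviewResult`, `InlineComment`, a `confidence` value on a 0 to 1 scale) and the `validateReview` helper are assumptions for illustration, not any particular tool's schema.

```typescript
// Hypothetical shape of a review result; all names here are illustrative.
type Verdict = "approve" | "request_changes" | "comment";

interface InlineComment {
  file: string; // path of the file within the repo
  line: number; // 1-based line the comment is anchored to
  body: string; // comment text shown to the reviewer
}

interface ReviewResult {
  verdict: Verdict;
  summary: string;
  confidence: number; // assumed scale: 0 (guess) to 1 (certain)
  comments: InlineComment[];
}

// Returns a list of problems; an empty list means the result is well-formed.
function validateReview(result: ReviewResult): string[] {
  const problems: string[] = [];
  if (result.summary.trim() === "") problems.push("summary is empty");
  if (result.confidence < 0 || result.confidence > 1) {
    problems.push("confidence must be between 0 and 1");
  }
  for (const comment of result.comments) {
    if (!Number.isInteger(comment.line) || comment.line < 1) {
      problems.push(`invalid line ${comment.line} in ${comment.file}`);
    }
  }
  return problems;
}

const example: ReviewResult = {
  verdict: "request_changes",
  summary: "Possible null dereference in the profile handler.",
  confidence: 0.9,
  comments: [{ file: "src/profile.ts", line: 42, body: "`user` may be null here." }],
};

console.log(validateReview(example));
```

Validating the model's output before posting it is one way to avoid anchoring comments to lines that do not exist.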

Some tools post the review and move on. Better tools integrate with GitHub Check Runs so branch-protection rules can refuse merges based on the verdict. A few add real-time streaming of the review process so reviewers can watch the agents work.
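
As a rough sketch of the Check Runs integration, the verdict can be mapped to a check-run conclusion. The payload fields used here (`name`, `head_sha`, `status`, `conclusion`, `output.title`, `output.summary`) follow GitHub's "Create a check run" REST endpoint as far as I know it. The verdict-to-conclusion mapping and the check name `ai-code-review` are assumptions, and the HTTP call itself is left out.

```typescript
type Verdict = "approve" | "request_changes" | "comment";
type CheckConclusion = "success" | "failure" | "neutral";

interface CheckRunPayload {
  name: string;
  head_sha: string;
  status: "completed";
  conclusion: CheckConclusion;
  output: { title: string; summary: string };
}

// Assumed mapping: only request_changes produces a failing check.
function toCheckConclusion(verdict: Verdict): CheckConclusion {
  switch (verdict) {
    case "approve":
      return "success";
    case "request_changes":
      return "failure";
    case "comment":
      return "neutral";
  }
}

// Builds the request body; sending it to GitHub is out of scope here.
function buildCheckRun(headSha: string, verdict: Verdict, summary: string): CheckRunPayload {
  return {
    name: "ai-code-review", // hypothetical check name
    head_sha: headSha,
    status: "completed",
    conclusion: toCheckConclusion(verdict),
    output: { title: `AI review: ${verdict}`, summary },
  };
}
```

If the check is marked as required in branch protection, a `failure` conclusion is what blocks the merge.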

Single-agent vs multi-agent architectures

Single-agent: one big LLM prompt covers everything (bugs + security + style + perf + docs in one pass). Simpler to build. Cheaper per review. Quality ceiling is whatever fits in one prompt — and a generalist prompt asking for everything tends to find less of each thing.

Multi-agent: each concern gets its own specialist agent — Bugs, Security, Performance, Readability, Best-practices, Documentation. Each runs in parallel with its own system prompt + finding schema. A separate synthesizer reads all six reports and produces a unified verdict.

The multi-agent trade-off: six specialist calls plus a synthesis call per review instead of one, so cost is higher. The win: each agent focuses attention on one lens. Specialists outperform generalists when the failure modes are diverse, and code review has very diverse failure modes.
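
The fan-out and synthesis steps can be sketched as below. The specialists are stand-ins that match a fixed pattern rather than call a model, only two of the six lenses are shown, and the names `stubAgent`, `synthesize`, and `reviewDiff` as well as the verdict rule are assumptions made for this example.

```typescript
type Severity = "low" | "high";
type Verdict = "approve" | "request_changes" | "comment";

interface Finding {
  lens: string;
  line: number;
  message: string;
  severity: Severity;
}

type Agent = (diff: string) => Promise<Finding[]>;

// Stand-in for an LLM-backed specialist. A real agent would send the diff and
// a lens-specific system prompt to a model; this one matches a fixed pattern.
function stubAgent(lens: string, pattern: RegExp, message: string, severity: Severity): Agent {
  return async (diff) => {
    const findings: Finding[] = [];
    diff.split("\n").forEach((text, index) => {
      if (pattern.test(text)) findings.push({ lens, line: index + 1, message, severity });
    });
    return findings;
  };
}

// Synthesizer: merges every specialist report into one verdict.
// Assumed rule: any high-severity finding requests changes.
function synthesize(reports: Finding[][]): { verdict: Verdict; findings: Finding[] } {
  const findings = ([] as Finding[]).concat(...reports);
  let verdict: Verdict = "approve";
  if (findings.length > 0) verdict = "comment";
  if (findings.some((f) => f.severity === "high")) verdict = "request_changes";
  return { verdict, findings };
}

async function reviewDiff(diff: string, agents: Agent[]) {
  // Fan out: all specialists run concurrently on the same diff.
  const reports = await Promise.all(agents.map((agent) => agent(diff)));
  return synthesize(reports);
}

const demoAgents: Agent[] = [
  stubAgent("security", /SELECT .*\+/, "Query built by string concatenation", "high"),
  stubAgent("readability", /console\.log/, "Leftover debug logging", "low"),
];

reviewDiff("const q = 'SELECT * FROM users WHERE id = ' + id;\nconsole.log(q);", demoAgents)
  .then((result) => console.log(result.verdict, result.findings.length));
```

Because the specialists run concurrently, wall-clock time is set by the slowest agent plus the synthesis step, while token cost is the sum of all calls.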

Why context matters more than model size

A frontier model (GPT-4o, Claude Opus, Gemini Pro) reviewing 100 lines of diff with no other context will catch the obvious bugs but miss the ones that depend on the rest of the codebase.

Example: a PR adds `(await fetchUser(id)).email`. Without context, the AI sees nothing concerning. With context (the indexer tells the agent that `fetchUser` resolves to `User | null`), the agent flags the null deref.
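
A minimal sketch of that null case follows. The `User` shape, the in-memory `users` table, and the `getEmail` helper are hypothetical; only the `fetchUser` name and its nullable result come from the example above.

```typescript
interface User {
  id: string;
  email: string;
}

// Hypothetical in-memory store standing in for a database.
const users: Record<string, User | undefined> = {
  u1: { id: "u1", email: "ada@example.com" },
};

// Resolves to null when no user exists for the id.
async function fetchUser(id: string): Promise<User | null> {
  return users[id] ?? null;
}

// The flagged expression would be:
//   (await fetchUser(id)).email
// With `strictNullChecks` the compiler rejects it; without that option it
// throws a TypeError at runtime whenever the id is unknown.

// One possible fix a reviewer might suggest: handle the null case explicitly.
async function getEmail(id: string): Promise<string | undefined> {
  const user = await fetchUser(id);
  return user?.email;
}

getEmail("missing").then((email) => console.log(email));
```

The point is that the diff line alone looks harmless; the risk only becomes visible once the return type of `fetchUser` is in the agent's context.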

Better AI code review tools build a code-context index (tree-sitter symbol graphs plus personalised PageRank retrieval) so the agent sees the relevant slice of the repo, not just the diff. With that index, a mid-tier model on an indexed repo can outperform a frontier model on a non-indexed one.
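
To make the retrieval idea concrete, here is a small sketch of personalised PageRank over a hand-written symbol graph. It assumes the graph has already been extracted (the tree-sitter step is not shown), that edges point from a symbol to the symbols it references, and that the restart distribution is uniform over the symbols the diff touches. The graph, the symbol names, and the `topSymbols` helper are all made up for illustration.

```typescript
// symbol -> symbols it references (hand-written; extraction is not shown)
type SymbolGraph = Record<string, string[]>;

function personalisedPageRank(
  graph: SymbolGraph,
  seeds: string[],
  damping = 0.85,
  iterations = 50
): Record<string, number> {
  const nodes = Object.keys(graph);
  const validSeeds = seeds.filter((s) => s in graph);
  if (validSeeds.length === 0) throw new Error("at least one seed must be in the graph");

  // Restart distribution: all mass on the symbols touched by the diff.
  const restart: Record<string, number> = {};
  for (const n of nodes) restart[n] = validSeeds.indexOf(n) >= 0 ? 1 / validSeeds.length : 0;

  let rank: Record<string, number> = { ...restart };
  for (let i = 0; i < iterations; i++) {
    const next: Record<string, number> = {};
    for (const n of nodes) next[n] = (1 - damping) * restart[n];
    for (const n of nodes) {
      const targets = graph[n].filter((t) => t in graph);
      if (targets.length === 0) {
        // Dangling symbol: hand its mass back to the seeds.
        for (const m of nodes) next[m] += damping * rank[n] * restart[m];
      } else {
        for (const t of targets) next[t] += (damping * rank[n]) / targets.length;
      }
    }
    rank = next;
  }
  return rank;
}

// Picks the k highest-ranked symbols to include as review context.
function topSymbols(rank: Record<string, number>, k: number): string[] {
  return Object.keys(rank).sort((a, b) => rank[b] - rank[a]).slice(0, k);
}

const demoGraph: SymbolGraph = {
  handleRequest: ["fetchUser", "renderProfile"],
  fetchUser: ["dbQuery"],
  renderProfile: ["formatDate"],
  dbQuery: [],
  formatDate: [],
  sendNewsletter: ["formatDate"],
};

console.log(topSymbols(personalisedPageRank(demoGraph, ["handleRequest"]), 3));
```

Seeding with `handleRequest` concentrates rank on what that function reaches, while the unrelated `sendNewsletter` receives none. A production index would probably also follow edges in the reverse direction so that callers of a changed symbol are surfaced.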

Where AI code review excels vs falls short

Excels at:

(1) Naming + readability — flagging unclear variable names, dead code, overly nested control flow. Patterns are diverse but local to the diff.

(2) Common bug patterns — null derefs, missing await, off-by-one. Pattern recognition is exactly what LLMs are good at.

(3) Security smells — SQL injection, XSS, hardcoded secrets. These follow recognisable patterns the model has seen thousands of examples of.

(4) Missing docs — JSDoc on exported functions, stale comments contradicting code.

(5) Style + convention enforcement — language idioms, framework conventions, repo-specific patterns (with custom guidelines).

Falls short at:

(1) Architectural judgement — should this code exist at all? Should this be a class or a function? AI doesn't have product-context.

(2) Cross-cutting concerns spanning many files — refactor opportunities, layering violations.

(3) Business-logic correctness — does this match the requirements? AI doesn't know your requirements.

(4) Subtle race conditions — sometimes catches, often misses. The cost of false negatives is high here.

Treat AI code review as the first pass that filters the bottom 80% of review work, freeing humans to do the architectural judgement layer.

Cost economics — BYOK vs vendor-managed

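A typical 300-line PR review on GPT-4o costs $0.05-$0.15 in API tokens. Multiply by review volume: a team running about 4,000 reviews a month (for instance, 10 engineers at 50 PRs/week each, with every PR reviewed twice as follow-up commits land) is at $200-$600/month in pure token cost.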

Vendor-managed AI code review tools (where the vendor pays the provider, marks up, charges you flat or per-seat) typically add 30-50% margin on top of provider cost. Expect $260-$900/month in that model for the same workload.

BYOK (bring-your-own-key) models give the user direct billing from OpenAI / Anthropic / Google with zero vendor markup on tokens. The SaaS just charges for the orchestration layer (typically ₹399-2000/month flat or per-seat). For high-volume teams, BYOK + flat vendor fee is significantly cheaper than vendor-managed AI.

The per-PR token cost can be tuned aggressively — switching from GPT-4o ($0.10/PR) to Claude Haiku ($0.01/PR) drops cost 10x. Mid-tier models with good context indexing often outperform frontier models without context.
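
The arithmetic in this section can be reproduced with a few lines. The per-review prices and the 30-50% margin come from the text; the figure of 4,000 reviews a month is an assumption, chosen because it is the volume at which $0.05-$0.15 per review yields the quoted $200-$600. The helper names are made up for this sketch.

```typescript
interface CostRange {
  low: number;
  high: number;
}

// Token cost per month for a given per-review price range.
function monthlyTokenCost(perReview: CostRange, reviewsPerMonth: number): CostRange {
  return {
    low: perReview.low * reviewsPerMonth,
    high: perReview.high * reviewsPerMonth,
  };
}

// Vendor-managed pricing: provider cost plus the vendor's margin.
function withVendorMargin(cost: CostRange, margin: CostRange): CostRange {
  return {
    low: cost.low * (1 + margin.low),
    high: cost.high * (1 + margin.high),
  };
}

const reviewsPerMonth = 4000; // assumption: volume implied by the quoted totals

const byok = monthlyTokenCost({ low: 0.05, high: 0.15 }, reviewsPerMonth);
const managed = withVendorMargin(byok, { low: 0.3, high: 0.5 });

// Effect of switching models at the same volume ($0.10/PR vs $0.01/PR).
const frontierModel = 0.1 * reviewsPerMonth;
const smallModel = 0.01 * reviewsPerMonth;

console.log(byok, managed, frontierModel.toFixed(2), smallModel.toFixed(2));
```

The BYOK figure excludes the flat orchestration fee, which is charged on top of token cost.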

FAQs

Does AI code review replace human reviewers?

No. It replaces the bottom 80% of review work: stylistic nits, naming, obvious bugs, common security smells, missing docs. Humans focus on architecture, trade-offs, business-logic correctness, and mentorship. The math is leverage: senior reviewers' time is among the most expensive resources in an engineering org, and AI catches what doesn't need it.

How accurate is AI code review?

On well-known bug patterns (null derefs, missing await, SQL injection, XSS): 80-90% precision typical for frontier models with context. On novel logic bugs: significantly lower. The right framing isn't 'is it accurate?' but 'is it accurate enough that humans benefit from the first pass?' — which it generally is.

Can AI code review create false positives that annoy developers?

Yes. The most common failure mode is over-commenting on trivial stylistic stuff. Better tools surface a confidence score and limit inline comments to high-confidence findings; lower-confidence stuff goes into the summary, not as inline noise. Per-repo focus areas (let the team pick which lenses to enable) help reduce false-positive volume.
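
One way such gating might look is sketched below. The 0.8 threshold, the 0 to 1 confidence scale, the lens names, and the `routeFindings` helper are assumptions for illustration rather than values any tool is known to use.

```typescript
interface Finding {
  lens: string; // e.g. "bugs", "security", "readability"
  file: string;
  line: number;
  message: string;
  confidence: number; // assumed scale: 0 to 1
}

interface RoutedFindings {
  inline: Finding[]; // posted as inline comments
  summaryOnly: Finding[]; // mentioned in the summary only
}

function routeFindings(
  findings: Finding[],
  enabledLenses: string[],
  inlineThreshold = 0.8
): RoutedFindings {
  const routed: RoutedFindings = { inline: [], summaryOnly: [] };
  for (const finding of findings) {
    // Skip lenses the team has switched off for this repo.
    if (enabledLenses.indexOf(finding.lens) === -1) continue;
    if (finding.confidence >= inlineThreshold) routed.inline.push(finding);
    else routed.summaryOnly.push(finding);
  }
  return routed;
}

const sampleFindings: Finding[] = [
  { lens: "bugs", file: "src/user.ts", line: 12, message: "Possible null dereference", confidence: 0.93 },
  { lens: "readability", file: "src/user.ts", line: 30, message: "Consider renaming `tmp`", confidence: 0.55 },
  { lens: "documentation", file: "src/user.ts", line: 1, message: "Missing JSDoc", confidence: 0.9 },
];

console.log(routeFindings(sampleFindings, ["bugs", "readability"]));
```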

What's the speed expectation?

30-90 seconds end-to-end for a typical PR (300-500 lines, frontier model). Review time for larger PRs scales roughly linearly until the diff exceeds the model's context window, at which point the tool chunks the diff. Real-time streaming via Socket.IO lets reviewers watch the agents complete rather than stare at an opaque spinner.
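
Chunking could work along these lines. This sketch assumes a unified diff with `diff --git` file headers, splits only on file boundaries, and estimates tokens as characters divided by four, which is a crude heuristic rather than a real tokenizer. The function names are hypothetical.

```typescript
// Splits a unified diff into one string per file, using "diff --git" headers.
function splitByFile(diff: string): string[] {
  const parts: string[] = [];
  let current: string[] = [];
  for (const line of diff.split("\n")) {
    if (line.indexOf("diff --git ") === 0 && current.length > 0) {
      parts.push(current.join("\n"));
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) parts.push(current.join("\n"));
  return parts;
}

// Rough estimate only: about four characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Greedily packs whole files into chunks that stay under the token budget.
// A single file larger than the budget is kept as its own oversized chunk.
function chunkDiff(diff: string, maxTokens: number): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const file of splitByFile(diff)) {
    const candidate = current === "" ? file : current + "\n" + file;
    if (current !== "" && estimateTokens(candidate) > maxTokens) {
      chunks.push(current);
      current = file;
    } else {
      current = candidate;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}
```

Splitting on file boundaries keeps each hunk next to its header, at the cost of losing cross-file context between chunks.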

Does my code get used to train AI models?

Depends on the provider. OpenAI, Anthropic, and Google all offer API and enterprise terms that exclude customer inputs from training, though the details vary by tier, so check the terms that apply to your key. Consumer-tier products such as ChatGPT may use inputs for training by default unless you opt out, so they should never be used for code review. For BYOK setups using API keys, you're under the provider's API terms. Vendor-managed setups depend on the vendor's contract with the provider.

How do I evaluate an AI code review tool?

Connect to a repo with substantial recent PR history. Look at the AI's verdict on 10-20 of those PRs (you already know what the human reviewers found). Compare: did the AI catch what the humans caught? Did it surface anything new? Is the false-positive rate tolerable? Repeat with a different code style / language to test generalization.
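
The comparison step can be tallied with a small script like this one. It assumes findings have been normalised to (PR, file, line) by hand and matches on exact equality, which is a simplification: in practice a person has to judge whether two comments describe the same issue. The `Issue` shape and `compareWithHumans` are hypothetical names.

```typescript
interface Issue {
  pr: number;
  file: string;
  line: number;
  note: string;
}

const issueKey = (issue: Issue): string => `${issue.pr}:${issue.file}:${issue.line}`;

function compareWithHumans(ai: Issue[], human: Issue[]) {
  const humanKeys: Record<string, boolean> = {};
  for (const h of human) humanKeys[issueKey(h)] = true;
  const aiKeys: Record<string, boolean> = {};
  for (const a of ai) aiKeys[issueKey(a)] = true;

  // Human findings the AI also raised, and the ones it missed.
  const caught = human.filter((h) => aiKeys[issueKey(h)] === true);
  const missed = human.filter((h) => aiKeys[issueKey(h)] !== true);
  // AI findings no human raised: either new catches or false positives.
  const unmatched = ai.filter((a) => humanKeys[issueKey(a)] !== true);

  return {
    caught,
    missed,
    unmatched,
    recall: human.length === 0 ? 1 : caught.length / human.length,
  };
}

const humanFindings: Issue[] = [
  { pr: 101, file: "src/user.ts", line: 12, note: "null check missing" },
  { pr: 102, file: "src/api.ts", line: 40, note: "wrong status code" },
];
const aiFindings: Issue[] = [
  { pr: 101, file: "src/user.ts", line: 12, note: "possible null dereference" },
  { pr: 101, file: "src/user.ts", line: 30, note: "unclear variable name" },
  { pr: 103, file: "src/db.ts", line: 7, note: "query built by concatenation" },
];

console.log(compareWithHumans(aiFindings, humanFindings).recall);
```

The `unmatched` list is the one to triage by hand, since it contains both the genuinely new findings and the false positives.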
