LLM code review pipeline

An LLM code review pipeline is the end-to-end system that ingests a GitHub PR webhook, fetches the diff, gathers repo context, runs one or more LLM agents in parallel, synthesizes their outputs, and posts the result back to GitHub as a review with inline comments.

The full pipeline — stage by stage

Stage 1: Webhook receipt. GitHub sends the pull_request event (actions opened, synchronize, reopened, closed) to the review tool's HTTPS endpoint. The endpoint verifies the X-Hub-Signature-256 HMAC and deduplicates by X-GitHub-Delivery so redelivered events don't trigger duplicate reviews.
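
The two checks in Stage 1 could look roughly like the sketch below. It assumes GitHub's documented scheme (an HMAC-SHA256 of the raw request body, hex-encoded and prefixed with sha256=). The function and class names are illustrative, and the in-memory set is a stand-in for a shared store such as Redis with a TTL.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check an X-Hub-Signature-256 header value against the raw request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(expected.encode(), (signature_header or "").encode())


class DeliveryDeduper:
    """In-memory stand-in for a shared store of seen X-GitHub-Delivery IDs."""

    def __init__(self) -> None:
        self._seen = set()

    def first_time(self, delivery_id: str) -> bool:
        """Return True the first time an ID is seen, False on every repeat."""
        if delivery_id in self._seen:
            return False
        self._seen.add(delivery_id)
        return True
```

The signature must be computed over the raw bytes of the request, before any JSON parsing, or the digest will not match.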

Stage 2: Gating. Before enqueuing work, the tool checks billing quota, repo activation state, user consent (if applicable for legal/compliance reasons), and rate limits. Atomic reservation on the quota counter prevents two simultaneous PRs from double-spending the same slot.
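
A minimal sketch of the atomic reservation idea follows. It uses an in-process lock purely for illustration; in a multi-worker deployment the same check-and-increment would have to happen atomically in the shared store (for example a Redis script or a conditional SQL UPDATE). QuotaCounter and its methods are hypothetical names.

```python
import threading


class QuotaCounter:
    """In-process stand-in for an atomic quota counter in a shared store."""

    def __init__(self, limit: int) -> None:
        self._limit = limit
        self._used = 0
        self._lock = threading.Lock()

    def try_reserve(self) -> bool:
        # Check and increment under one lock so two callers
        # can never both take the last remaining slot.
        with self._lock:
            if self._used >= self._limit:
                return False
            self._used += 1
            return True

    def release(self) -> None:
        # Give the slot back if the review is abandoned before it runs.
        with self._lock:
            self._used = max(0, self._used - 1)
```

The point of the design is that the check and the increment are one indivisible step; a separate "read quota, then write quota" sequence is what allows double-spending.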

Stage 3: Diff parsing. A worker fetches the PR diff using a GitHub App installation token. Acquiring the token has its own 15-second timeout and retry semantics for resilience against GitHub API blips. The worker then parses the unified diff into per-file hunks with line numbers.
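
To illustrate the parsing half of Stage 3, here is a small sketch that walks a git-style unified diff and records the new-file line number of every added line. It is a simplified parser written for this example (it ignores renames, binary files, and mode changes); Hunk and parse_unified_diff are hypothetical names.

```python
import re
from dataclasses import dataclass, field

HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@")


@dataclass
class Hunk:
    old_start: int
    new_start: int
    # (new-file line number, text) for every added line in the hunk
    added: list = field(default_factory=list)


def parse_unified_diff(diff_text: str) -> dict:
    """Map each changed file path to its list of hunks."""
    files = {}
    current_file = None
    hunk = None
    new_line = 0
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            current_file, hunk = None, None  # a new file section begins
            continue
        if hunk is None and line.startswith("+++ "):
            path = line[4:]
            if path == "/dev/null":  # deleted file: nothing to comment on
                current_file = None
            else:
                current_file = path[2:] if path.startswith("b/") else path
                files[current_file] = []
            continue
        match = HUNK_HEADER.match(line)
        if match and current_file is not None:
            hunk = Hunk(old_start=int(match.group(1)), new_start=int(match.group(2)))
            files[current_file].append(hunk)
            new_line = hunk.new_start
            continue
        if hunk is None:
            continue  # index lines, "--- a/..." headers, and similar
        if line.startswith("+"):
            hunk.added.append((new_line, line[1:]))
            new_line += 1
        elif line.startswith("-") or line.startswith("\\"):
            continue  # removals and "\ No newline" markers keep the new-file count
        else:
            new_line += 1  # context line
    return files
```

The new-file line numbers are what inline review comments are later anchored to, which is why removed lines must not advance the counter.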

Stage 4: Context gathering. If the repo is indexed, the tree-sitter symbol graph is queried for symbols referenced by the diff, and Personalized PageRank ranks them. The top 5-15 symbol bodies are fetched fresh from GitHub and attached to the relevant agent prompts.
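
The ranking step can be sketched with a plain power iteration. In Personalized PageRank the random-jump step lands only on a chosen seed set, here the symbols touched by the diff, so scores measure closeness to the change rather than global importance. The graph shape, the damping factor of 0.85, and the function names are assumptions made for this example.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration where the teleport step jumps only to seed nodes.

    graph: dict mapping a symbol to the list of symbols it references.
    seeds: symbols touched by the diff (the personalization set).
    """
    seeds = set(seeds)
    if not seeds:
        raise ValueError("at least one seed symbol is required")
    nodes = set(graph) | seeds | {t for targets in graph.values() for t in targets}
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    nxt[target] += share
            else:
                # Dangling node: hand its mass back to the seeds.
                for seed in seeds:
                    nxt[seed] += damping * rank[node] / len(seeds)
        rank = nxt
    return rank


def top_symbols(rank, k):
    """Return the k highest-ranked symbols, best first."""
    return sorted(rank, key=rank.get, reverse=True)[:k]
```

A symbol that nothing near the diff refers to ends up with a score of zero, which is how unrelated code is kept out of the prompt.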

Stage 5: Agent execution. One BullMQ child job per agent. Each gets the diff + context slice + system prompt scoped to its lens (Bugs / Security / etc.). Each LLM call runs on the user's own API key (BYOK).

Stage 6: Synthesizer. Receives all six reports + diff + custom guidelines. Produces the final verdict + summary + dedup'd inline comments mapped to (file, line).
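
One way the deduplication step might work is shown below: findings from all agents are keyed by (file, line), and when two agents flag the same location the more severe finding wins. The severity scale, the field names, and the tie-breaking rule are all assumptions for illustration, not a description of the actual synthesizer, which may well use the LLM itself to merge overlapping findings.

```python
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2}  # hypothetical scale


def dedupe_findings(reports):
    """Merge findings from several agents, keeping one per (file, line).

    reports: dict mapping agent name -> list of finding dicts with
    "file", "line", "severity", and "message" keys.
    """
    best = {}
    for agent, findings in reports.items():
        for finding in findings:
            key = (finding["file"], finding["line"])
            kept = best.get(key)
            if kept is None or (
                SEVERITY_ORDER[finding["severity"]] > SEVERITY_ORDER[kept["severity"]]
            ):
                best[key] = {**finding, "agent": agent}
    # Stable, readable order for posting: by file, then by line.
    return sorted(best.values(), key=lambda f: (f["file"], f["line"]))
```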

Stage 7: GitHub posting. A single POST to /repos/{owner}/{repo}/pulls/{pr}/reviews carries the verdict and inline comments. If GitHub rejects the request with a 422 because a commented line no longer exists (typical when a PR is force-pushed mid-review), the tool retries without the inline comments.
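
The fallback in Stage 7 can be sketched as follows. The HTTP call is injected as a function so the example stays independent of any client library; post_review and its return shape are hypothetical, while the event, body, and comments fields follow GitHub's pull request review payload.

```python
def post_review(post, verdict, summary, comments):
    """Post a review, falling back to summary-only on a 422.

    post: callable taking the JSON payload and returning the HTTP status code.
    Returns (final status, whether the final attempt included inline comments).
    """
    payload = {"event": verdict, "body": summary, "comments": comments}
    status = post(payload)
    if status == 422 and comments:
        # Likely stale line numbers after a force-push:
        # keep the summary, drop the inline comments.
        status = post({"event": verdict, "body": summary, "comments": []})
        return status, False
    return status, bool(comments)
```

Dropping all inline comments is the bluntest recovery. A finer-grained variant could remove only the comments whose lines no longer exist, at the cost of re-fetching the diff first.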

Why queue-based architecture matters

Code reviews are bursty. A repo can get 1 PR a day for a week, then 30 PRs in an hour during sprint-end. Naive 'spawn-a-thread-per-review' breaks under bursts.

Queue-based pipelines (BullMQ on Redis, AWS SQS, GCP Pub/Sub) decouple webhook receipt from worker capacity. The webhook handler does the gating + enqueue and returns 200 in milliseconds; workers pull jobs at their own pace. During a burst the queue simply grows and drains later, so backpressure needs no special handling in the handler.

Backoff + retry is also queue-level. Failed jobs go to a delayed queue with exponential backoff before retry. Stuck jobs (no progress in N minutes) get marked failed and emit a notification to the user so they aren't waiting on an opaque spinner forever.

Real-time progress streaming

A code review takes 30-90s end-to-end. Without progress feedback, the user opens the PR, looks for the AI comment, doesn't see it, refreshes, sees the spinner, refreshes again, eventually sees the comment. Bad UX.

Better: Socket.IO (or Server-Sent Events) streams progress events to the dashboard as the pipeline progresses. Events like review:started, agent:started (per agent), agent:completed, synthesizer:started, review:completed.
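
If the Server-Sent Events option is used, each progress event becomes a small text frame on a long-lived HTTP response. The sketch below formats one such frame; the event names come from the list above, while format_sse and the payload fields are illustrative.

```python
import json


def format_sse(event: str, data: dict) -> str:
    """Encode one Server-Sent Events frame: an event line, a data line, a blank line."""
    # json.dumps emits no raw newlines by default, so the payload fits on one data line.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


# Frames a dashboard might receive over the course of one review.
frames = [
    format_sse("review:started", {"pr": 42}),
    format_sse("agent:started", {"agent": "security"}),
    format_sse("agent:completed", {"agent": "security", "findings": 2}),
    format_sse("review:completed", {"pr": 42}),
]
```

The blank line at the end of each frame is what tells the browser's EventSource that the event is complete and can be dispatched.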

The user watches the agents finish in real time, so the system feels responsive rather than opaque. Streaming also surfaces failures: if an agent fails, the event reports it immediately instead of leaving the user staring at a hung review.

Failure modes + retry semantics

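Stage 1 (Webhook) failures: signature mismatch → 401 to GitHub and nothing is enqueued (GitHub records the delivery as failed; it does not redeliver automatically). Duplicate delivery, such as a manual redelivery → 200 (idempotent).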

Stage 2 (Gating) failures: quota exhausted → return a friendly 'upgrade' notification, don't enqueue. Auth failure → 401, no review.

Stage 3 (Diff fetch) failures: GitHub API 5xx → retry with exponential backoff (3 attempts, 1s/4s/16s + jitter), then mark job failed. GitHub API rate limit → check X-RateLimit-Remaining and defer the job until the time given in X-RateLimit-Reset.
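
The 1s/4s/16s schedule is a geometric series with factor 4, and the rate-limit deferral can be derived from GitHub's X-RateLimit-Reset header, which carries a UTC epoch timestamp in seconds. A minimal sketch of both calculations, assuming jitter of up to 25% on top of each delay (the source does not state the jitter size) and hypothetical function names:

```python
import random


def backoff_delays(attempts=3, base=1.0, factor=4.0, jitter=0.25, rng=random.random):
    """Delays of base * factor**n seconds, each stretched by up to `jitter` (a fraction)."""
    return [base * factor**n * (1.0 + jitter * rng()) for n in range(attempts)]


def seconds_until_reset(reset_epoch: int, now: float) -> float:
    """How long to defer a rate-limited job, given the X-RateLimit-Reset value."""
    return max(0.0, reset_epoch - now)
```

Jitter matters because many jobs tend to fail at the same moment during a GitHub incident; without it they would all retry in lockstep.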

Stage 4 (Context) failures: tree-sitter parsing fails on edge-case syntax → log + fall back to diff-only review (no context attached). Acceptable degradation; the review still ships.

Stage 5 (Agent) failures: LLM provider 429/503 → retry within agent (the LLM service layer handles its own backoff). LLM returns malformed JSON → ask once more with a stricter prompt, then fail the agent. Other agents continue.
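
The "ask once more with a stricter prompt" rule can be sketched like this. The LLM call is injected as a plain function, and the wording of the stricter instruction is a placeholder; run_agent and STRICT_SUFFIX are hypothetical names.

```python
import json

# Hypothetical wording for the second, stricter attempt.
STRICT_SUFFIX = "\nRespond with a single valid JSON object and nothing else."


def run_agent(call_llm, prompt):
    """Call the model at most twice; return parsed JSON, or None if both replies are malformed.

    call_llm: callable taking a prompt string and returning the raw reply string.
    """
    for attempt_prompt in (prompt, prompt + STRICT_SUFFIX):
        raw = call_llm(attempt_prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed reply: fall through to the stricter attempt
    return None
```

Returning None rather than raising lets the caller mark this one agent as failed while the other agents carry on.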

Stage 6 (Synthesizer) failures: works on whichever agents succeeded; produces a degraded review noting which agents failed.

Stage 7 (GitHub post) failures: 422 (line not found, usually force-push race) → retry without inline comments. 5xx → retry with backoff. Permanent failure → mark job failed, notify user with the actual error.

Pipeline design choices that matter

Per-agent system prompts: each agent should focus on ONE lens. Mixing concerns in a prompt drops quality. Bugs agent shouldn't be asked about style.

Structured output: agents return JSON with a fixed schema, not freeform text. Makes the synthesizer's dedup logic deterministic and avoids parsing errors.
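
As an illustration of what a fixed schema buys, the sketch below validates an agent's decoded JSON into typed records and rejects anything that does not fit. The field names and the three severity levels are assumptions for this example; the source does not spell out the actual schema.

```python
from dataclasses import dataclass

ALLOWED_SEVERITIES = {"low", "medium", "high"}  # hypothetical levels


@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    severity: str
    message: str


def parse_findings(payload):
    """Validate an agent's decoded JSON; raise ValueError on any schema violation."""
    if not isinstance(payload, dict) or not isinstance(payload.get("findings"), list):
        raise ValueError("expected an object with a 'findings' list")
    findings = []
    for item in payload["findings"]:
        if not isinstance(item, dict):
            raise ValueError(f"finding is not an object: {item!r}")
        try:
            finding = Finding(item["file"], item["line"], item["severity"], item["message"])
        except KeyError as exc:
            raise ValueError(f"finding is missing field {exc}") from exc
        line_ok = isinstance(finding.line, int) and not isinstance(finding.line, bool)
        if not (isinstance(finding.file, str) and line_ok and finding.line >= 1):
            raise ValueError(f"bad file or line in {item!r}")
        if finding.severity not in ALLOWED_SEVERITIES or not isinstance(finding.message, str):
            raise ValueError(f"bad severity or message in {item!r}")
        findings.append(finding)
    return findings
```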

Idempotency: webhook delivery IDs are deduplicated. PR review re-runs don't post duplicate inline comments; they update or replace the previous review.

Cost control: per-repo focus areas (only run agents that matter for that repo) cut token spend. Per-repo model overrides (pin Haiku on cheap repos, Opus on critical ones) tune cost/quality.

Observability: every agent run logged. Per-agent latency tracked. Per-finding precision can be computed by joining 'AI flagged X' with 'human resolved X as true/false-positive'. This data feeds back into prompt iteration.
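
The join described above reduces to a small calculation: for each agent, precision is the share of its human-judged findings that were marked true positives. A sketch under the assumption that findings carry an ID and an agent name and that triage verdicts are stored as booleans keyed by finding ID (both data shapes are hypothetical):

```python
from collections import defaultdict


def precision_by_agent(findings, verdicts):
    """Per-agent precision over the findings that humans have judged.

    findings: iterable of (finding_id, agent) pairs.
    verdicts: dict mapping finding_id -> True (real issue) or False (false positive).
    """
    judged = defaultdict(int)
    confirmed = defaultdict(int)
    for finding_id, agent in findings:
        if finding_id not in verdicts:
            continue  # not triaged yet: excluded from the denominator
        judged[agent] += 1
        if verdicts[finding_id]:
            confirmed[agent] += 1
    return {agent: confirmed[agent] / judged[agent] for agent in judged}
```

Agents with no judged findings are left out of the result rather than reported as 0%, since there is no evidence either way.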

FAQs

How is this different from a single 'just ask GPT-4 to review the PR' approach?

Single-prompt code review works for small PRs and surface-level findings. A real pipeline handles: bursty load (queue + workers), 6 concerns in parallel (multi-agent), repo context (indexer), real-time streaming UX, retries on failure, idempotency on webhooks, dedup'd output, billing/quota gating, and per-repo customization. The single-prompt approach is the equivalent of running an unindexed database query on every page load: fine until you grow.

What's the typical latency budget?

End-to-end target: under 90 seconds. Webhook → enqueue: <100ms. Worker pickup: <2s. Diff fetch: 1-3s. Context retrieval: <1s. Agents (parallel): 10-60s gated by slowest agent. Synthesizer: 5-15s. GitHub post: 1-3s. The dominant time is the agents themselves — driven by LLM provider latency and diff size.
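
Adding up the upper bound of each stage shows how the budget fits under the 90-second target. The figures are the ones quoted above, expressed in milliseconds; the stages run sequentially, with the agent stage counted once because the agents run in parallel.

```python
# Upper bound per stage, in milliseconds, taken from the budget above.
STAGE_BUDGET_MS = {
    "webhook_to_enqueue": 100,
    "worker_pickup": 2_000,
    "diff_fetch": 3_000,
    "context_retrieval": 1_000,
    "agents_parallel": 60_000,  # gated by the slowest agent
    "synthesizer": 15_000,
    "github_post": 3_000,
}

TARGET_MS = 90_000

worst_case_ms = sum(STAGE_BUDGET_MS.values())  # 84_100
headroom_ms = TARGET_MS - worst_case_ms        # 5_900
agent_share = STAGE_BUDGET_MS["agents_parallel"] / worst_case_ms  # roughly 0.71
```

The worst case comes to 84.1 seconds, leaving about 6 seconds of headroom, and the agent stage alone accounts for roughly 71% of it.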

Can the pipeline handle force-pushes mid-review?

Yes, but it's edge-case work. If the PR's head SHA changes after the diff was fetched, the inline comments referencing old line numbers may be invalid. GitHub returns 422 on the review POST. The pipeline retries without inline comments, posting just the summary. The next review (triggered by the new push) gets the up-to-date diff.

How does the pipeline scale to monorepos with thousands of PRs/day?

Two scaling dimensions. Horizontal: more worker pods, larger queue. Per-job optimisation: pre-filter PRs (skip docs-only PRs, skip lockfile-only PRs), cache provider responses on identical diffs, batch agent prompts where allowed by the provider API. At 1k+ PRs/day the cost optimization work pays for itself within weeks.
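
The pre-filter is the cheapest of these optimisations, since it needs only the list of changed file paths. A sketch, assuming a hand-maintained set of doc suffixes and lockfile names (both lists are illustrative and would need tuning per repo):

```python
import posixpath

# Illustrative lists; a real deployment would make these configurable per repo.
DOC_SUFFIXES = (".md", ".rst", ".txt")
LOCKFILES = {
    "package-lock.json", "yarn.lock", "pnpm-lock.yaml",
    "Cargo.lock", "poetry.lock", "go.sum",
}


def is_doc(path: str) -> bool:
    return path.startswith("docs/") or path.lower().endswith(DOC_SUFFIXES)


def is_lockfile(path: str) -> bool:
    return posixpath.basename(path) in LOCKFILES


def should_review(changed_paths) -> bool:
    """Return False for empty, docs-only, or lockfile-only PRs."""
    paths = list(changed_paths)
    if not paths:
        return False
    docs_only = all(is_doc(p) for p in paths)
    locks_only = all(is_lockfile(p) for p in paths)
    return not (docs_only or locks_only)
```

A PR that mixes one source file with any number of docs or lockfile changes still gets a full review, so the filter errs on the side of reviewing.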

What happens if my BYOK key runs out of credit mid-review?

The agent fails with a billing or quota-exhausted error from the provider (often a 402 or 429, depending on the provider). The pipeline marks that agent failed, the synthesizer runs on the remaining agents, and the user gets a degraded review plus a notification suggesting they refill their AI provider account. The pipeline doesn't burn LGTM's resources retrying a key that's known to be out of credit.
