PageRank in code review context

PageRank applied to code: rank the symbols in a repo by how 'important' they are to a given change. It is built on the same algorithm Google used for web pages, but the graph is the call/import structure of your codebase. For code-context retrieval it avoids several weaknesses of vector embeddings, discussed below.

PageRank — quick recap

PageRank, the algorithm Larry Page and Sergey Brin published in 1998, treats the web as a directed graph where pages are nodes and hyperlinks are edges. The score for each page reflects how many high-ranked pages link to it, computed via iterated matrix multiplication until the values stabilize.

The result: a global importance ordering. Pages cited by many other important pages rank highest. The genius of PageRank wasn't the math (Markov chains were old); it was applying it to web-link structure to get a robust importance signal.
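As a rough illustration of the iteration described above, here is a minimal power-iteration sketch in plain Python. The function name `pagerank`, the four-page `web` graph, and the choice to spread a dangling page's score evenly over all pages are assumptions made for this example, not details taken from the original paper.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Global PageRank by power iteration over a dict {node: [targets]}."""
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node first receives the uniform "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = edges.get(node, [])
            if targets:
                # A node splits its damped score equally among its out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a node with no out-links spreads its score evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: A, B and C all link to D; D links back to A.
web = {"A": ["D"], "B": ["D"], "C": ["D"], "D": ["A"]}
scores = pagerank(web)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph D ranks first because three pages link to it, and A ranks second even though it has a single in-link, because that link comes from the highest-ranked page.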

Personalized PageRank is a variant in which the 'restart' step (the random jump that standard PageRank spreads uniformly over all nodes) is biased toward a specific subset of nodes. Instead of asking 'what's globally important?', it asks 'what's important relative to THIS starting set?'
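One common way to write this down is shown here. The symbols are introduced for illustration only: d is the damping factor, P is the row-normalized link matrix (P_ij = 1/outdegree(i) if node i links to node j, else 0), r is the score vector, and s is the restart distribution.

```latex
\mathbf{r}^{(t+1)} = (1 - d)\,\mathbf{s} + d\,P^{\top}\mathbf{r}^{(t)},
\qquad
s_i =
\begin{cases}
1/n & \text{global PageRank, for all } n \text{ nodes} \\
1/|S| \text{ if } i \in S,\ 0 \text{ otherwise} & \text{personalized, seed set } S
\end{cases}
```

The update rule is the same in both cases. Only the vector s changes, which is why the personalized variant costs no more per iteration than the global one.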

Code as a graph

A codebase is naturally graph-shaped. Functions call other functions. Files import other files. Classes extend classes. Types reference types. Modules depend on modules. Every one of these relationships is a directed edge from a 'caller' (depends on) to a 'callee' (depended on).

Parsers like tree-sitter give you syntax trees from which these edges can be extracted deterministically. The result is a symbol graph: 10k-100k nodes for a mid-sized repo, with hundreds of thousands of edges. It is a plain graph data structure, exactly the substrate PageRank operates on.
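To make the edge-extraction step concrete without depending on a particular tree-sitter binding, the sketch below swaps in Python's standard-library `ast` module as a stand-in parser. The sample module, the function names in it, and the helper `call_edges` are all hypothetical. It only resolves direct calls by bare name within one file, which is far less than a real indexer would need.

```python
import ast

# Hypothetical module used only for this illustration.
SOURCE = '''
def hash_password(raw):
    return raw[::-1]

def load_user(name):
    return {"name": name}

def validate_user(name, raw):
    user = load_user(name)
    return user is not None and hash_password(raw) != ""
'''


def call_edges(source):
    """Return sorted (caller, callee) pairs between top-level functions of one module."""
    tree = ast.parse(source)
    defined = {node.name for node in tree.body if isinstance(node, ast.FunctionDef)}
    edges = set()
    for func in tree.body:
        if not isinstance(func, ast.FunctionDef):
            continue
        for node in ast.walk(func):
            # Only direct calls by bare name, such as load_user(...), are resolved.
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                if node.func.id in defined:
                    edges.add((func.name, node.func.id))
    return sorted(edges)


edges = call_edges(SOURCE)
```

Each pair is a directed edge from caller to callee, so `validate_user` ends up with two outgoing edges and the two helpers have none. Parsing the same source twice yields the same edge list, which is the determinism the text relies on.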

What personalized PageRank gives you for code review

When a PR touches a function called `validateUser()`, you want a code reviewer (AI or human) to also see the things `validateUser()` itself calls, the types it accepts, the conventions used by similar functions in the repo. That's the 'context' that turns a generic review into a repo-aware review.

Put the seed (restart) weight on the PR's changed symbols. Run personalized PageRank with a damping factor around 0.85 (standard) and iterate until the scores stabilize, typically a few dozen iterations. The top-ranked 5-15 symbols are the ones structurally most relevant to the change, and they get pulled into the LLM prompt at review time.
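The selection step can be sketched as follows. This is a minimal illustration under stated assumptions, not a production implementation: the symbol graph, the helper names `personalized_pagerank` and `top_context`, and the rule that symbols with no outgoing edges hand their score back to the seeds are all choices made for this example.

```python
def personalized_pagerank(edges, seeds, damping=0.85, iterations=50):
    """Personalized PageRank over {caller: [callees]}; restarts land only on seeds."""
    seeds = set(seeds)
    nodes = set(edges) | seeds
    for targets in edges.values():
        nodes.update(targets)
    restart = {node: (1.0 / len(seeds) if node in seeds else 0.0) for node in nodes}
    rank = dict(restart)

    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) * restart[node] for node in nodes}
        for node in nodes:
            targets = edges.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a symbol with no outgoing edges returns its score to the seeds.
                for seed in seeds:
                    new_rank[seed] += damping * rank[node] / len(seeds)
        rank = new_rank
    return rank


def top_context(scores, seeds, k=15):
    """Pick the k highest-scoring reachable symbols outside the diff (ties broken by name)."""
    candidates = [n for n in scores if n not in seeds and scores[n] > 0.0]
    return sorted(candidates, key=lambda n: (-scores[n], n))[:k]


# Hypothetical symbol graph: edges point from caller to callee.
graph = {
    "validateUser": ["loadUser", "hashPassword"],
    "loadUser": ["dbQuery"],
    "hashPassword": [],
    "dbQuery": [],
    "renderHome": ["formatDate"],
    "formatDate": [],
}
changed = {"validateUser"}
scores = personalized_pagerank(graph, changed)
context = top_context(scores, changed, k=3)
```

With `validateUser` as the only seed, its direct callees outrank the transitive callee `dbQuery`, while `renderHome` and `formatDate` score exactly zero because nothing connects them to the change.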

Crucially, this works without semantic understanding. PageRank doesn't care what the code MEANS. It cares about how it's connected. The result is fast (linear-ish in graph size), deterministic, and gives the same answer for the same code structure every time.

PageRank vs vector embeddings for code retrieval

The 'standard' way to do context retrieval for AI is vector embeddings: chunk the code, embed each chunk, store in a vector database, retrieve top-k cosine-similar chunks at query time. Works well for documentation and FAQs.

It works less well for code for four reasons:

(1) Structural relationships are invisible to embeddings — two functions can be 'far apart' in embedding space but one literally calls the other.

(2) Chunking destroys context: a 200-line function gets split across chunks, so the retrieved chunk may be the middle of a function, which is useless on its own.

(3) Repo-specific vocabulary doesn't embed well: embeddings trained on general code don't 'understand' internal helper names and project jargon.

(4) Re-embedding on every push is expensive: either you pay per-token continuously, or you let the index go stale.

PageRank addresses all four. It operates on the graph (1). Whole symbols are nodes, so there is no chunking (2). Repo structure is intrinsic to the graph, not learned (3). Incremental graph updates are linear in the number of changed edges, with nothing to re-embed (4).
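To illustrate the incremental-update point, the sketch below keeps edges grouped by the file that defines the caller, so a push only touches the edges of the files it changed. The class `SymbolGraph`, its method names, and the file paths are hypothetical, and the sketch assumes each caller symbol is defined in exactly one file.

```python
from collections import defaultdict


class SymbolGraph:
    """Caller -> callees adjacency, updated one file at a time."""

    def __init__(self):
        self.edges_by_file = {}            # path -> set of (caller, callee)
        self.adjacency = defaultdict(set)  # caller -> set of callees

    def replace_file(self, path, new_edges):
        """Swap in freshly parsed edges for one file; return how many edges changed."""
        new_edges = set(new_edges)
        old_edges = self.edges_by_file.get(path, set())
        # Work is proportional to this file's edges, not to the whole repo.
        for caller, callee in old_edges - new_edges:
            self.adjacency[caller].discard(callee)
        for caller, callee in new_edges - old_edges:
            self.adjacency[caller].add(callee)
        self.edges_by_file[path] = new_edges
        return len(old_edges ^ new_edges)


graph = SymbolGraph()
graph.replace_file("src/auth.ts", [("validateUser", "loadUser"),
                                   ("validateUser", "hashPassword")])
graph.replace_file("src/home.ts", [("renderHome", "formatDate")])

# A push edits src/auth.ts: one call removed, one call added.
changed = graph.replace_file("src/auth.ts", [("validateUser", "loadUser"),
                                             ("validateUser", "checkSession")])
```

Only the two edges that actually differ are touched, and the entry for `src/home.ts` is never visited.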

PageRank in LGTM's context indexer

LGTM's context indexer extracts the symbol graph via tree-sitter, then runs personalized PageRank at review time with the PR's changed symbols as seeds. Damping is 0.85, and the scores typically converge within 30-50 iterations.

The top-ranked 5-15 symbols beyond the diff get their function bodies fetched (from GitHub, not from our DB) and attached to the relevant agent's prompt. The agent reasons about the change with access to the structurally-most-relevant context, not random embedding neighbors.

LLM cost: zero per review. PageRank runs as ordinary compute on our worker and consumes no tokens. The same graph that took 50 seconds to build at index time can be re-ranked in well under a second per review.

FAQs

Doesn't PageRank's 'importance' just mean 'called from many places'?

Largely yes, but weighted. A function called from 50 functions that are themselves heavily called ranks higher than one called from 50 functions that nothing else calls. It captures hierarchical importance in a way that's robust to the natural fan-out shape of real codebases.

How is personalized PageRank different from regular PageRank?

Regular PageRank gives a single global score per node. Personalized PageRank biases the random walker's restart distribution toward a specific seed set, so the resulting scores represent 'importance from the perspective of these seeds.' For code review, the seeds are the PR's changed symbols, meaning the scores reflect what's relevant to THIS PR specifically, not the repo overall.

What if a PR touches code with no clear graph structure (e.g., a new file)?

PageRank still works but produces a flatter ranking — the new symbol has no incoming edges yet, so its 'importance' is low. The retrieval falls back on file-system proximity (same directory) and import structure (what does the new file import?). That's fine for the AI agents; new files genuinely don't have much repo context to bring.

Can PageRank handle a 100k-file monorepo?

Yes. Each PageRank iteration is linear in edge count, and edge count grows roughly linearly with symbol count. A monorepo with 100k files and a few million edges converges in a few seconds. The bottleneck for monorepos is the initial indexing time (tree-sitter parsing), not the ranking.

Related across LGTM

Context indexer architecture

Tree-sitter symbol graph + personalized PageRank in the LGTM stack.

Monorepo onboarding

Why PageRank scales to 100k-file monorepos without per-PR re-embedding.

Related terms

Tree-sitter

Tree-sitter is an open-source incremental parser library (born at GitHub) that builds concrete syntax trees for source code in dozens of languages. AI code-review tools use it to extract symbols, calls, and dependencies for context-aware reasoning.

AI code review

AI code review uses large language models (and increasingly multi-agent pipelines) to review pull requests for bugs, security issues, performance regressions, readability, and style — automatically, on every PR, in 30-90 seconds.

LLM code review pipeline

An LLM code review pipeline is the end-to-end system that ingests a GitHub PR webhook, fetches the diff, gathers repo context, runs one or more LLM agents in parallel, synthesizes their outputs, and posts the result back to GitHub as a review with inline comments.
