Tree-sitter + PageRank = reviewers that actually understand your codebase
Most AI code reviewers see only the diff hunks. They miss the function called three frames up, the convention used everywhere else in the repo, the same bug fix that was already shipped in another file. LGTM's context indexer parses your repo with tree-sitter, builds a symbol graph, ranks it with PageRank, and feeds the relevant slice to every review agent — so the agents reason about your codebase, not generic patterns.
The generic-review problem
Imagine an LLM reviewing this PR diff:
+ const user = await fetchUser(req.params.id);
+ res.json({ email: user.email });
Without context, the reviewer sees nothing concerning. Two lines of code. They'll suggest adding a try/catch and move on.
But fetchUser() in your codebase returns User | null from the cache layer — and a null deref on .email already caused a P1 incident last quarter. The reviewer missed it because it never looked at fetchUser.
Same diff, indexed repo
The Bugs agent receives the diff + this:
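// from src/lib/cache.ts (indexed)
export async function fetchUser(
  id: string
): Promise<User | null> {
  // Awaiting first means a missing entry resolves to null whether cache.get is sync or async.
  return (await cache.get(`user:${id}`)) ?? null;
}
Now the verdict reads: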
"fetchUser() can return null on cache miss (line 42 of cache.ts). The new code derefs .email without a guard — likely null-pointer error in production. Suggest: if (!user) return res.status(404).end()."
How indexing actually works
Three stages. The first runs when you click "Index Codebase". The other two keep the index fresh on every push, incrementally.
Parse
Tree-sitter AST extraction
We walk your repo with a per-language tree-sitter grammar. For each source file we extract: exported function names + signatures, type/class definitions, call sites (who calls whom), and imports (who depends on whom).
Tree-sitter is fast — about 10ms per file on a modest CPU. A 5,000-file monorepo finishes the parse stage in ~50 seconds. We never store full source bodies; only symbol names, signatures, and the line range they occupy.
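As a rough illustration of what "symbols only" means, the sketch below shows one possible record shape. Every name in it (`ParsedSymbol`, `SymbolRecord`, `toRecord`, `estimateParseSeconds`) is an assumption made for illustration, not LGTM's actual schema. It shows the body being dropped before anything is persisted, and the parse-time estimate as plain arithmetic.

```typescript
type SymbolKind = "function" | "type" | "class" | "module";

// Hypothetical in-memory result of the parse walk (still holds the body text).
interface ParsedSymbol {
  name: string;
  kind: SymbolKind;
  signature: string;
  path: string;
  startLine: number; // 1-based, inclusive
  endLine: number;   // 1-based, inclusive
  body: string;
}

// Hypothetical persisted record: everything except the body.
type SymbolRecord = Omit<ParsedSymbol, "body">;

function toRecord(parsed: ParsedSymbol): SymbolRecord {
  // Copy fields explicitly so the body can never leak into storage.
  return {
    name: parsed.name,
    kind: parsed.kind,
    signature: parsed.signature,
    path: parsed.path,
    startLine: parsed.startLine,
    endLine: parsed.endLine,
  };
}

// Back-of-envelope parse time at the ~10 ms/file figure quoted above.
function estimateParseSeconds(fileCount: number, msPerFile = 10): number {
  return (fileCount * msPerFile) / 1000;
}
```

At 10 ms per file, `estimateParseSeconds(5000)` gives the 50 seconds quoted for a 5,000-file monorepo.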
Graph
Symbol-and-call graph
The parsed symbols feed into a directed graph: nodes are symbols (functions, types, modules), edges are relationships (calls, imports, extends). A mid-sized codebase typically yields 10k-100k nodes.
The graph captures the "who-uses-what" map of your code. When a PR touches fetchUser, we can instantly answer "what 23 places call this?" — without re-scanning the repo.
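One way to make the "who calls this?" lookup cheap is to keep a reverse index alongside the forward edges. The sketch below assumes a hypothetical `SymbolGraph` class and stores at most one edge kind per symbol pair for brevity; it illustrates the idea, not the indexer's real data structure.

```typescript
type EdgeKind = "calls" | "imports" | "extends";
type EdgeIndex = Map<string, Map<string, EdgeKind>>;

class SymbolGraph {
  // outgoing.get(a) = symbols that `a` references
  private outgoing: EdgeIndex = new Map();
  // incoming.get(b) = symbols that reference `b`
  private incoming: EdgeIndex = new Map();

  addEdge(from: string, to: string, kind: EdgeKind): void {
    this.link(this.outgoing, from, to, kind);
    this.link(this.incoming, to, from, kind);
  }

  // Answered from the reverse index, so no repo scan is needed.
  callersOf(symbol: string): string[] {
    const refs = this.incoming.get(symbol);
    if (!refs) return [];
    return Array.from(refs.entries())
      .filter(([, kind]) => kind === "calls")
      .map(([name]) => name)
      .sort();
  }

  private link(index: EdgeIndex, a: string, b: string, kind: EdgeKind): void {
    let refs = index.get(a);
    if (!refs) {
      refs = new Map<string, EdgeKind>();
      index.set(a, refs);
    }
    refs.set(b, kind);
  }
}
```

Keeping both directions doubles the edge storage, but a lookup becomes a single map access instead of a walk over every edge.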
Rank
Personalized PageRank retrieval
Symbols touched by a PR get a high "seed" weight. We then run personalized PageRank, a variant of the algorithm Google used to rank web pages, across the symbol graph. Symbols with strong call/import relationships to the seed bubble up.
The top-ranked symbols (typically 5-15 of them) get their signatures + bodies attached to the review agent prompts. PageRank beats vector embeddings here because code structure is graph-shaped, not semantically clustered — a function called from 50 places is contextually important even if it doesn't semantically "match" the diff text.
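The ranking step can be sketched as a textbook power iteration. This is a minimal version under stated assumptions: the graph is a plain adjacency map, edges are followed in the "references" direction only, the damping factor is the conventional 0.85, and the function names are hypothetical. It is not necessarily how LGTM implements or tunes its ranking.

```typescript
type Graph = Map<string, string[]>; // symbol -> symbols it references

function personalizedPageRank(
  graph: Graph,
  seeds: string[],
  damping = 0.85,
  iterations = 50,
): Map<string, number> {
  if (seeds.length === 0) throw new Error("at least one seed symbol is required");

  // Collect every node, including ones that only appear as edge targets.
  const nodes = new Set<string>(seeds);
  graph.forEach((targets, source) => {
    nodes.add(source);
    targets.forEach((t) => nodes.add(t));
  });

  // Teleport vector: uniform over the PR's changed symbols, zero elsewhere.
  const teleport = new Map<string, number>();
  nodes.forEach((n) => teleport.set(n, 0));
  seeds.forEach((s) => teleport.set(s, (teleport.get(s) ?? 0) + 1 / seeds.length));

  let rank = new Map(teleport);
  for (let i = 0; i < iterations; i++) {
    const next = new Map<string, number>();
    nodes.forEach((n) => next.set(n, 0));
    let dangling = 0; // score held by nodes with no outgoing edges

    rank.forEach((score, node) => {
      const targets = graph.get(node) ?? [];
      if (targets.length === 0) {
        dangling += score;
        return;
      }
      const share = score / targets.length;
      targets.forEach((t) => next.set(t, (next.get(t) ?? 0) + share));
    });

    // With probability (1 - damping) the walker restarts at a seed;
    // dangling mass is also sent back to the seeds so scores sum to 1.
    nodes.forEach((n) => {
      const walked = next.get(n) ?? 0;
      const tp = teleport.get(n) ?? 0;
      next.set(n, damping * (walked + dangling * tp) + (1 - damping) * tp);
    });
    rank = next;
  }
  return rank;
}

// Highest-scoring symbols first.
function topSymbols(rank: Map<string, number>, k: number): string[] {
  return Array.from(rank.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, k)
    .map(([name]) => name);
}
```

Because the walker only ever restarts at seed symbols, anything unreachable from the diff scores exactly zero. To surface callers of a changed symbol as well as its callees, one option would be to add reverse edges before ranking.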
Supported languages
Twelve languages today. Adding a language requires the tree-sitter grammar plus a per-language query file mapping syntax nodes to symbol kinds — about a day of work each. Open a request for a missing language at tarinagarwal@gmail.com.
| Language | Extensions | Support tier |
|---|---|---|
| TypeScript | .ts .tsx | first-class |
| JavaScript | .js .jsx .mjs | first-class |
| Python | .py .pyi | first-class |
| Go | .go | first-class |
| Rust | .rs | first-class |
| Java | .java | first-class |
| Ruby | .rb | stable |
| C | .c .h | stable |
| C++ | .cpp .hpp .cc .hh | stable |
| C# | .cs | stable |
| Swift | .swift | beta |
| Kotlin | .kt .kts | beta |
First-class = full call-graph, convention extraction, and history integration. Stable = symbol extraction and call graph; convention extractor is light. Beta = symbol extraction only; expect rougher edges on metaprogramming-heavy code.
Two extractors that pull in unwritten rules
Beyond raw symbols, the indexer runs two domain-specific extractors that mine your repo for conventions and history. These feed the review agents the "how we do things here" signal that's otherwise impossible to capture.
Convention extractor
Scans the repo for recurring patterns and marks them as conventions:
- Preferred HTTP client (fetch vs axios vs ky)
- Error-handling pattern (throw vs Result vs callback)
- Logging library + log-level conventions
- Naming conventions for tests, mocks, types
- Common helpers (formatDate, requireAuth, etc.)
If a PR reaches for moment.js in a repo that uses date-fns everywhere else, the reviewer flags it as a convention violation — not as a generic "moment.js is deprecated" comment.
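A convention check of this kind can be approximated by counting which library dominates a category of interchangeable choices. The sketch below is an assumption about how such a check might look: the 80% threshold, the category list, and the function names are all hypothetical.

```typescript
// How many files import each library within one category,
// e.g. date libraries: date-fns -> 48, dayjs -> 2.
type UsageCounts = Map<string, number>;

// Returns the library that holds at least `minShare` of usages, if any.
function dominantLibrary(counts: UsageCounts, minShare = 0.8): string | null {
  let total = 0;
  let best: string | null = null;
  let bestCount = 0;
  for (const [name, count] of Array.from(counts)) {
    total += count;
    if (count > bestCount) {
      best = name;
      bestCount = count;
    }
  }
  return total > 0 && bestCount / total >= minShare ? best : null;
}

// Flags an import only if it competes with an established convention.
function checkImport(
  imported: string,
  category: string[],
  counts: UsageCounts,
): string | null {
  if (category.indexOf(imported) === -1) return null; // different kind of library
  const convention = dominantLibrary(counts);
  if (convention === null || imported === convention) return null;
  return `Convention violation: this repo uses ${convention}; the PR imports ${imported}.`;
}
```

When no library reaches the threshold (say, fetch and axios split evenly), the sketch reports no convention rather than guessing.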
History extractor
Summarizes recent PR descriptions (last 50 merged PRs by default) and feeds them into the review prompts. Catches four important signals:
- Recurring themes ("we just refactored auth, be careful with session handling")
- Recent incidents ("P1 from null deref on fetchUser")
- Active migrations ("moving from Express to Hono; don't add new Express routes")
- Style decisions ("we decided no class components in last week's review")
Only the summaries are stored — not the full PR descriptions or commit content. Re-ranked every 50 PRs so the signal stays fresh.
What stays out of the index
Indexing has access to your source code — that's the point. Here's exactly what we keep and what we don't.
We keep
- Symbol names (function/class/type)
- Type signatures (params + return)
- Call relationships (A calls B)
- Import graph (file → file dependencies)
- File paths + line ranges
- Summarized PR history (last 50 PRs, ~100 words each)
We never keep
- Full source code bodies
- Comments or docstrings
- String/numeric literals
- Secrets, env vars, config
- Git commit messages or diffs
- Anything from .gitignored paths
When a review agent looks up fetchUser for context, the function body is fetched FRESH from GitHub at that moment using the installation token, held in worker memory for the duration of the review, and discarded. The body never lives in our database.
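The fetch-use-discard pattern described above can be sketched with a scoped helper. Everything here is hypothetical: `FetchFile` stands in for whatever call retrieves file contents from GitHub with the installation token, and `withSymbolBody` is an illustrative name. The stored index contributes only the path and line range.

```typescript
// What the index stores about a symbol's position.
interface SymbolLocation {
  path: string;
  startLine: number; // 1-based, inclusive
  endLine: number;   // 1-based, inclusive
}

// Hypothetical fetcher, e.g. a wrapper around a GitHub contents request.
type FetchFile = (path: string) => Promise<string>;

// Fetches the file, slices out the symbol body, and hands it to `use`.
// The body exists only as a local value for the duration of the callback.
async function withSymbolBody<T>(
  loc: SymbolLocation,
  fetchFile: FetchFile,
  use: (body: string) => T | Promise<T>,
): Promise<T> {
  const source = await fetchFile(loc.path);
  const body = source
    .split("\n")
    .slice(loc.startLine - 1, loc.endLine)
    .join("\n");
  return use(body);
}
```

Scoping the body to a callback makes it harder to persist it by accident: nothing outside `use` ever holds a reference to it.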
Performance characteristics
Rough numbers from production. Times are wall-clock on shared-cpu-1x workers in Fly Singapore.
| Repo size | Initial index | Incremental push | Index size |
|---|---|---|---|
| Small (< 100 files) | ~5s | < 1s | ~50 KB |
| Medium (100-1,000) | ~30s | 1-3s | ~500 KB |
| Large (1k-5k) | 1-3 min | 2-5s | ~2-5 MB |
| Monorepo (5k-20k) | 3-8 min | 5-15s | ~10-30 MB |
| Huge (> 20k) | contact us | varies | varies |
Incremental updates run on every push to the default branch. They're differential: only changed files get re-parsed, and only affected graph edges get re-ranked. The full index never needs to be rebuilt unless you click "Re-index" manually.
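A differential update along these lines could look like the sketch below. It assumes a simplified index (file path to symbol names) and a caller-supplied `parseFile`. `applyPush` and `PushDelta` are hypothetical names, and the real indexer tracks far more per symbol.

```typescript
type FileIndex = Map<string, string[]>; // file path -> symbols defined in it

interface PushDelta {
  changed: string[]; // added or modified files
  removed: string[]; // deleted files
}

// Re-parses only the changed files and returns the symbols whose
// graph edges may need re-ranking. Untouched files are never parsed.
function applyPush(
  index: FileIndex,
  delta: PushDelta,
  parseFile: (path: string) => string[],
): Set<string> {
  const affected = new Set<string>();
  const touch = (symbols: string[] | undefined): void => {
    (symbols ?? []).forEach((s) => affected.add(s));
  };

  for (const path of delta.removed) {
    touch(index.get(path)); // symbols that no longer exist
    index.delete(path);
  }
  for (const path of delta.changed) {
    touch(index.get(path)); // old symbols, some of which may be gone
    const fresh = parseFile(path);
    touch(fresh);           // new or updated symbols
    index.set(path, fresh);
  }
  return affected;
}
```

The work done is proportional to the size of the push, not the size of the repo, which is why the incremental column in the table above stays in the seconds range.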
Why PageRank, not vector embeddings
Most "context-aware" AI code tools use vector embeddings — chunk the code, embed each chunk, retrieve the top-k cosine-similar chunks at query time. That works for documentation and FAQs. It works less well for code.
Where embeddings fall short for code
- Structural relationships are invisible. Two functions can be cosine-distant in embedding space but one literally calls the other.
- Chunking destroys context. A function spanning 200 lines gets split across chunks. The chunk that gets retrieved might be the middle of the function — useless.
- Repo-specific vocabulary doesn't embed well. Internal type names, custom helpers, project-specific jargon — embeddings trained on general code don't understand these.
- Re-embedding on every push is expensive. Either you re-embed (~$0.10 per push on large repos) or you let the index go stale.
Why PageRank wins
- Code is graph-shaped, not vector-shaped. Call sites, imports, inheritance — these are first-class relationships. PageRank operates on exactly that structure.
- Importance scales with usage. A utility function called from 50 places ranks higher than an unused helper, even if they look identical to an embedding.
- Personalized seeds. We can bias the random walker toward the PR's changed symbols, so context surfaces what's actually relevant to THIS PR.
- Cheap incremental updates. Re-rank costs are linear in changed edges, not the full graph. Push doesn't pay per-token.
Monorepos work the same way
We treat your monorepo as one workspace. The graph spans all packages, so a review on packages/api can pull context from packages/types if there's a call/import relationship.
The PR-scoped retrieval keeps token spend in check — a review on a 100-package monorepo doesn't pull context from all 100 packages, only the ones the diff actually touches and the symbols those reach.
Per-package config (different focus areas per workspace, different model overrides per package) is on the roadmap. For now, per-repo settings cover everyone.
FAQ
Do I have to index every repo?
How fresh does the index stay?
... main every hour, the index is at most one hour stale.
What happens if I disconnect a repo?
Can I see what's in my index?
Does indexing send my code to OpenAI / Anthropic / Gemini?
What if my repo has a custom language / DSL?
Will indexing slow down my pushes?
Index your repo. Get better reviews.
One click from the dashboard. Works in the background. Symbol-only: source bodies are fetched from GitHub at review time and never stored in our database.