Tree-sitter

Tree-sitter is an open-source incremental parser library (born at GitHub) that builds concrete syntax trees for source code in dozens of languages. AI code-review tools use it to extract symbols, calls, and dependencies for context-aware reasoning.

What tree-sitter is

Tree-sitter is a parser generator and an incremental parsing runtime, originally built by Max Brunsfeld at GitHub to power semantic highlighting and code navigation. It's not a single parser: it's a framework that compiles language-specific grammar files into fast, error-tolerant parsers that produce concrete syntax trees (CSTs).

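Each language gets its own grammar file (typically 1-5k lines of declarative rules, written in a JavaScript DSL as `grammar.js`). The tree-sitter CLI generates a C parser from that file, and the C parser is then compiled, usually into a shared library. Grammars are distributed as separate packages rather than bundled with the core library; maintained grammars exist for 40+ languages, with community contributions adding more.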

The key innovations: incremental re-parsing (only re-parse the changed region of a file, not the whole thing) and error recovery (parser keeps going past syntax errors instead of bailing). Both matter for editors and code-analysis tools that operate on in-progress code.
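To make the incremental part concrete: before re-parsing, an incremental parser has to be told which byte range of the old text was replaced. Tree-sitter's edit structure carries a start byte, an old end byte, and a new end byte (plus row/column points, which are left out here). The sketch below is not tree-sitter code; it is a stdlib-only illustration of deriving those three offsets from two versions of a file, and `ByteEdit` and `compute_edit` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ByteEdit:
    """Hypothetical record of one text change, measured in bytes."""
    start_byte: int    # first byte that differs between old and new
    old_end_byte: int  # end of the replaced span in the old text
    new_end_byte: int  # end of the replacement span in the new text


def compute_edit(old: bytes, new: bytes) -> ByteEdit:
    """Find the smallest single span that turns `old` into `new`."""
    limit = min(len(old), len(new))

    # Length of the shared prefix.
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1

    # Length of the shared suffix, never overlapping the prefix.
    suffix = 0
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1

    return ByteEdit(prefix, len(old) - suffix, len(new) - suffix)


old_source = b"def f():\n    return 1\n"
new_source = b"def f():\n    return 42\n"
edit = compute_edit(old_source, new_source)
print(edit)  # ByteEdit(start_byte=20, old_end_byte=21, new_end_byte=22)
```

Everything outside that span is unchanged, which is what lets an incremental parser reuse the corresponding subtrees of the old tree instead of rebuilding them.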

Why tree-sitter beats regex / language-specific parsers for code analysis

Regex on source code is brittle. A regex for 'function declarations' catches some, misses others, and fires on string literals and comments that happen to look like function signatures. The maintenance burden grows with every edge case.
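A small illustration of that failure mode, using Python source as the input. Python's built-in `ast` module stands in for a syntax-tree parser here purely so the example runs with no dependencies; it is not tree-sitter, and the sample source is made up.

```python
import ast
import re

SOURCE = '''
# def old_helper(x): removed in v2
def real_function(a, b):
    note = "def fake_in_string(y): not code"
    return a + b
'''

# Naive pattern: anything that looks like "def name(" anywhere in the text.
NAIVE_DEF = re.compile(r"def\s+(\w+)\s*\(")


def names_by_regex(source: str) -> list:
    """Return every name the regex believes is a function."""
    return NAIVE_DEF.findall(source)


def names_by_parser(source: str) -> list:
    """Return names of real function definitions found in the syntax tree."""
    tree = ast.parse(source)
    return [n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]


print(names_by_regex(SOURCE))   # ['old_helper', 'real_function', 'fake_in_string']
print(names_by_parser(SOURCE))  # ['real_function']
```

The regex reports two functions that do not exist, one from a comment and one from a string literal. The parser knows those spans are a comment and a string, so only the real definition comes back.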

Language-specific parsers (using each compiler's own AST library) work but require integrating N different parsers for N languages. A multi-language code-review tool would need to embed and update parsers for TypeScript, Python, Go, Rust, Java, etc. — each with different update cadences and APIs.

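Tree-sitter solves both: one library and one API across every supported language. Every tree exposes the same node interface (node type, children, named fields, byte ranges), although node type names are specific to each grammar. That is why queries written in tree-sitter's S-expression query syntax carry over between languages with small adjustments rather than unchanged.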

Tree-sitter queries: the underrated power

Tree-sitter ships a Lisp-like query language for matching patterns in syntax trees. A query like `(function_declaration name: (identifier) @function-name)` matches every named function declaration in a file and binds the name to a `@function-name` capture group.

Multi-language code analysis tools write language-specific queries that map to language-agnostic concepts (functions, classes, imports, calls). Each language gets ~20-50 lines of query definitions. Adding a new language to a tree-sitter-based tool is therefore a day's work, not a week's.
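One way to organize that mapping is a plain table keyed by language and concept. The sketch below is an assumption about layout, not a description of any particular tool. The node type names follow the published TypeScript, Python, and Go grammars as best understood here, and should be checked against each grammar's `node-types.json` before use. `QUERIES`, `EXTENSION_TO_LANGUAGE`, and `query_for` are hypothetical names.

```python
import os
from typing import Optional

# language -> concept -> tree-sitter query text (node names are assumptions).
QUERIES = {
    "typescript": {
        "function": "(function_declaration name: (identifier) @name)",
        "class": "(class_declaration name: (type_identifier) @name)",
    },
    "python": {
        "function": "(function_definition name: (identifier) @name)",
        "class": "(class_definition name: (identifier) @name)",
    },
    "go": {
        "function": "(function_declaration name: (identifier) @name)",
        # Go has no classes; named type declarations fill the same slot.
        "class": "(type_declaration (type_spec name: (type_identifier) @name))",
    },
}

EXTENSION_TO_LANGUAGE = {".ts": "typescript", ".py": "python", ".go": "go"}


def query_for(path: str, concept: str) -> Optional[str]:
    """Pick the query for a file and a language-agnostic concept, if any."""
    _, ext = os.path.splitext(path)
    language = EXTENSION_TO_LANGUAGE.get(ext)
    if language is None:
        return None  # unsupported file type: skip it rather than guess
    return QUERIES[language].get(concept)


print(query_for("src/app.ts", "class"))
print(query_for("main.rb", "function"))
```

Everything downstream only ever asks for "function" or "class", so supporting another language means adding one more entry to the table.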

The query engine is fast — capable of running thousands of queries per second on a large file. That's why editors use tree-sitter for live highlighting and code-folding.

Where tree-sitter shines for AI code review

Diff-only AI code review (sending the diff and only the diff to an LLM) misses context. The LLM doesn't see the function that's called from a changed line, the type definition that's referenced, or the convention used elsewhere in the file.

Tree-sitter lets a review tool extract symbols (functions, classes, types) from the entire repo, build a symbol graph, and at review-time pull in just the relevant symbol bodies for the changed code. This 'context indexing' is what separates context-aware AI code review from naive diff-only review.
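The "pull in just the relevant symbols" step can be sketched as a depth-limited walk over the symbol graph. This is a minimal illustration under assumed data: the `CALLS` graph, the symbol names, and `context_symbols` are all hypothetical, and a real indexer would rank candidates rather than take every neighbor.

```python
from collections import deque

# Hypothetical call graph: symbol -> symbols it references.
CALLS = {
    "checkout": ["compute_total", "charge_card"],
    "compute_total": ["apply_discount"],
    "charge_card": ["log_payment"],
    "apply_discount": [],
    "log_payment": [],
    "render_homepage": ["load_banner"],
    "load_banner": [],
}


def context_symbols(graph: dict, changed: list, max_depth: int = 1) -> list:
    """Symbols reachable from the changed ones within max_depth hops.

    Results come back in breadth-first order and exclude the changed
    symbols themselves, since the diff already contains those.
    """
    seen = set(changed)
    queue = deque((name, 0) for name in changed)
    picked = []
    while queue:
        name, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand past the hop limit
        for callee in graph.get(name, []):
            if callee not in seen:
                seen.add(callee)
                picked.append(callee)
                queue.append((callee, depth + 1))
    return picked


print(context_symbols(CALLS, ["checkout"], max_depth=1))
# ['compute_total', 'charge_card']
print(context_symbols(CALLS, ["checkout"], max_depth=2))
# ['compute_total', 'charge_card', 'apply_discount', 'log_payment']
```

Unrelated code such as `render_homepage` never enters the result, which is the point: the LLM sees what the change touches, not the whole repo.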

The performance characteristics work: parsing a 5,000-file monorepo takes about 50 seconds with tree-sitter; incremental re-parsing on a push of 10 changed files takes under a second.

Tree-sitter on LGTM

LGTM's context indexer uses tree-sitter for 12 languages: TypeScript, JavaScript, Python, Go, Rust, Java, Ruby, C, C++, C#, Swift, Kotlin. When you connect a repo and click 'Index Codebase', we walk the repo with the appropriate grammar per file, extract symbols (names + signatures + line ranges, NOT source bodies), and build a graph of who-calls-whom and who-imports-what.

Incremental updates run on every push to the default branch: only the changed files are re-parsed, and only the affected graph edges are re-ranked. The index stays fresh without a full rebuild.
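As a rough sketch of how "only the changed files" can be decided, the example below compares content hashes from the previous index with the current file contents. Hash comparison is an assumption made for illustration; a real indexer could just as well take the changed-file list from the push event. `content_hash` and `plan_reindex` are hypothetical names.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a file's contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(previous: dict, current_files: dict):
    """Decide which files need work after a push.

    previous:      path -> hash recorded by the last index run
    current_files: path -> file contents at the new commit
    Returns (paths to re-parse, paths to drop from the index, new hashes).
    """
    new_hashes = {path: content_hash(text) for path, text in current_files.items()}
    to_parse = sorted(p for p, h in new_hashes.items() if previous.get(p) != h)
    to_drop = sorted(p for p in previous if p not in new_hashes)
    return to_parse, to_drop, new_hashes


# First run: nothing indexed yet, so every file is parsed.
files_v1 = {"a.py": "x = 1\n", "b.py": "y = 2\n", "c.py": "z = 3\n"}
_, _, hashes_v1 = plan_reindex({}, files_v1)

# A push edits b.py, adds d.py, and deletes c.py.
files_v2 = {"a.py": "x = 1\n", "b.py": "y = 20\n", "d.py": "w = 4\n"}
to_parse, to_drop, hashes_v2 = plan_reindex(hashes_v1, files_v2)
print(to_parse)  # ['b.py', 'd.py']
print(to_drop)   # ['c.py']
```

The untouched `a.py` is never handed to the parser, so the cost of an update scales with the size of the push rather than the size of the repo.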

The 'never store source bodies' guarantee follows from how tree-sitter represents code: syntax nodes carry types and positions (byte offsets, rows, columns), not copies of the text. We keep what the queries capture (symbol names, signatures, line ranges), not the function bodies themselves. Bodies are fetched fresh from GitHub at review time, held in worker memory for the LLM call, then discarded.
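A minimal sketch of what such an index entry could look like, and how a body can be recovered on demand from freshly fetched source. The field names, `SymbolRecord`, and `body_for` are hypothetical; this shows the idea rather than LGTM's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SymbolRecord:
    """What gets persisted: structure only, no source body."""
    name: str
    signature: str
    path: str
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive


def body_for(record: SymbolRecord, fresh_source: str) -> str:
    """Slice the symbol's body out of source fetched at review time."""
    lines = fresh_source.splitlines()
    return "\n".join(lines[record.start_line - 1 : record.end_line])


# Stand-in for file contents fetched fresh at review time.
fetched = (
    "import os\n"
    "\n"
    "def greet(name):\n"
    "    msg = 'hi ' + name\n"
    "    return msg\n"
    "\n"
    "print(greet('x'))\n"
)

record = SymbolRecord("greet", "def greet(name)", "app.py", 3, 5)
body = body_for(record, fetched)
print(body)
```

The record alone cannot reproduce the function: without a fresh copy of the file, all it holds is a name, a signature, and two line numbers.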

Examples

Tree-sitter query example (TypeScript)

; Match exported function declarations
(export_statement
  (function_declaration
    name: (identifier) @function-name
    parameters: (formal_parameters) @params)) @export

; Match class methods
(class_declaration
  name: (type_identifier) @class-name
  body: (class_body
    (method_definition
      name: (property_identifier) @method-name) @method))

FAQs

Is tree-sitter the same as an LLM-based code parser?

No. Tree-sitter is deterministic and rule-based: same input, same tree, always. LLM-based 'parsing' uses a model to interpret code and is non-deterministic. A tree-sitter parse of a typical source file takes on the order of milliseconds, and incremental re-parses after small edits are faster still; LLM calls take seconds. For static analysis, tree-sitter is the right tool; for semantic understanding, LLMs are layered on top.

Does tree-sitter need a build step?

Yes. Each grammar's generated C parser has to be compiled, typically into a shared library (.so / .dylib / .dll) or to WebAssembly for use in the browser. Most tree-sitter integrations ship pre-compiled binaries per language so you don't need to build at runtime. Bindings for Node.js, Python, and Rust are all available.

How accurate is tree-sitter on broken code?

Very. Tree-sitter is designed for editors that need to parse in-progress code with syntax errors. It produces a tree even when the input is partially malformed, wrapping the broken regions in ERROR nodes and inserting MISSING nodes where an expected token is absent. The rest of the file still parses normally, which is much better than tools that bail on the first syntax error.

Which languages does tree-sitter NOT support well?

Hand-rolled parsers still beat tree-sitter for some languages with extreme ambiguity (Perl, classic C++ templates) or non-context-free grammars. For everything else — TypeScript, Python, Go, Rust, Java, C#, Ruby, Swift, Kotlin, Lua, Bash, SQL, YAML — tree-sitter grammars are at or above hand-rolled quality and easier to maintain.

Related across LGTM

LGTM's context indexer

How tree-sitter + PageRank build the per-PR context window.

TS code review

What tree-sitter coverage looks like for TypeScript projects.

Python code review

Tree-sitter Python grammar handling — Django, FastAPI, type hints.

Monorepo onboarding

Why tree-sitter is the right primitive for cross-package monorepo review.

Related terms

PageRank in code review context

PageRank applied to code: rank symbols in a repo by how 'important' they are to a given change. Built on the same algorithm Google used for web pages — but the graph is the call/import structure of your codebase. Outperforms vector embeddings for code-context retrieval.

AI code review

AI code review uses large language models (and increasingly multi-agent pipelines) to review pull requests for bugs, security issues, performance regressions, readability, and style — automatically, on every PR, in 30-90 seconds.

LLM code review pipeline

An LLM code review pipeline is the end-to-end system that ingests a GitHub PR webhook, fetches the diff, gathers repo context, runs one or more LLM agents in parallel, synthesizes their outputs, and posts the result back to GitHub as a review with inline comments.
