Tree-sitter Integration

Why Aura embeds tree-sitter, how grammars are loaded, and how we handle parse failures gracefully.

Overview

Tree-sitter is a parser generator that produces incremental, error-tolerant parsers. Its grammars are maintained as open-source packages, battle-tested in editors and code tools (Neovim, Helix, Atom, GitHub's code navigation), and fast enough to re-parse on every keystroke. When Aura needs to turn a source file into a syntax tree, tree-sitter is the default.

This page is for engineers who want to know what "backed by tree-sitter" actually means in Aura's codebase: how grammars are vendored and loaded, how parsing failures are handled, what the accuracy guarantees are, and where tree-sitter ends and Aura's own logic begins.

A parser you can trust is a parser that fails loudly. Tree-sitter's error-tolerant design lets Aura degrade instead of lying.

How It Works

Why tree-sitter and not hand-rolled parsers

Three concrete reasons:

  1. Breadth. Every language Aura supports today has a maintained tree-sitter grammar. Building fifteen hand-rolled parsers would be a full engineering team for a year.
  2. Error tolerance. Tree-sitter parsers produce a best-effort tree even on syntactically broken input. The tree flags ERROR and MISSING nodes but keeps going. This is exactly what you want for merge — the input is often mid-edit, and a parser that aborts on the first syntax error is useless.
  3. Incremental parsing. Re-parsing a 10k-line file after a one-line edit is sub-millisecond. Aura leans on this during live sync to push merge deltas as a user types.

The tradeoffs are real but manageable: tree-sitter grammars encode concrete syntax, not abstract semantics. They do not do name resolution, type inference, or macro expansion. Aura's adapters (see cross-language AST) layer those on where needed.

Grammar loading

Grammars ship with Aura as static libraries, not runtime-loaded dynamic libraries. Each grammar is vendored as a pinned version, built during aura build, and linked into the binary. The manifest lives in crates/aura-parsers/Cargo.toml and looks roughly like this:

[dependencies]
tree-sitter              = "0.22"
tree-sitter-rust         = "0.21.2"
tree-sitter-typescript   = "0.20.4"
tree-sitter-python       = "0.20.4"
tree-sitter-go           = "0.20.0"
tree-sitter-java         = "0.20.2"
# ...

At runtime, a LanguageRegistry holds one Language handle per supported grammar, keyed by file extension and first-line patterns (shebangs, doctype declarations). Resolving the grammar for a file path is a HashMap lookup.
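The lookup described above can be sketched in a few lines. This is a minimal illustration, not Aura's actual LanguageRegistry: the `Grammar` enum stands in for a `tree_sitter::Language` handle, and the extension table and shebang patterns are assumptions chosen for the example.

```rust
use std::collections::HashMap;
use std::path::Path;

/// Hypothetical stand-in for a `tree_sitter::Language` handle.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Grammar {
    Rust,
    TypeScript,
    Python,
    Bash,
}

pub struct LanguageRegistry {
    by_extension: HashMap<&'static str, Grammar>,
    /// (required prefix, required substring, grammar), checked against the first line.
    by_first_line: Vec<(&'static str, &'static str, Grammar)>,
}

impl LanguageRegistry {
    pub fn new() -> Self {
        // Illustrative entries only; the real tables are not shown in this page.
        let by_extension = HashMap::from([
            ("rs", Grammar::Rust),
            ("ts", Grammar::TypeScript),
            ("py", Grammar::Python),
            ("sh", Grammar::Bash),
        ]);
        let by_first_line = vec![
            ("#!", "python", Grammar::Python),
            ("#!", "bash", Grammar::Bash),
        ];
        Self { by_extension, by_first_line }
    }

    /// Try the extension first, then fall back to first-line patterns
    /// (useful for extensionless scripts).
    pub fn resolve(&self, path: &Path, first_line: &str) -> Option<Grammar> {
        if let Some(ext) = path.extension().and_then(|e| e.to_str()) {
            if let Some(grammar) = self.by_extension.get(ext) {
                return Some(*grammar);
            }
        }
        self.by_first_line
            .iter()
            .find(|(prefix, needle, _)| {
                first_line.starts_with(*prefix) && first_line.contains(*needle)
            })
            .map(|(_, _, grammar)| *grammar)
    }
}
```

With this shape, `resolve(Path::new("scripts/deploy"), "#!/usr/bin/env bash")` yields the Bash grammar even though the path has no extension, while an unknown extension with no matching first line yields `None`.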

No network calls, no download-on-demand, no shell-out to the tree-sitter CLI. Grammars are part of the binary. This matters for reproducibility (the same Aura version produces the same parses on every machine) and for air-gapped deployments (common for enterprise installs).

The parse pipeline

For a given file:

source_bytes
    -> tree_sitter::Parser::parse(source, None) -> tree_sitter::Tree
    -> AuraAdapter::walk(tree)                  -> aura::AstNode[]
    -> AuraAdapter::identify(nodes)              -> (id, node)[]
    -> ready for diff

Incremental re-parse for a subsequent edit:

prev_tree + edit_range
    -> prev_tree.edit(&input_edit)   (tell the old tree which bytes changed)
    -> tree_sitter::Parser::parse(new_source, Some(&prev_tree)) -> tree_sitter::Tree
    -> only changed subtrees are re-walked
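In tree-sitter's Rust binding, the previous tree has to be told about the edit (via `Tree::edit` with an `InputEdit`) before it is handed back to `Parser::parse`. The sketch below shows how the byte offsets and row/column points for such an edit could be derived from the old and new source. It is dependency-free: `EditDescriptor` and `Point` are local types that mirror the field names of `tree_sitter::InputEdit` and `tree_sitter::Point`, and `describe_edit` and `point_at` are hypothetical helper names, not Aura functions.

```rust
/// Row/column position. The column is a byte offset within the row,
/// which is how tree-sitter counts columns.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Point {
    pub row: usize,
    pub column: usize,
}

/// Local mirror of the fields of `tree_sitter::InputEdit`.
#[derive(Debug, PartialEq, Eq)]
pub struct EditDescriptor {
    pub start_byte: usize,
    pub old_end_byte: usize,
    pub new_end_byte: usize,
    pub start_position: Point,
    pub old_end_position: Point,
    pub new_end_position: Point,
}

/// Row/column of byte offset `byte` within `source`.
pub fn point_at(source: &str, byte: usize) -> Point {
    let before = &source.as_bytes()[..byte];
    let row = before.iter().filter(|&&b| b == b'\n').count();
    let column = match before.iter().rposition(|&b| b == b'\n') {
        Some(newline) => byte - newline - 1,
        None => byte,
    };
    Point { row, column }
}

/// Describe replacing `old[start..old_end]` with `inserted_len` bytes,
/// where `new` is the source after the replacement.
pub fn describe_edit(
    old: &str,
    new: &str,
    start: usize,
    old_end: usize,
    inserted_len: usize,
) -> EditDescriptor {
    let new_end = start + inserted_len;
    EditDescriptor {
        start_byte: start,
        old_end_byte: old_end,
        new_end_byte: new_end,
        start_position: point_at(old, start),
        old_end_position: point_at(old, old_end),
        new_end_position: point_at(new, new_end),
    }
}
```

Start and old-end positions are measured in the old source, and the new-end position in the new source. Getting that split wrong is a common cause of incremental parses that silently diverge from a full re-parse.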

Aura caches parse trees per file for the duration of a merge session. Ancestor, ours, and theirs are each parsed once and reused across every diff, resolution preview, and validation check.
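A session-scoped cache with this behavior might look like the following. It is a sketch under assumptions: `ParseCache`, `Side`, and `get_or_parse` are illustrative names, the tree type is left generic, and the `parse_count` field exists only to make the parse-once behavior observable.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Side {
    Ancestor,
    Ours,
    Theirs,
}

/// Holds one parse result per (path, side) for the lifetime of a merge session.
/// `T` stands in for whatever tree type the parser returns.
pub struct ParseCache<T> {
    trees: HashMap<(String, Side), T>,
    /// Number of times a parse closure actually ran.
    pub parse_count: usize,
}

impl<T> ParseCache<T> {
    pub fn new() -> Self {
        Self { trees: HashMap::new(), parse_count: 0 }
    }

    /// Return the cached tree, running `parse` only on the first request
    /// for a given (path, side) pair.
    pub fn get_or_parse(&mut self, path: &str, side: Side, parse: impl FnOnce() -> T) -> &T {
        let count = &mut self.parse_count;
        self.trees.entry((path.to_string(), side)).or_insert_with(|| {
            *count += 1;
            parse()
        })
    }
}
```

Every later consumer (diff, resolution preview, validation) calls `get_or_parse` with the same key and gets the stored tree back without invoking the parser again.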

Handling parse errors

A tree-sitter tree can include error nodes. Aura classifies a parse result as:

| Status      | Criteria                                               | Action                                                                 |
|-------------|--------------------------------------------------------|------------------------------------------------------------------------|
| clean       | Zero error / missing nodes                             | Full AST merge.                                                        |
| recoverable | Errors localized to subtrees that neither side edited  | Full AST merge on unaffected regions; text merge on affected subtrees. |
| failed      | Errors in subtrees that at least one side edited       | Fall back to text merge for the file; emit a diagnostic.               |

This grading is per-file and per-side. A file that parses clean on ancestor and ours but fails on theirs is handled as a text merge with a parse/parse conflict reason (see conflict resolution).
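The grading rules can be expressed as a small pure function. The sketch below assumes subtrees are represented as byte ranges and that "edited" means overlapping a range changed by either side. Those representations, and the names `classify` and `strategy`, are assumptions for illustration rather than Aura's internal types.

```rust
use std::ops::Range;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ParseStatus {
    Clean,
    Recoverable,
    Failed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MergeStrategy {
    FullAst,
    AstWithTextRegions,
    TextOnly,
}

fn overlaps(a: &Range<usize>, b: &Range<usize>) -> bool {
    a.start < b.end && b.start < a.end
}

/// Grade one side of one file.
/// `error_subtrees`: byte ranges of subtrees containing ERROR or MISSING nodes.
/// `edited_subtrees`: byte ranges of subtrees that either merge side changed.
pub fn classify(
    error_subtrees: &[Range<usize>],
    edited_subtrees: &[Range<usize>],
) -> ParseStatus {
    if error_subtrees.is_empty() {
        return ParseStatus::Clean;
    }
    let touches_edit = error_subtrees
        .iter()
        .any(|err| edited_subtrees.iter().any(|edit| overlaps(err, edit)));
    if touches_edit {
        ParseStatus::Failed
    } else {
        ParseStatus::Recoverable
    }
}

/// Combine the grades for ancestor, ours, and theirs into a per-file strategy.
/// Any failed side forces a text merge for the whole file.
pub fn strategy(sides: [ParseStatus; 3]) -> MergeStrategy {
    if sides.contains(&ParseStatus::Failed) {
        MergeStrategy::TextOnly
    } else if sides.contains(&ParseStatus::Recoverable) {
        MergeStrategy::AstWithTextRegions
    } else {
        MergeStrategy::FullAst
    }
}
```

A file that is clean on ancestor and ours but failed on theirs maps to `TextOnly`, matching the per-side rule stated above.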

Accuracy guarantees

Tree-sitter grammars are expected to be correct on well-formed input. Aura backstops this with:

  • Round-trip tests. For every supported language, a corpus of real open-source files is parsed, pretty-printed, re-parsed, and compared. Structural equality is required; whitespace differences are allowed.
  • Golden merge tests. A directory of before/after merge scenarios covers common patterns per language. Any grammar update must pass the full golden suite before merging.
  • Runtime self-check. On first use per session, Aura re-parses its own pretty-printed output of each file it touches; if structures diverge, it falls back to text merge for safety.

Coverage is not 100%. Extremely new syntax (often at the edge of stable language versions) occasionally parses into broader node kinds than the adapter expects, producing a parse failure and a text-merge fallback. Adapter health (aura doctor --adapters) surfaces the current failure rate.

Injections: languages inside languages

Tree-sitter supports injections — parsing one language inside a region of another. Aura uses this for:

  • JSX and TSX (JavaScript/TypeScript with embedded JSX).
  • Vue single-file components (<template> HTML, <script> JS/TS, <style> CSS).
  • Svelte components.
  • Rails ERB (HTML with embedded Ruby).
  • Markdown with fenced code blocks (see below).
  • Comments with doc-string sub-languages (rustdoc, JSDoc, pydoc).

Injected regions get their own parse trees and participate in merge as first-class subtrees. An edit inside a <script> block of a Vue file is identified as a TypeScript function modification, not an anonymous text change.

Markdown as a special case

Markdown is parsed by tree-sitter-markdown into a block-level tree: headings, paragraphs, lists, fences, tables. Code fences are injected and parsed by the appropriate language grammar (javascript, rust, etc.).

This gives Aura block-level merge for prose documents and AST-level merge for embedded code examples. Two authors editing different paragraphs of the same README never conflict. Two authors editing the same fenced Python example get AST merge on the code.

Examples

A: Parser health in the terminal

$ aura doctor --adapters --verbose
checking 14 language adapters...

rust       [clean]  tree-sitter-rust       0.21.2   parsed 1,284 files (2 recoverable)
typescript [clean]  tree-sitter-typescript 0.20.4   parsed 2,107 files (0 failed, 7 recoverable)
python     [clean]  tree-sitter-python     0.20.4   parsed 416 files
go         [clean]  tree-sitter-go         0.20.0   parsed 308 files
java       [clean]  tree-sitter-java       0.20.2   parsed 91 files
yaml       [clean]  tree-sitter-yaml       0.5.0    parsed 54 files (1 recoverable)
json       [clean]  tree-sitter-json       0.20.0   parsed 198 files
markdown   [clean]  tree-sitter-markdown   0.3.0    parsed 73 files
bash       [clean]  tree-sitter-bash       0.20.4   parsed 22 files
kotlin     [warn ]  tree-sitter-kotlin     0.3.7    parsed 14 files (1 failed) ← out-of-date grammar
...

suggestions:
  - bump tree-sitter-kotlin to 0.3.8 to support `context receivers` syntax

B: Recoverable parse on a mid-edit file

A developer runs aura merge while their working tree has a partially-written function:

function charge(amount: number) {
  if (amount <= 0)
    throw new Error("
    // ... user was mid-typing when they triggered merge
}

Tree-sitter produces a tree with an ERROR node inside charge's body. How Aura grades the file depends on what the merge touched. If a side edited charge, the parse is failed and the whole file falls back to text merge. If neither side edited charge, the parse is recoverable: the rest of the file merges at AST level, charge is handled by text merge, and Aura emits a diagnostic:

src/billing.ts: recoverable parse error inside `charge` (line 12)
  merge proceeded at AST level for 8 other declarations
  `charge` resolved with text merge

C: Incremental parse during an edit session

In live-sync mode, Aura keeps the parse tree hot:

t=0ms   initial parse of src/auth.ts (820 lines)   -> 3.1 ms
t=...   user types one line inside login()
t=...   incremental re-parse with edit range        -> 0.2 ms
t=...   delta computed: modify(login)               -> 0.1 ms
t=...   push to mothership                          -> < 1 ms

Incremental parsing is what makes real-time semantic sync practical. A full re-parse per keystroke would work for small files but fall over at 5k+ lines; tree-sitter's incremental mode keeps it flat.

D: Injection in a Vue file

<template>
  <button @click="submit">Pay</button>
</template>

<script setup lang="ts">
function submit() {
  charge(amount.value);
}
</script>

tree-sitter-vue parses the outer shell. The <script setup lang="ts"> region is injected into tree-sitter-typescript. Aura identifies submit as a TypeScript function with identity src/pay.vue#script/submit. A conflict on submit shows the TypeScript diff, not the whole <script> block.

Edge Cases

Grammar updates breaking identity. Occasionally a grammar update renames a node kind (class_declaration → class_definition). The adapter layer insulates Aura's core from this; a grammar bump ships with adapter patches and a migration note. Identity hashes are stable across these changes because they key on semantic fields (name, module path), not raw node kinds.
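To make the stability argument concrete, here is a minimal sketch of an identity hash that ignores the node kind. The `Declaration` struct, the choice of `DefaultHasher`, and the exact set of hashed fields are assumptions for illustration; the text above only states that identity keys on semantic fields such as name and module path.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical view of a declaration as produced by an adapter.
pub struct Declaration<'a> {
    /// Raw tree-sitter node kind; may change when a grammar is bumped.
    pub node_kind: &'a str,
    pub module_path: &'a str,
    pub name: &'a str,
}

/// Hash only the semantic fields. `node_kind` is deliberately left out,
/// so a grammar that renames the kind does not change the identity.
pub fn identity_hash(decl: &Declaration) -> u64 {
    let mut hasher = DefaultHasher::new();
    decl.module_path.hash(&mut hasher);
    decl.name.hash(&mut hasher);
    hasher.finish()
}
```

Under this scheme a class reported as `class_declaration` before a grammar bump and `class_definition` after it hashes to the same value, provided its module path and name are unchanged. A persisted identity would need a hasher with a stability guarantee across releases, which `DefaultHasher` does not promise.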

Very large files. Tree-sitter parses in memory. Single files above ~100MB can cause memory pressure, so Aura caps parse attempts at a configurable size (default 32MB per file). Files above the cap fall back to text merge with a diagnostic.

Files with mixed encodings. Tree-sitter expects UTF-8. Files in UTF-16 or legacy encodings are transcoded on input; byte offsets are translated accordingly in diagnostics.
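The offset translation is the subtle part: a byte offset reported against the transcoded UTF-8 text does not match the byte offset in the original file. The sketch below handles UTF-16LE only, ignores byte-order marks, and uses hypothetical function names. It shows one way the mapping could be computed, not how Aura implements it.

```rust
/// Decode UTF-16LE bytes into a UTF-8 `String`.
/// Returns `None` for odd-length input or invalid surrogate pairs.
/// Byte-order marks are not handled in this sketch.
pub fn transcode_utf16le(bytes: &[u8]) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}

/// Map a byte offset in the transcoded UTF-8 text back to a byte offset
/// in the original UTF-16 file. `utf8_offset` must be on a char boundary.
pub fn utf8_to_utf16_byte_offset(text: &str, utf8_offset: usize) -> usize {
    text[..utf8_offset]
        .chars()
        .map(|c| c.len_utf16() * 2) // 2 bytes per UTF-16 code unit
        .sum()
}
```

For text containing an accented letter and an emoji, the two offsets drift apart: each character before the target contributes its own UTF-8 width on one side and its UTF-16 width on the other, so the mapping has to walk characters rather than scale by a constant.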

Minified code. Minified JavaScript produces a valid but pathological AST — hundreds of thousands of comma-joined expressions in one statement. Aura detects this heuristically (average line length > 400 chars, lines > 10k chars) and defers to text merge.
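A pre-parse gate combining the size cap and the minification heuristic could look like this. Several details are assumptions: the two line-length thresholds are treated as alternatives (either one trips the heuristic), lengths are measured in bytes rather than characters, 32MB is read as 32 MiB, and the names `skip_ast_merge` and `SkipReason` are illustrative.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum SkipReason {
    TooLarge,
    LooksMinified,
}

/// Default per-file cap; reading "32MB" as 32 MiB is an assumption.
pub const DEFAULT_MAX_PARSE_BYTES: usize = 32 * 1024 * 1024;
const MAX_AVG_LINE_LEN: usize = 400;
const MAX_SINGLE_LINE_LEN: usize = 10_000;

/// Decide whether a file should bypass AST merge and go straight to text merge.
/// Returns `None` when the file is fine to parse.
pub fn skip_ast_merge(source: &str, max_bytes: usize) -> Option<SkipReason> {
    if source.len() > max_bytes {
        return Some(SkipReason::TooLarge);
    }

    let mut line_count = 0usize;
    let mut total_len = 0usize;
    let mut longest = 0usize;
    for line in source.lines() {
        line_count += 1;
        total_len += line.len();
        longest = longest.max(line.len());
    }
    if line_count == 0 {
        return None; // empty file: nothing to flag
    }

    let average = total_len / line_count;
    if average > MAX_AVG_LINE_LEN || longest > MAX_SINGLE_LINE_LEN {
        return Some(SkipReason::LooksMinified);
    }
    None
}
```

Both checks run before the parser is invoked, so a pathological file costs one linear scan instead of a full parse followed by a discarded tree.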

Custom grammars. Some organizations use internal DSLs with no public tree-sitter grammar. Aura supports loading external grammars from a local path via .aura/grammars.json, but such grammars are not linked into the binary and do not participate in round-trip tests shipped with Aura. A local health check runs on first use.

Grammar licensing. All tree-sitter grammars Aura links are MIT or Apache-2.0 licensed. The manifest is audited as part of every release.

Grammars that shell out. A small number of grammars historically shipped with a Node.js build step. Aura uses only pre-generated C sources, so neither building nor running Aura requires Node.js.

See Also