How AI Code Review Catches Bugs Across Files

May 6, 2026

Macroscope

Product

Most bugs don't live in a single diff. They live in the gap between the diff and everything else in the repository. That gap is why codebase-aware AI code review catches what diff-only tools miss.

Last updated: May 2026.

Most code review tools are looking at the wrong picture. They read the diff, reason about the diff, and write comments about the diff. The bugs that matter most, the ones that take a service down at 3am, almost never live in the diff alone. They live in the gap between the diff and everything else in the repository: the caller two files away, the type defined in a shared package, the helper that quietly assumed a contract the diff just changed.

Cross-file bug detection is what makes AI code review actually useful. It is the difference between a tool that reads pull requests and a tool that understands them.

TL;DR

Single-diff review misses the bugs that matter (contract drift, type-graph ripple, concurrency invariants), because they live outside the diff.

Macroscope reads the full repository when reviewing a pull request. The diff is one input; the codebase is the rest.

48% detection rate, 98% precision on the public 118-bug Code Review Benchmark, the highest of any AI code reviewer tested.

Deeper AST analysis on 8 languages (Python, TypeScript, JavaScript, Kotlin, Java, Rust, Swift, Go), repo-context reviews on every other language.

Usage-based pricing: $0.05/KB of diff reviewed, $100 in free usage on every new workspace, no per-seat fees, free for open source.

Quick Answers

Is AI code review actually worth it?

Yes, when the AI reviewer reads the full codebase, not just the diff. Macroscope detected 48% of 118 production bugs in a public benchmark with 98% precision (nearly every comment is a real bug). That precision is what determines whether AI code review is worth it: a noisy reviewer wastes time, a high-precision one saves it. See the public benchmark and our AI code review vs human code review breakdown.
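To make the two headline numbers concrete, here is a minimal sketch of how detection rate and precision are usually defined. The counts passed in are hypothetical, chosen only so the results land near the published percentages; they are not taken from the benchmark write-up.

```python
def detection_rate(bugs_detected: int, bugs_total: int) -> float:
    """Share of known bugs the reviewer flagged (often called recall)."""
    return bugs_detected / bugs_total


def precision(true_positives: int, false_positives: int) -> float:
    """Share of review comments that point at a real bug."""
    return true_positives / (true_positives + false_positives)


# Hypothetical counts for illustration only.
print(f"{detection_rate(57, 118):.0%}")  # 48%
print(f"{precision(49, 1):.0%}")         # 98%
```

A reviewer can score well on one number and badly on the other, which is why the two are reported together.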

What is cross-file bug detection?

The ability to identify bugs whose root cause spans more than the files changed in a pull request. A function signature changes in one file; a caller in another file silently breaks. A diff-only reviewer cannot see that. A codebase-aware reviewer can.
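A small sketch of that failure mode, with two hypothetical files (mathutil.py and volume.py) shown in one listing. The function names and the argument reordering are invented for illustration.

```python
# --- mathutil.py (changed in the pull request) ---
# Old signature: clamp(value, low, high)
def clamp(low: float, high: float, value: float) -> float:
    """Limit value to the closed range [low, high]."""
    return max(low, min(high, value))


# --- volume.py (NOT touched by the pull request) ---
def set_volume(requested: float) -> float:
    # Still passes arguments in the old (value, low, high) order.
    return clamp(requested, 0.0, 100.0)


print(set_volume(50.0))   # 50.0, happens to look correct
print(set_volume(150.0))  # 150.0, should have been capped at 100.0
```

Nothing raises, and in-range inputs still return the right value, so a test suite that only covers typical volumes keeps passing. The problem only shows up to a reviewer who opens volume.py.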

How is AI code review different from static analysis?

Static analysis tools match patterns against code (regex, AST patterns, DSL rules) and find syntax-level issues. AI code reviewers reason semantically with the repository as context, so they catch contract drift and ripple effects that pattern matching cannot. Most teams use both.

Does AI code review replace human review?

No, it changes what humans focus on. AI catches the structural bugs and routine issues; humans focus on architecture, design, and judgment calls. Google's engineering practices guide frames code review as fundamentally about humans understanding the change. AI removes the parts that do not require human judgment.

Why Single-Diff Review Misses Cross-File Bugs (and How Codebase-Aware AI Catches Them)

Pull requests are a slice. They show what changed, not what depends on what changed. A 200-line diff might rename a struct field, refactor an error path, or tighten a function signature. The diff itself can look fine, and the bug is sitting in a file the PR doesn't touch.

A few common bug shapes that single-diff review can't see:

Caller-callee contract drift. A function signature changes; one of the seventeen callers still relies on the old contract.
Type-graph ripple. A struct field is renamed in one file. A serializer in another file builds JSON keyed off the old name. Production payloads now miss a field.
Conditional unreachability. The diff adds a new case, such as an enum value. A switch in a different file is supposed to handle every case, and the new one isn't covered.
Concurrency invariants. A lock pattern changes in one file; a helper in another file assumed the old pattern. A race appears under load.
Configuration coupling. A default changes in one file; a migration script in another directory relied on the old default and silently writes wrong values.

A reviewer that only reads the diff will not catch these. A senior human reviewer might, by remembering the rest of the codebase. Codebase-aware review is the same trick, automated, applied consistently to every PR.
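The type-graph ripple from the list above can be sketched in a few lines. The Invoice model, its field names, and the two file names are hypothetical.

```python
import json
from dataclasses import dataclass


# --- models.py (changed in the pull request) ---
@dataclass
class Invoice:
    id: int
    amount_cents: int  # renamed from "total_cents" in the PR


# --- export.py (NOT touched by the pull request) ---
EXPORTED_FIELDS = ("id", "total_cents")  # still lists the old field name


def to_json(invoice: Invoice) -> str:
    # Fields the object no longer has are skipped instead of raising.
    payload = {
        name: getattr(invoice, name)
        for name in EXPORTED_FIELDS
        if hasattr(invoice, name)
    }
    return json.dumps(payload)


print(to_json(Invoice(id=7, amount_cents=1250)))  # {"id": 7}
```

The diff to models.py is a clean rename. The missing amount only appears in the payload produced by a file the PR never touched.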

This is well documented in software engineering research. Bacchelli and Bird's seminal paper Expectations, Outcomes, and Challenges of Modern Code Review found that the primary value of code review is "understanding," which requires reasoning about how the change interacts with the rest of the system, not just inspecting the diff. The annual Stack Overflow Developer Survey consistently shows code review as the most-valued quality practice on engineering teams. The mechanic that delivers that value, in both human and AI reviewers, is full-system context.

Key Features of Macroscope's Cross-File AI Code Review

Macroscope's cross-file detection rests on a few specific capabilities. Each is what makes the difference between catching a structural bug and missing it.

Reference graph across the whole repository

Macroscope builds an Abstract Syntax Tree (AST) for every supported file, then assembles a reference graph that maps how every function, class, type, and variable relates to every other across the codebase. When a pull request changes a function, the reviewer can trace every caller, every dependent, and every type constraint in one pass.
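As a rough illustration of the idea (not Macroscope's implementation), the sketch below uses Python's standard ast module to map each called function name to the files that call it. It matches by name only, with no scope or import resolution, and the two-file repository is hypothetical.

```python
import ast
from collections import defaultdict


def build_call_graph(sources: dict[str, str]) -> dict[str, set[str]]:
    """Map each called function name to the set of files that call it."""
    callers: dict[str, set[str]] = defaultdict(set)
    for path, code in sources.items():
        tree = ast.parse(code, filename=path)
        for node in ast.walk(tree):
            if isinstance(node, ast.Call):
                func = node.func
                if isinstance(func, ast.Name):         # charge(...)
                    callers[func.id].add(path)
                elif isinstance(func, ast.Attribute):  # billing.charge(...)
                    callers[func.attr].add(path)
    return dict(callers)


# Hypothetical two-file repository.
repo = {
    "billing.py": "def charge(user, cents):\n    return cents\n",
    "checkout.py": (
        "from billing import charge\n\n"
        "def pay(u):\n    return charge(u, 500)\n"
    ),
}
graph = build_call_graph(repo)
print(sorted(graph["charge"]))  # ['checkout.py']
```

With a map like this, a change to charge in billing.py immediately points the reviewer at checkout.py, even though that file is not in the diff.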

Native AST codewalkers for eight languages

Macroscope has dedicated parsers for Python, TypeScript, JavaScript, Kotlin, Java, Rust, Swift, and Go, each built against its language's specification (TC39 for JavaScript, PEPs for Python, the Go specification for Go, and so on). These languages get the deepest structural detection. In the public 118-bug benchmark, Macroscope detected 86% of Go bugs and 56% of Java bugs, the languages where AST parsing matters most.

Repo-context reviews on every other language

For languages without a native codewalker, Macroscope still reads the relevant files, not only the diff. Agents have full read access to the codebase and can navigate references, follow imports, grep, and read configuration. Cross-file context is preserved, just without the same structural depth.

Approvability for low-risk PRs

When the reviewer can confidently say "this PR has issues," it can also confidently say "this PR has none." Approvability auto-approves PRs whose blast radius is contained, dissolving queue time on the routine half of the backlog. Opt-in per repo, tunable per file pattern.

Fix It For Me with CI iteration

When Macroscope finds a bug, replying fix it for me opens a new branch, implements the fix, opens a pull request, runs your GitHub Actions CI, reads failure logs, and iterates until tests pass. No other AI code reviewer runs a CI retry loop on its own fixes.

Check Run Agents for custom enforcement

Custom rules are defined in plain-English Markdown files under .macroscope/check-run-agents/. Each agent runs as its own GitHub Check Run and can block merges via branch protection. Every agent gets full repository context, not just the diff.

Usage-based pricing aligned with work, not seats

$0.05 per KB of diff reviewed, ~$0.95 historical average per review, $10 per-review and $50 per-PR caps. $100 in free usage on every new workspace, no card required. Free for open source. As AI coding agents push more pull requests per developer, per-seat tools quietly get more expensive per unit of work; usage-based pricing tracks the workload directly. Full breakdown: Usage-Based Pricing for Developer Tools.
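The arithmetic behind those numbers can be sketched as follows. This assumes the caps act as simple upper limits and that one PR can accumulate several reviews; how the billing system actually applies them is not specified here.

```python
from decimal import Decimal

RATE_PER_KB = Decimal("0.05")  # dollars per KB of diff reviewed
REVIEW_CAP = Decimal("10")     # maximum charge for a single review
PR_CAP = Decimal("50")         # maximum charge across one pull request


def review_cost(diff_kb: Decimal) -> Decimal:
    """Cost of one review, limited by the per-review cap."""
    return min(diff_kb * RATE_PER_KB, REVIEW_CAP)


def pr_cost(review_sizes_kb: list[Decimal]) -> Decimal:
    """Total for every review on one PR, limited by the per-PR cap."""
    total = sum((review_cost(kb) for kb in review_sizes_kb), Decimal("0"))
    return min(total, PR_CAP)


print(review_cost(Decimal("19")))      # 0.95
print(review_cost(Decimal("400")))     # 10
print(pr_cost([Decimal("400")] * 6))   # 50
```

At this rate, the ~$0.95 average corresponds to a diff of about 19 KB.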

Macroscope reads the whole repo, not just the diff

When a pull request lands on a repo where Macroscope is installed, the reviewer doesn't start with the diff. It starts with the repository. Agents have full read access to the codebase: they can navigate references, follow imports, grep, read configuration files, and look at git history. The diff is one input; the rest of the codebase supplies the others.

That's the part most diff-only tools skip. It's also the part that determines whether a comment ships with real context or with a guess.

Deeper analysis on eight languages

Macroscope reviews pull requests in any codebase. For eight languages (Python, TypeScript, JavaScript, Kotlin, Java, Rust, Swift, and Go) there is a second layer underneath: native AST analysis that understands the code at the structural level, mapping how functions, types, and modules connect across files.

For repos in those eight languages, that extra structural depth is what surfaces cross-file ripples reliably: signature changes that break callers, type renames that don't propagate, control-flow gaps that span files. For repos in other languages, Macroscope still reviews every PR with full repo context: agents read the relevant files, not only the diff, just without the same structural depth.
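A control-flow gap that spans files might look like the sketch below. The enum, the message table, and the file names are hypothetical; the check at the end is the kind of question a structural pass can answer mechanically.

```python
from enum import Enum


# --- status.py (changed in the pull request) ---
class OrderStatus(Enum):
    PENDING = "pending"
    SHIPPED = "shipped"
    REFUNDED = "refunded"  # new member added by the PR


# --- notifications.py (NOT touched by the pull request) ---
MESSAGES = {
    OrderStatus.PENDING: "We received your order.",
    OrderStatus.SHIPPED: "Your order is on its way.",
}


def unhandled_statuses() -> set[OrderStatus]:
    """Return enum members that the message table does not cover."""
    return set(OrderStatus) - set(MESSAGES)


print([status.name for status in unhandled_statuses()])  # ['REFUNDED']
```

The PR only adds one line to the enum. The uncovered case sits in a lookup table elsewhere, and it fails only when the first refunded order needs a notification.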

A small example

The cleanest way to see why cross-file analysis matters is to look at a real bug. From the public Macroscope code review benchmark dataset:

Repository: apache/commons-math (Java)
Bug: MathUtils.gcd detected zero operands by testing u * v == 0. For non-zero inputs like 65536 × 65536, the multiplication overflows to zero (a documented behavior of Java's signed int per the Java Language Specification), the method takes the zero-case path, and returns the wrong answer.

Read the diff alone, and the bug is hard to see. The author plausibly meant u * v == 0 as a shortcut for "either is zero." Most unit tests pass. The flaw is structural (integer overflow can produce a false zero) and it is only catchable if the reviewer knows enough about the surrounding types and call sites to recognize the input ranges that break it.
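The overflow is easy to reproduce. Python integers do not overflow, so this sketch emulates Java's 32-bit signed int with a hypothetical to_int32 helper; the two check functions are illustrative names, and the direct operand test is one obvious alternative rather than a claim about the upstream fix.

```python
def to_int32(n: int) -> int:
    """Wrap an unbounded Python int into Java's 32-bit signed int range."""
    n &= 0xFFFFFFFF
    return n - 0x1_0000_0000 if n >= 0x8000_0000 else n


def has_zero_operand_buggy(u: int, v: int) -> bool:
    """Mirrors the flawed check: compares the 32-bit product to zero."""
    return to_int32(u * v) == 0


def has_zero_operand(u: int, v: int) -> bool:
    """Tests each operand directly, so overflow cannot interfere."""
    return u == 0 or v == 0


print(has_zero_operand_buggy(65536, 65536))  # True, although neither is zero
print(has_zero_operand(65536, 65536))        # False
```

65536 × 65536 is exactly 2^32, which wraps to 0 in 32 bits, so the flawed check treats two non-zero inputs as the zero case.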

A diff-only reviewer guesses. A codebase-aware reviewer does not have to.

This is a representative example, not a cherry-picked one. The public 118-bug benchmark contains 117 more bugs drawn from real open-source repositories across 8 languages. The methodology, per-tool detection rates, and per-language breakdowns are published in full.

How this connects to the rest of Macroscope

Cross-file context isn't only for bug-finding. Several Macroscope features ride on top of it.

Approvability. When the reviewer can confidently say "this PR has issues," it can also confidently say "this PR has none." Approvability auto-approves low-risk PRs whose blast radius the system judges as contained. Opt-in per repo and tunable per file pattern, it dissolves queue time on the trivial half of the backlog.
Check Run Agents. Custom rules in .macroscope/check-run-agents/*.md are written in plain English and enforced as real GitHub Check Runs. Each agent gets the same codebase context as the default reviewer, so a rule like "always log on this code path" can be checked across the repo, not just inside the diff.
The Macroscope Agent. When a change benefits from research instead of just review, the agent explores the codebase and answers questions about it: where a behavior is implemented, why a refactor is risky, what surfaces a given module touches.
Context-aware review. Cross-file context is the first layer. The Agent also pulls production context (Sentry, Datadog, Amplitude) and ticket intent (Jira, Linear) into the review, so a finding can reflect not just the rest of the repo but how the changed code behaves for real users.

The product surface is broader than the comment thread. The codebase-awareness underneath is what makes all of it work.

Try it on your own codebase

The fastest way to see what a codebase-aware reviewer surfaces is to run it on your code.

Install Macroscope on a GitHub repository in under two minutes.
New workspaces get $100 in free usage.
Open a PR. Macroscope reviews it on default settings, with full repo context behind every comment.
Add Check Run Agents in .macroscope/check-run-agents/*.md to enforce your team's conventions.
Turn on Approvability if you want auto-approval for low-risk PRs.

There are no seat fees. You pay for the work Macroscope actually does.

Frequently Asked Questions

Is AI code review actually worth it?

Yes, when the reviewer is high-precision and reads the full codebase. In the public 118-bug benchmark, Macroscope detected 48% of production bugs at 98% precision, meaning nearly every comment is a real, actionable issue. A low-precision reviewer wastes engineering time on false positives; a high-precision one saves time on the bugs it catches before merge. The deciding factor for whether AI code review is worth it on your team is the precision number, not the existence of AI at all. The Stack Overflow Developer Survey consistently ranks code review as the most-valued quality practice; AI raises the floor of what gets reviewed.

How does AI code review compare to human code review?

Complementary, not replacement. AI handles the consistent structural pass (cross-file ripples, contract drift, type-graph correctness) on every PR. Humans handle architectural judgment, design decisions, and product context. We cover the comparison in depth in AI Code Review vs Human Code Review. The classic research framing is Bacchelli and Bird's finding that the primary value of human review is understanding; AI handles the part of review that is structural verification, freeing humans for the understanding.

Which AI code review tool catches the most bugs?

On the public 118-bug benchmark, Macroscope detected the most: 48% of bugs at 98% precision, beating CodeRabbit (46%), Cursor BugBot (42%), Greptile (24%), and Graphite Diamond. The full per-language breakdown and methodology are published at /blog/code-review-benchmark.

Is AI code review free?

Macroscope is free for open-source repositories and gives every new workspace $100 in free usage (no card required), which typically covers a few weeks of real PRs before you pay anything. Beyond that, Macroscope is usage-based at $0.05 per KB of diff reviewed. Per-seat tools like CodeRabbit ($24/dev/mo) charge whether code is reviewed or not.

What is cross-file bug detection?

Cross-file bug detection is the ability of an AI code review tool to identify bugs whose root cause spans more than the files changed in a pull request. The diff might look fine, but a caller in a different file relies on the old behavior, a type definition somewhere else doesn't match the new shape, or a config consumer is reading the wrong default. Cross-file detection requires the reviewer to reason about the codebase, not just the diff.

Does Macroscope only review the languages it has deep support for?

No. Macroscope reviews pull requests in any codebase. Agents have full read access to the repository, so reviews happen with broader context than the diff regardless of language. For eight languages (Python, TypeScript, JavaScript, Kotlin, Java, Rust, Swift, and Go) there is a deeper structural layer underneath that surfaces cross-file ripples (signature changes, type renames, control-flow gaps) more reliably.

Which languages does Macroscope have deep support for?

Python, TypeScript, JavaScript, Kotlin, Java, Rust, Swift, and Go. In those languages, Macroscope analyzes code at the structural level (understanding how functions, types, and modules connect) in addition to the LLM-driven review every codebase gets.

Why does this matter compared to a tool that only reads the diff?

Most production bugs aren't visible in the diff alone. They appear when the change interacts with something elsewhere in the repository: a caller in another file, a type defined in a shared package, a config consumer two directories away. A reviewer that only reads the diff can't see those interactions. A codebase-aware reviewer can.

How is this different from static analysis tools?

Static analyzers match patterns against code (regex, AST patterns, or DSL rules). They find syntax-level issues like missing nil checks or deprecated API usage. Macroscope is a codebase-aware AI code reviewer: it reasons about your code semantically with the repository as context, so it catches things pattern matching can't (contract drift, ripple effects across files, semantic bugs). Most teams keep their static analyzers alongside Macroscope; the two solve different problems.

What is Approvability?

Approvability is a Macroscope feature that auto-approves PRs the system can confidently classify as safe (small, low-risk changes whose blast radius is contained). It's the inverse signal of high-precision detection: if the reviewer is confident it can flag bugs, it can also be confident when there are none. Opt-in per repository and tunable per file pattern.

What are Check Run Agents?

Check Run Agents are how teams enforce their own conventions in Macroscope. Each agent is a Markdown file in .macroscope/check-run-agents/ that describes a custom rule in plain English. The agent runs as its own GitHub Check Run on every PR, sees the full repository context, and can block merge on failure. Writing one is closer to writing a review note for a teammate than configuring a linter.

How does Macroscope review every PR if it doesn't only look at the diff?

Macroscope's review pipeline reads the diff, but it also reads relevant parts of the surrounding repository: the files that contain callers, definitions, related logic, and configuration tied to the change. The result is a review grounded in the codebase, not in the diff in isolation.

How much does Macroscope cost?

$0.05 per KB of diff reviewed, with a ~$0.95 historical average per review and spend caps of $10 per review and $50 per PR. Every new workspace starts with $100 in free usage (no card required). Open-source repositories are free. There are no per-developer seat fees, so adding a part-time contributor costs almost nothing, and PR-volume growth from AI coding agents tracks directly to cost instead of triggering per-seat overages.

Where can I read the full benchmark methodology?

The public 118-bug Code Review Benchmark writes up the dataset (118 self-contained runtime bugs from 45 open-source repositories in 8 languages), the per-tool detection rates, the per-language breakdowns, and the rubric used to classify comments as runtime-relevant.

Does Macroscope train on my code?

No. Macroscope does not train models on customer source code, and its agreements with OpenAI and Anthropic prohibit those providers from training on Macroscope customer data. Full details on the public trust center.