AI Code Review Agent

An autonomous AI agent that reviews GitHub Pull Requests for bugs, security vulnerabilities, and code quality issues — with built-in evaluation, observability, and honest benchmarks.

Live Dashboard · Project Blog · How It Works

What This Is

An end-to-end AI agent system with evaluation, observability, and benchmarking that demonstrates how LLM-based code review works in practice — and where it breaks.

This is not a CodeRabbit competitor. It's a portfolio project that honestly explores the capabilities and limitations of using LLMs for automated code review, backed by reproducible metrics and transparent failure analysis.

The agent runs as a GitHub Action, analyzes each file in a PR independently using Llama 3.3 70B via Groq, and posts structured review comments with severity levels, confidence scores, and suggested fixes. Every LLM call is traced with Langfuse for full observability.

Architecture

flowchart LR
    PR[Pull Request] --> GHA[GitHub Actions]
    GHA --> FETCH[Fetch PR Diff]
    FETCH --> PIPELINE

    subgraph PIPELINE[LangGraph Pipeline]
        direction LR
        A[parse_diff] --> B[filter_files]
        B --> C[analyze_files]
        C --> D[aggregate]
        D --> E[format_review]
    end

    PIPELINE --> POST[Post Review Comment]
    PIPELINE -.-> LF[Langfuse Tracing]

Node	What It Does
`parse_diff`	Split unified diff into per-file chunks
`filter_files`	Skip non-code files (.lock, .md, .json, .yaml, images)
`analyze_files`	LLM review per file via Llama 3.3 70B
`aggregate`	Validate and combine findings via Pydantic
`format_review`	Render GitHub markdown comment grouped by severity

Pipeline State flows through a ReviewState TypedDict:

pr_url → raw_diff → file_diffs → filtered_files → findings → summary → formatted_review

Each node is decorated with @observe for Langfuse tracing. The analyze_files node calls the LLM once per file with a structured JSON schema, then validates output through a 3-tier extraction fallback (direct parse → markdown fence extraction → regex).

How It Works

1. PR opened/updated  →  2. GitHub Actions triggers  →  3. LangGraph pipeline  →  4. Review comment posted

A PR is opened or updated on a repository with the GitHub Actions workflow installed
GitHub Actions triggers scripts/run_review.py, which fetches the unified diff via the GitHub API
The LangGraph pipeline processes the diff through 5 nodes:
- Parses the diff into per-file chunks
- Filters out non-code files (lockfiles, images, config)
- Sends each code file to Llama 3.3 70B for analysis with a structured prompt
- Validates and aggregates findings using Pydantic models
- Formats a markdown review grouped by severity
The agent posts a review comment on the PR with findings organized as critical / warning / suggestion, each with file location, category, confidence score, and a suggested fix

Run locally against any public PR:

python -m src.agent.main --pr-url "https://github.com/owner/repo/pull/123"

Evaluation Results

The agent is evaluated against a benchmark suite of 10 synthetic PR diffs containing 30 known bugs across security vulnerabilities, logic errors, and common bug patterns.

Aggregate Metrics

Metric	Score
Precision	68.2%	How many findings are real issues
Recall	50.0%	How many real issues are found
F1 Score	57.7%	Harmonic mean of precision & recall
False Positive Rate	31.8%	~1 in 3 findings is noise

Per-Category Accuracy

Category	Expected	Detected	Accuracy
Security	21	10	`47.6%`
Bug	6	3	`50.0%`
Logic	3	2	`66.7%`

Benchmark cost: 52,340 tokens · 94.7s latency · ~$0.031 estimated cost at Groq pricing

Per-benchmark breakdown

Benchmark	TP	FP	FN	Precision	Recall	Missed
`buffer_overflow`	2	0	1	100%	66.7%	sprintf without size limit
`hardcoded_secret`	3	1	2	75.0%	60.0%	JWT secret, SMTP password
`insecure_deserialization`	3	1	2	75.0%	60.0%	pickle.loads on backup/webhook
`logic_error`	2	1	1	66.7%	66.7%	OR instead of AND
`missing_validation`	2	1	2	66.7%	50.0%	Path traversal, input format
`sql_injection`	1	0	1	100%	50.0%	f-string query in get_user
`xss_vulnerability`	1	1	1	50.0%	50.0%	Stored XSS without escaping
`off_by_one`	1	0	2	100%	33.3%	Integer division, negative index
`null_reference`	0	1	1	0%	0%	NoneType subscript on payment
`race_condition`	0	1	2	0%	0%	check_and_increment, transfer

Limitations

This section exists because honest evaluation matters more than impressive numbers.

False positive rate of 32% — roughly 1 in 3 findings is noise. In a production setting, this erodes developer trust quickly.
Security recall at 48% — the agent misses more than half of security vulnerabilities. For context, GPT-4 scored just 13% on the SecLLMHolmes benchmark for real-world vulnerability detection. This agent uses a smaller model on synthetic examples, so the 48% figure is not transferable to production codebases.
Failure modes observed:
- Large files (>500 lines of diff) degrade quality as context fills up
- Complex multi-file logic bugs are missed because files are analyzed independently
- Race conditions and concurrency bugs were completely missed (0% recall)
- The agent has no codebase context beyond the diff — it can't reason about call sites, types, or invariants
This is LLM-assisted triage, not automated scanning. The agent is best understood as a first-pass reviewer that catches surface-level issues and flags areas for human attention. It does not replace SAST tools, linters, or human review.

Observability

Every pipeline run is traced with Langfuse, providing visibility into:

Per-node execution time and token usage
Per-file LLM call cost breakdown
Finding distribution across severity and category
Error tracking for failed parses or API issues

Explore traces interactively on the live dashboard (Observability tab).

# Export traces as JSON
python -m src.agent.main --pr-url <URL> --export-trace

Tech Stack

Tool	Purpose	Why This One
LangGraph	Agent orchestration	Explicit state machine with typed state — easier to debug and test than chain-based approaches
Groq (Llama 3.3 70B)	LLM inference	Free tier (1K req/day), fast inference, strong code understanding for an open model
Langfuse	Observability & tracing	Open-source, self-hostable, first-class LangGraph integration via `@observe`
Pydantic	Structured output	Type-safe finding models with automatic validation and serialization
GitHub Actions	CI/CD trigger	Zero infrastructure — runs on PR events without needing a webhook server
Streamlit	Live dashboard	Fastest path from Python to interactive web app for showcasing results
httpx	Async HTTP	Modern async client for GitHub API calls with proper error handling
pytest	Testing	Async test support, good fixture model for benchmark evaluation tests

Quick Start

Prerequisites

Python 3.11+
A Groq API key (free tier)
A GitHub personal access token (for reviewing private repos)

Setup

git clone https://github.com/rishimule/ai-code-review-agent.git
cd ai-code-review-agent
pip install -e .

cp .env.example .env
# Edit .env with your API keys:
#   GROQ_API_KEY=gsk_...
#   GITHUB_TOKEN=ghp_...          (optional, for private repos)
#   LANGFUSE_PUBLIC_KEY=pk-lf-... (optional, for tracing)
#   LANGFUSE_SECRET_KEY=sk-lf-... (optional, for tracing)

Run

# Review a PR
python -m src.agent.main --pr-url "https://github.com/owner/repo/pull/123"

# Run evaluation benchmarks
python -m src.eval.evaluator

# Compare models
python -m src.eval.compare_models --models llama-3.3-70b-versatile,llama-3.1-8b-instant

# Run tests
pytest

# Launch the dashboard locally
streamlit run dashboard/app.py

Deployment

To set up automated PR reviews on your own repository:

Fork this repository (or copy the workflow file)
Add repository secrets in Settings → Secrets and variables → Actions:
- GROQ_API_KEY — your Groq API key
Copy the workflow to your target repo:
```
.github/workflows/code-review.yml
```
Open a PR — the agent will automatically run and post a review comment

The workflow triggers on pull_request events (opened and synchronized) for code files only. It uses GITHUB_TOKEN provided by Actions for posting comments — no additional token setup needed for public repos.

Supported languages: Python, JavaScript, TypeScript, Go, Rust, Java, Ruby, C/C++, C#, PHP, Swift, Kotlin, Scala, Shell, SQL

Competitive Landscape

Tool	Approach
CodeRabbit	Commercial, full-repo context, incremental learning
GitHub Copilot Code Review	Integrated into GitHub, backed by GPT-4
Qodo PR-Agent	Open-source, multi-model, extensive customization

What this project demonstrates differently:

Evaluation rigor — A reproducible benchmark suite with ground truth, not just "it found some issues." The metrics are computed against known bugs with matching logic, and the results include false positives and missed findings.
Honest limitations — Production tools market recall and precision without publishing methodology. This project documents exactly what the agent catches, what it misses, and why.
Full observability — Every LLM call is traced with cost, latency, and token breakdowns. This demonstrates the engineering practice of treating LLM calls as observable operations, not black boxes.
Transparent architecture — The LangGraph pipeline is explicit about state transitions and can be inspected, tested, and extended node by node.

What I'd Do Differently / Next Steps

Accuracy improvements

SAST integration — Pipe Semgrep or CodeQL findings into the LLM prompt as grounding context
Codebase indexing — Build a vector index for retrieving type definitions, call sites, and invariants
Multi-pass review — A second LLM pass to self-critique and filter low-confidence noise

Feedback loop

Developer feedback collection — Track which findings developers dismiss vs. act on
Fine-tuning — With enough labeled data, fine-tune a smaller model on the specific task

Infrastructure

Caching — Cache analysis results for unchanged files across PR updates
Parallel file analysis — Concurrent analysis with a paid API tier
Webhook server — For high-volume orgs, a persistent service with queuing

Project Structure

src/
  agent/          LangGraph pipeline nodes and graph definition
  models/         Pydantic models for review findings
  github_client/  GitHub API interaction (fetching diffs, posting comments)
  prompts/        Prompt templates
  eval/           Evaluation harness and benchmark suite
  observability/  Langfuse tracing, cost tracking, trace export
benchmarks/       Known-buggy PR diffs and ground truth
dashboard/        Streamlit app
scripts/          GitHub Actions entry point
tests/            Unit and integration tests
.github/          GitHub Actions workflow

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
.devcontainer		.devcontainer
.github/workflows		.github/workflows
.streamlit		.streamlit
benchmarks		benchmarks
dashboard		dashboard
docs		docs
scripts		scripts
src		src
tests		tests
traces		traces
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AI Code Review Agent

What This Is

Architecture

How It Works

Evaluation Results

Aggregate Metrics

Per-Category Accuracy

Limitations

Observability

Tech Stack

Quick Start

Prerequisites

Setup

Run

Deployment

Competitive Landscape

What I'd Do Differently / Next Steps

Project Structure

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

AI Code Review Agent

What This Is

Architecture

How It Works

Evaluation Results

Aggregate Metrics

Per-Category Accuracy

Limitations

Observability

Tech Stack

Quick Start

Prerequisites

Setup

Run

Deployment

Competitive Landscape

What I'd Do Differently / Next Steps

Project Structure

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages