feat: path command, expanded benchmarks, docs updates #121
Merged
10 commits
360bdcc feat: add codegraph path for A→B symbol pathfinding (github-actions[bot])
bad02f6 docs: add Titan Paradigm use case, update docs with roles/co-change/path (github-actions[bot])
53f9a83 docs: restore Architecture Refactoring phase, fix references (github-actions[bot])
ab57fb6 fix: correct MCP tool counts and backlog ID collisions (github-actions[bot])
51fedb4 feat: add token savings benchmark (codegraph vs raw navigation) (github-actions[bot])
845e4c9 feat: extend benchmarks with incremental builds and expanded query co… (github-actions[bot])
1f6a2f4 ci: include version in automated benchmark commits and PRs (github-actions[bot])
b273961 fix: update remaining 19-tool references to 21-tool in README (github-actions[bot])
26c31d1 chore: merge main into feat/path-command (github-actions[bot])
d55f8ea chore: resolve merge conflict in BUILD-BENCHMARKS.md (github-actions[bot])
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,129 @@ | ||
| # Token Savings Benchmark | ||
|
|
||
| Quantifies how much codegraph reduces token usage when AI agents navigate large codebases, compared to raw file exploration (Glob/Grep/Read/Bash). | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| 1. **Claude Agent SDK** | ||
| ```bash | ||
| npm install @anthropic-ai/claude-agent-sdk | ||
| ``` | ||
|
|
||
| 2. **API key** | ||
| ```bash | ||
| export ANTHROPIC_API_KEY=sk-ant-... | ||
| ``` | ||
|
|
||
| 3. **Git** (for cloning Next.js) | ||
|
|
||
| 4. **codegraph** installed in this repo (`npm install`) | ||
|
|
||
| ## Quick Start | ||
|
|
||
| ```bash | ||
| # Smoke test — 1 issue, 1 run (~$2-4) | ||
| node scripts/token-benchmark.js --issues csrf-case-insensitive --runs 1 > result.json | ||
|
|
||
| # View the JSON | ||
| jq .aggregate result.json | ||
|
|
||
| # Generate the markdown report | ||
| node scripts/update-token-report.js result.json | ||
| cat docs/benchmarks/TOKEN-SAVINGS.md | ||
| ``` | ||
|
|
||
| ## Full Run | ||
|
|
||
| ```bash | ||
| # All 5 issues × 3 runs (~$30-60) | ||
| node scripts/token-benchmark.js > result.json | ||
| node scripts/update-token-report.js result.json | ||
| ``` | ||
|
|
||
| ## CLI Flags | ||
|
|
||
| | Flag | Default | Description | | ||
| |------|---------|-------------| | ||
| | `--runs <N>` | `3` | Number of runs per issue (medians used) | | ||
| | `--model <model>` | `sonnet` | Claude model to use | | ||
| | `--issues <id,...>` | all | Comma-separated subset of issue IDs | | ||
| | `--nextjs-dir <path>` | `$TMPDIR/...` | Reuse existing Next.js clone | | ||
| | `--skip-graph` | `false` | Skip codegraph rebuild (use existing DB) | | ||
| | `--max-turns <N>` | `50` | Max agent turns per session | | ||
| | `--max-budget <$>` | `2.00` | Max USD per session | | ||
| | `--perf` | `false` | Also run build/query perf benchmarks on the Next.js graph | | ||
|
|
||
| ## Available Issues | ||
|
|
||
| | ID | Difficulty | PR | Description | | ||
| |----|:----------:|---:|-------------| | ||
| | `csrf-case-insensitive` | Easy | #89127 | Case-insensitive CSRF origin matching | | ||
| | `ready-in-time` | Medium | #88589 | Incorrect "Ready in" time display | | ||
| | `aggregate-error-inspect` | Medium | #88999 | AggregateError.errors missing in output | | ||
| | `otel-propagation` | Hard | #90181 | OTEL trace context propagation broken | | ||
| | `static-rsc-payloads` | Hard | #89202 | Static RSC payloads not emitted/served | | ||
|
|
||
| ## Methodology | ||
|
|
||
| ### Setup | ||
| - **Target repo:** [vercel/next.js](https://github.com/vercel/next.js) (~4,000 TypeScript files) | ||
| - Each issue is a real closed PR with a known set of affected source files | ||
|
|
||
| ### Two conditions (identical except codegraph access) | ||
|
|
||
| **Baseline:** Agent has `Glob`, `Grep`, `Read`, `Bash` tools. No codegraph. | ||
|
|
||
| **Codegraph:** Agent has the same tools **plus** a codegraph MCP server providing structural navigation (symbol search, dependency tracking, impact analysis, call chains). | ||
|
|
||
| ### Controls | ||
| - Same model for both conditions | ||
| - Same issue prompt (bug description only — no hints about the solution) | ||
| - Checkout pinned to the commit *before* the fix (agent can't see the answer in git history) | ||
| - Same `maxTurns` and `maxBudgetUsd` budget caps | ||
|
|
||
| ### Metrics | ||
| - **Input tokens:** Total tokens sent to the model (primary metric) | ||
| - **Cost:** USD cost of the session | ||
| - **Turns:** Number of agent turns (tool-use round-trips) | ||
| - **Hit rate:** Percentage of ground-truth files correctly identified | ||
| - **Tool calls:** Breakdown by tool type | ||
|
|
||
| ### Statistical handling | ||
| - N runs per issue (default 3), median used to handle non-determinism | ||
| - Error runs are excluded from aggregation | ||
|
|
||
| ## Cost Estimate | ||
|
|
||
| | Scenario | Approximate cost | | ||
| |----------|----------------:| | ||
| | 1 issue × 1 run | $2-4 | | ||
| | 1 issue × 3 runs | $6-12 | | ||
| | 5 issues × 3 runs | $30-60 | | ||
|
|
||
| Costs depend on model choice and issue difficulty. The `--max-budget` flag caps the spend of each individual session (default `2.00` USD). | ||
|
|
||
| ## Adding New Issues | ||
|
|
||
| Edit `scripts/token-benchmark-issues.js` and add an entry to the `ISSUES` array: | ||
|
|
||
| ```js | ||
| { | ||
| id: 'short-slug', | ||
| difficulty: 'medium', // one of: 'easy', 'medium', 'hard' | ||
| pr: 12345, | ||
| title: 'PR title', | ||
| description: 'Bug description for the agent (no solution hints)', | ||
| commitBefore: 'abc123def...', // SHA before the fix | ||
| expectedFiles: ['packages/next/src/path/to/file.ts'], | ||
| } | ||
| ``` | ||
|
|
||
| Requirements: | ||
| - Use a real closed PR with a clear bug description | ||
| - `commitBefore` must be the parent of the merge commit (not the merge itself) | ||
| - `expectedFiles` should list only source files, not tests | ||
| - Verify the SHA exists: `git log --oneline <sha> -1` in the Next.js repo | ||
|
|
||
| ## Output Format | ||
|
|
||
| The runner outputs JSON to stdout. See [TOKEN-SAVINGS.md](TOKEN-SAVINGS.md) for the generated report. |
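The Metrics and Statistical handling sections of the README above describe hit rate and per-issue medians in prose only. A minimal sketch of that aggregation follows. It assumes each run is an object with the hypothetical fields `error`, `inputTokens`, and `filesFound`; the actual shape emitted by `scripts/token-benchmark.js` is not shown in this diff and may differ.

```javascript
// Hypothetical run shape (field names are assumptions, not the runner's real output):
//   { error?: string, inputTokens: number, filesFound: string[] }

// Median of a numeric array; returns null when there is nothing to aggregate.
function median(values) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Percentage of ground-truth files that the agent identified.
function hitRate(filesFound, expectedFiles) {
  if (expectedFiles.length === 0) return 0;
  const found = new Set(filesFound);
  const hits = expectedFiles.filter((file) => found.has(file)).length;
  return (hits / expectedFiles.length) * 100;
}

// Aggregate all runs of one issue: drop error runs first, then take medians.
function aggregateIssue(runs, expectedFiles) {
  const ok = runs.filter((run) => !run.error);
  return {
    validRuns: ok.length,
    medianInputTokens: median(ok.map((run) => run.inputTokens)),
    medianHitRate: median(ok.map((run) => hitRate(run.filesFound, expectedFiles))),
  };
}
```

Filtering error runs before taking the median keeps a failed session from dragging the token count toward zero. With the default of 3 runs, a single failure leaves an even number of samples, so the median becomes the mean of the two remaining values.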
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,17 @@ | ||
| # Token Savings Benchmark: codegraph vs Raw Navigation | ||
|
|
||
| Measures how much codegraph reduces token usage when an AI agent navigates | ||
| the [Next.js](https://github.com/vercel/next.js) codebase (~4,000 TypeScript files). | ||
|
|
||
| *No benchmark data yet. Run the benchmark to populate this report:* | ||
|
|
||
| ```bash | ||
| node scripts/token-benchmark.js > result.json | ||
| node scripts/update-token-report.js result.json | ||
| ``` | ||
|
|
||
| See [README.md](README.md) for full instructions. | ||
|
|
||
| <!-- TOKEN_BENCHMARK_DATA | ||
| [] | ||
| --> |
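The placeholder report above ends with an HTML comment that begins with `TOKEN_BENCHMARK_DATA` and currently holds an empty JSON array. How `scripts/update-token-report.js` uses that marker is not shown in this diff. Assuming it stores run history there as a JSON array, reading and rewriting the block could look roughly like the sketch below; `readEmbeddedData` and `writeEmbeddedData` are hypothetical names.

```javascript
// Assumption: the report keeps its data as JSON inside an HTML comment that
// starts with the TOKEN_BENCHMARK_DATA marker. Both helpers are illustrative.

const MARKER_PATTERN = /<!--\s*TOKEN_BENCHMARK_DATA\s*([\s\S]*?)-->/;

// Return the parsed JSON payload, [] for an empty block, or null if no marker exists.
function readEmbeddedData(markdown) {
  const match = markdown.match(MARKER_PATTERN);
  if (!match) return null;
  const body = match[1].trim();
  return body ? JSON.parse(body) : [];
}

// Replace the existing data block, or append one if the report has none.
function writeEmbeddedData(markdown, data) {
  const block = `<!-- TOKEN_BENCHMARK_DATA\n${JSON.stringify(data, null, 2)}\n-->`;
  // A replacer function is used so that "$" inside the JSON is not treated
  // as a special replacement pattern.
  return MARKER_PATTERN.test(markdown)
    ? markdown.replace(MARKER_PATTERN, () => block)
    : `${markdown}\n\n${block}\n`;
}
```

One caveat of this approach: a payload containing the literal sequence `-->` would end the comment early, so string values would need escaping before being embedded.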
Tool count is incorrect: says "19 MCP tools", but `BASE_TOOLS` in `src/mcp.js` contains 21 tools after adding `symbol_path`. Should be "21 MCP tools" (22 in multi-repo mode with `list_repos`).
Fixed — updated all 3 occurrences in README.md from "19-tool" to "21-tool" (lines 100, 147, 173). BASE_TOOLS has 21 tools, 22 in multi-repo with list_repos.
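Count drift like the one fixed in this thread can be caught mechanically. The sketch below is a hypothetical check, not something this PR adds: it scans documentation text for "N-tool" or "N MCP tools" mentions and reports those whose number is not in an allowed list. The function name and the two phrasings it matches are assumptions based only on the wording quoted in the comments above.

```javascript
// Hypothetical docs check: report "N-tool" / "N MCP tools" mentions whose
// count is not in the allowed list (for example [21, 22] per the thread above).
function findStaleToolCounts(text, allowedCounts) {
  const pattern = /\b(\d+)(?:-tool\b| MCP tools\b)/g;
  const allowed = new Set(allowedCounts);
  const stale = [];
  text.split('\n').forEach((line, index) => {
    for (const match of line.matchAll(pattern)) {
      if (!allowed.has(Number(match[1]))) {
        // Line numbers are 1-based to match editor and review conventions.
        stale.push({ line: index + 1, text: match[0] });
      }
    }
  });
  return stale;
}
```

A caller could pass the README contents together with `[21, 22]` and fail CI when the returned array is non-empty. Deriving the allowed counts from the length of `BASE_TOOLS` instead of hard-coding them would keep the check from going stale itself.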