
feat: add PageRank ranking, architecture summary, and token-budgeted responses #147

Closed
maplenk wants to merge 6 commits into DeusData:main from maplenk:feat/pagerank-arch-summary

Conversation

@maplenk
Contributor

@maplenk maplenk commented Mar 26, 2026

Summary

Adds structural importance ranking (PageRank), a one-call architecture overview tool, and token-budgeted responses to prevent context window overflow.

New tools

  • get_architecture_summary — Structured markdown overview of the project: top files by connectivity, route→controller→service chains, Louvain clusters, high fan-in functions, entry points. Supports max_tokens for output size control and focus for narrowing to a specific area.

  • get_key_symbols — Returns top-K functions/classes ranked by PageRank. Enables "what are the most important functions in this codebase?" queries.

Enhanced tools

  • search_graph — New ranked parameter (default true). When enabled, results are sorted by PageRank score. PageRank included in response JSON.

  • trace_call_path — New ranked parameter. BFS results post-sorted by PageRank when enabled.

  • search_graph, trace_call_path, query_graph — New max_tokens parameter. Two-tier truncation: top 5 results in full detail, remainder as compact signatures. Emits truncated, total_results, shown metadata.
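The two-tier truncation described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `result_t`, `budget_plan_t`, `plan_budget`, and the 4-characters-per-token estimate are all hypothetical, and treating `max_tokens == 0` as "no budget" is an assumption. It follows the build-then-check idea: if the full output fits, nothing is truncated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define FULL_DETAIL_COUNT 5 /* "top 5 results in full detail" */

/* Hypothetical result record holding both renderings of one hit. */
typedef struct {
    const char *full;    /* full-detail rendering */
    const char *compact; /* compact signature rendering */
} result_t;

/* Metadata mirroring the truncated / total_results / shown fields. */
typedef struct {
    bool truncated;
    size_t total_results;
    size_t shown;
    size_t full_shown; /* how many of the shown results are in full detail */
} budget_plan_t;

/* Assumed estimator: roughly 4 characters per token, rounded up. */
static size_t estimate_tokens(const char *s)
{
    return (strlen(s) + 3) / 4;
}

/* Decide how many results fit in max_tokens (0 is assumed to mean "no budget"). */
budget_plan_t plan_budget(const result_t *r, size_t n, size_t max_tokens)
{
    assert(r != NULL || n == 0);

    budget_plan_t plan = { false, n, n, n };
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += estimate_tokens(r[i].full);

    /* Happy path: everything fits in full detail, so nothing changes. */
    if (max_tokens == 0 || total <= max_tokens)
        return plan;

    plan.truncated = true;
    plan.shown = 0;
    plan.full_shown = 0;

    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        bool full = i < FULL_DETAIL_COUNT;
        size_t cost = estimate_tokens(full ? r[i].full : r[i].compact);
        if (used + cost > max_tokens)
            break; /* budget exhausted: stop emitting results */
        used += cost;
        plan.shown++;
        if (full)
            plan.full_shown++;
    }
    return plan;
}
```

A caller would render the first `full_shown` results in full, the next `shown - full_shown` as compact signatures, and attach the three metadata fields to the response.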

Implementation details

  • PageRank: standard iterative algorithm (d=0.85, 20 iterations) with dangling node handling. Persisted in node_scores table. Runs as pipeline post-processing step. Non-fatal on failure.
  • Architecture summary: SQL queries against existing graph — no new indexing. Hash table lookups for O(1) file resolution. yyjson route property extraction.
  • Token budget: build-then-check approach (zero overhead on happy path). Compact chain summaries (A → ... (3 more) → Z) for truncated traces.
  • WAL-mode fix: read-only query opens use immutable SQLite URIs (fixes corrupt DB misclassification).
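For reference, the PageRank step described above (d=0.85, 20 iterations, dangling node handling) could look roughly like this. It is a sketch under stated assumptions, not the code from this PR: `edge_t` and `pagerank` are hypothetical names, the graph is a plain edge list rather than the SQLite store, and dangling nodes are assumed to spread their rank uniformly over all nodes.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical edge type: a call from node src to node dst. */
typedef struct { int src; int dst; } edge_t;

/*
 * Iterative PageRank with damping factor d and a fixed iteration count.
 * Writes scores into rank[0..n-1].
 * Returns 0 on success, -1 on invalid input or allocation failure.
 */
int pagerank(int n, const edge_t *edges, int m, double d, int iters, double *rank)
{
    assert(rank != NULL);
    if (n <= 0 || m < 0 || (m > 0 && edges == NULL))
        return -1;

    int *out_deg = calloc((size_t)n, sizeof *out_deg);
    double *next = malloc((size_t)n * sizeof *next);
    if (!out_deg || !next) {
        free(out_deg);
        free(next);
        return -1;
    }

    /* Validate edges and count outgoing edges per node. */
    for (int e = 0; e < m; e++) {
        if (edges[e].src < 0 || edges[e].src >= n ||
            edges[e].dst < 0 || edges[e].dst >= n) {
            free(out_deg);
            free(next);
            return -1;
        }
        out_deg[edges[e].src]++;
    }

    for (int i = 0; i < n; i++)
        rank[i] = 1.0 / n;

    for (int it = 0; it < iters; it++) {
        /* Rank held by nodes without outgoing edges is shared by everyone. */
        double dangling = 0.0;
        for (int i = 0; i < n; i++)
            if (out_deg[i] == 0)
                dangling += rank[i];

        double base = (1.0 - d) / n + d * dangling / n;
        for (int i = 0; i < n; i++)
            next[i] = base;

        /* Each node passes a damped, equal share of its rank along every edge. */
        for (int e = 0; e < m; e++) {
            assert(out_deg[edges[e].src] > 0);
            next[edges[e].dst] += d * rank[edges[e].src] / out_deg[edges[e].src];
        }

        for (int i = 0; i < n; i++)
            rank[i] = next[i];
    }

    free(out_deg);
    free(next);
    return 0;
}
```

With this dangling-node treatment the scores stay normalized (they sum to 1 after every iteration), which makes them comparable across runs on the same graph.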

Tests

  • test_store_arch.c: architecture summary (basic, focus, many_files, cluster_growth)
  • test_store_search.c: PageRank computation + ranking
  • test_mcp.c: get_key_symbols, ranked search, truncation for all 3 tools
  • test_pipeline.c: PageRank in pipeline
  • test_integration.c: live index tests

Motivation

AI coding agents consume 7–38% of the context window per structural query. PageRank ranking ensures the most important results appear first. Token budgets let agents request "give me the answer in under 2000 tokens." Architecture summaries eliminate entire categories of exploratory queries: one call replaces 3–5 tool invocations.

Benchmarked on a 32K-node / 70K-edge production Laravel codebase.


Part 1 of a 4-PR series. PRs 2–4 build on this foundation.


Built with OpenAI Codex and Claude Code.

maplenk added a commit to maplenk/codebase-memory-mcp that referenced this pull request Mar 26, 2026
All install paths, download URLs, self-update checks, CI workflows,
and documentation now reference maplenk/codebase-memory-mcp so the
fork can operate independently with its own releases while upstream
PRs (DeusData#147-DeusData#150) are pending. Upstream attribution in README fork
section and LICENSE preserved.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@DeusData DeusData added the enhancement New feature or request label Mar 26, 2026
@DeusData
Owner

Thanks @maplenk — PageRank for code ranking and architecture summaries is a great idea. Large PR — will review carefully.

Naman Khator and others added 6 commits March 27, 2026 18:10
Account for optional signatures in the search_graph and trace_call_path size estimators, and improve compact trace chains to report omitted-node counts.

This also documents the normal-path output enrichment introduced with Task 4: search_graph results now include file_path, start_line, end_line, and signature, and trace_call_path hop items now include file_path, start_line, and signature.
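The compact chain format with omitted-node counts (A → ... (3 more) → Z in the PR description) might be produced along these lines. This is only an illustrative sketch: `format_compact_chain` is a hypothetical helper, not a function from this PR, and it assumes the count reported is the number of nodes between the first and last hop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Render a call chain as "first → ... (k more) → last", where k is the
 * number of omitted intermediate nodes. Chains of one or two nodes have
 * nothing to omit and are printed as-is.
 * Returns 0 on success, -1 if the output did not fit in buf.
 */
int format_compact_chain(const char *const *names, size_t n, char *buf, size_t cap)
{
    assert(buf != NULL && cap > 0);
    assert(names != NULL || n == 0);

    int written;
    if (n == 0) {
        buf[0] = '\0';
        return 0;
    } else if (n == 1) {
        written = snprintf(buf, cap, "%s", names[0]);
    } else if (n == 2) {
        written = snprintf(buf, cap, "%s → %s", names[0], names[1]);
    } else {
        written = snprintf(buf, cap, "%s → ... (%zu more) → %s",
                           names[0], n - 2, names[n - 1]);
    }

    /* snprintf reports the length it needed; treat overflow as an error. */
    return (written < 0 || (size_t)written >= cap) ? -1 : 0;
}
```

For a five-node chain A, B, C, D, Z this yields `A → ... (3 more) → Z`, matching the format quoted in the PR description.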
- Guard cbm_mcp_text_result() against NULL text
- Fix memory leak in handle_get_key_symbols() REQUIRE_STORE path (focus not freed)
- Wire qn_pattern through handle_search_graph()
- Fix OOM infinite loop in markdown_builder_reserve()
- Return 0 instead of CBM_STORE_ERR from summary_count_nodes() on prepare fail

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@maplenk maplenk force-pushed the feat/pagerank-arch-summary branch from 1e02f10 to f3e93e7 on March 27, 2026 12:46
@DeusData
Owner

Thanks for the effort here, @maplenk. I want to give honest feedback on the core premise before we go further.

PageRank is the wrong algorithm for code graphs. PageRank measures "if you randomly follow edges, where do you end up?" On the web, being linked-to is an editorial signal. In a call graph, being called by many things means you're a leaf utility — log.Error(), fmt.Sprintf(), strings.Contains(). These would rank highest, which is the opposite of architecturally important code. Handlers, orchestrators, and pipeline stages — the code that actually matters — typically have few callers but many callees. PageRank would rank them low.

We already expose min_degree/max_degree on search_graph, which gives you direct fan-in/fan-out filtering with zero computational overhead. That covers the "find heavily-connected code" use case without the conceptual mismatch.

The architecture summary and token-budget features are separate ideas worth discussing on their own merits — but they're bundled here with PageRank as the foundation, which makes it hard to evaluate them independently. Could you split those into standalone PRs?

Also noting: this PR modifies store.c (+1,587 lines) and mcp.c (+944 lines), which are core files. Changes of that magnitude to the store and MCP layers need very careful review, especially since this is part 1 of 4 — I need to understand the full scope before committing to a direction.

@maplenk
Contributor Author

maplenk commented Mar 27, 2026

Hey @DeusData
Thanks for the details.

Will split the other features first and check on the PageRank algorithm as well!

@DeusData
Owner

DeusData commented Apr 2, 2026

Thanks @maplenk for the thorough work here — the benchmarking on a real Laravel codebase and the detailed writeup are appreciated.

After evaluation, we're going to pass on this for now. Here's our reasoning:

PageRank ranking: For code graphs, simple in-degree counting gives nearly identical results to PageRank because code structure is hierarchical and predictable (unlike web link graphs where transitive weighting matters). If we need result ranking, ORDER BY degree DESC is a 1-line SQL change vs a new pipeline step + table + iterative algorithm.
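To make the in-degree alternative concrete, here is a minimal in-memory sketch of ranking nodes by caller count, which is the same ordering an ORDER BY degree DESC query would give if `degree` were the in-degree. The names `edge_t`, `ranked_t`, and `rank_by_in_degree` are hypothetical and not part of the codebase; ties are broken by node id purely to keep the output deterministic.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical edge type: a call from node src to node dst. */
typedef struct { int src; int dst; } edge_t;

/* One output row: a node id and how many edges point at it. */
typedef struct { int node; int in_degree; } ranked_t;

/* Sort by in-degree descending, then by node id ascending. */
static int by_degree_desc(const void *a, const void *b)
{
    const ranked_t *x = a;
    const ranked_t *y = b;
    if (x->in_degree != y->in_degree)
        return (y->in_degree > x->in_degree) - (y->in_degree < x->in_degree);
    return (x->node > y->node) - (x->node < y->node);
}

/*
 * Fill out[0..n-1] with all nodes ordered by in-degree.
 * Returns 0 on success, -1 on invalid input.
 */
int rank_by_in_degree(int n, const edge_t *edges, int m, ranked_t *out)
{
    assert(out != NULL);
    if (n <= 0 || m < 0 || (m > 0 && edges == NULL))
        return -1;

    for (int i = 0; i < n; i++) {
        out[i].node = i;
        out[i].in_degree = 0;
    }

    /* One pass over the edges; no iteration or extra table needed. */
    for (int e = 0; e < m; e++) {
        if (edges[e].dst < 0 || edges[e].dst >= n)
            return -1;
        out[edges[e].dst].in_degree++;
    }

    qsort(out, (size_t)n, sizeof *out, by_degree_desc);
    return 0;
}
```

The whole ranking is a single pass plus a sort, which is the cost argument being made here against a separate pipeline step.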

get_architecture_summary / get_key_symbols: These overlap with existing tools — get_architecture already provides project summaries, and search_graph(min_degree=10) finds the most-connected symbols. We're trying to avoid tool inflation (currently at 14) since each tool adds cognitive load for LLMs parsing the tool list at session start.

max_tokens truncation: Agents already control result size via limit, offset, and depth parameters. Server-side truncation with opinionated formatting ("top 5 full, rest compact") removes control from the agent, which knows its own context budget better than we do.

These are reasonable ideas — we may revisit ranking or token budgets if users report specific pain points. For now the existing primitives cover the use cases.

@DeusData DeusData closed this Apr 2, 2026
@maplenk
Contributor Author

maplenk commented Apr 3, 2026

Hey!!

Thanks for the details.
I understand, and I realised that too. I have made some changes to those and will open a new issue with contribution guidelines soon!!

Please share your inputs on those 😀


Labels

enhancement New feature or request


2 participants