fix(retrieval): rank MENTIONED_IN chunks by cosine in MultiPath Path C #259

Merged: galshubeli merged 1 commit into main from fix/multipath-hub-entity-cosine-rank on May 18, 2026

@galshubeli galshubeli commented May 18, 2026

Summary

retrieve_chunks Path C uses COLLECT(c)[..3] with no ORDER BY. For hub entities — the main product name, generic verbs like "install", high-frequency Cypher keywords — this picks an arbitrary 3 out of hundreds, almost never the chunks relevant to the current query.

This PR adds an ORDER BY vec.cosineDistance(c.embedding, query_vector) before the COLLECT so per-entity chunk selection is query-aware. Closes the retrieval half of #258.

Repro / Evidence

Tested against the FalkorDB docs knowledge graph (1,356 chunks, 2,945 entities). The FalkorDB entity has 486 chunks linked via MENTIONED_IN. Six of those are install-relevant (chunks 0, 1, 2, 3, 6, 7 of getting-started/index.md).

P(an unordered 3 of 486 contains an install chunk): 6/486 ≈ 1.2% per slot → ~3.7% chance overall. Measured behavior matches: the standalone query How do I install FalkorDB? reliably failed.
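The odds above can be checked with an exact count. This is a minimal sketch that assumes the unordered pick behaves like a uniform draw without replacement, which the graph engine does not guarantee; the function name `p_at_least_one_relevant` is hypothetical.

```python
from math import comb


def p_at_least_one_relevant(total: int, relevant: int, picks: int) -> float:
    """Probability that `picks` chunks drawn uniformly without replacement
    from `total` chunks include at least one of the `relevant` ones."""
    p_none = comb(total - relevant, picks) / comb(total, picks)
    return 1.0 - p_none


total_chunks = 486    # chunks linked to the FalkorDB entity via MENTIONED_IN
relevant_chunks = 6   # install-relevant chunks among them

per_slot = relevant_chunks / total_chunks
overall = p_at_least_one_relevant(total_chunks, relevant_chunks, picks=3)
print(f"per slot: {per_slot:.1%}, at least one in 3: {overall:.1%}")
```

Counting the complement (no relevant chunk in any slot) avoids double-counting draws that contain more than one install chunk.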

Before (arbitrary 3):

"This isn't covered in the FalkorDB docs. Related setup topics include Docker, Kubernetes, and the Python client installation."

After (cosine-ranked 3):

"FalkorDB can be deployed with Docker, including the falkordb/falkordb-server image, and there's also a Helm chart installation path for Kubernetes. For Python, pip install falkordb installs the FalkorDB client."

Same query, same graph, same retrieval strategy — only chunk_retrieval.py:Path C changed.

The change

- MATCH (e:__Entity__ {id: eid})-[:MENTIONED_IN]->(c:Chunk)
- WITH eid, COLLECT(c)[..3] AS chunks
+ MATCH (e:__Entity__ {id: eid})-[:MENTIONED_IN]->(c:Chunk)
+ WHERE c.embedding IS NOT NULL
+ WITH eid, c, vec.cosineDistance(c.embedding, vecf32($qv)) AS dist
+ ORDER BY eid, dist ASC
+ WITH eid, COLLECT(c)[..3] AS chunks

query_vector is already passed into retrieve_chunks, where Path B (raw vector search) uses it, so the change only adds a query parameter ($qv) rather than changing the function signature. Chunks without a stored embedding are excluded by the c.embedding IS NOT NULL guard; these are rare.
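To make the intended semantics concrete, here is a pure-Python sketch of what the new query selects: per entity, the three chunks closest to the query vector, skipping chunks with no embedding. It is an illustration rather than the SDK's code. It assumes cosine distance is the conventional 1 minus cosine similarity and that vectors are non-zero; `top_chunks_per_entity` and the sample chunk ids are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; smaller means closer. Assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_chunks_per_entity(
    mentions: Iterable[Tuple[str, str, Optional[Vector]]],
    query_vector: Vector,
    k: int = 3,
) -> Dict[str, List[str]]:
    """mentions holds (entity_id, chunk_id, embedding-or-None) rows."""
    by_entity: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for eid, chunk_id, emb in mentions:
        if emb is None:  # mirrors WHERE c.embedding IS NOT NULL
            continue
        by_entity[eid].append((cosine_distance(emb, query_vector), chunk_id))
    # Sort ascending by distance, then keep the first k (ORDER BY ... then [..3]).
    return {eid: [cid for _, cid in sorted(rows)[:k]] for eid, rows in by_entity.items()}


mentions = [
    ("FalkorDB", "install-0", [1.0, 0.0]),
    ("FalkorDB", "cypher-9", [0.0, 1.0]),
    ("FalkorDB", "install-1", [0.9, 0.1]),
    ("FalkorDB", "docker-2", [0.7, 0.3]),
    ("FalkorDB", "no-embedding", None),
]
result = top_chunks_per_entity(mentions, query_vector=[1.0, 0.0])
print(result)
```

With a query vector pointing along the first axis, the chunk orthogonal to it and the chunk without an embedding both fall out of the top three.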

Tests

Added test_mentioned_in_ranks_chunks_by_cosine to tests/test_multi_path_retrieval.py, asserting the emitted Cypher contains vec.cosineDistance, ORDER BY, and passes the query vector as a parameter. Full retrieval test suite passes (161/161, 1 skip).
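For readers without the repository at hand, the shape of such a test can be sketched as follows. This is not the actual test: the Cypher string is pieced together from the diff above, while the `UNWIND` and `RETURN` lines, the `$eids` parameter, the `RecordingGraph` fake, the `query_raw` signature, and the `fetch_mentioned_in_chunks` wrapper are all assumptions made for illustration.

```python
import asyncio

# Hypothetical stand-in for the Path C query. The middle five lines come from
# the diff in this PR; the UNWIND/RETURN lines and `$eids` are assumptions.
MENTIONED_IN_CYPHER = """
UNWIND $eids AS eid
MATCH (e:__Entity__ {id: eid})-[:MENTIONED_IN]->(c:Chunk)
WHERE c.embedding IS NOT NULL
WITH eid, c, vec.cosineDistance(c.embedding, vecf32($qv)) AS dist
ORDER BY eid, dist ASC
WITH eid, COLLECT(c)[..3] AS chunks
RETURN eid, chunks
"""


class RecordingGraph:
    """Fake graph client that records every (cypher, params) call."""

    def __init__(self):
        self.calls = []

    async def query_raw(self, cypher, params=None):
        self.calls.append((cypher, params or {}))
        return []  # no rows; only the emitted query is under test


async def fetch_mentioned_in_chunks(graph, entity_ids, query_vector):
    """Hypothetical wrapper; the real retrieve_chunks is shaped differently."""
    params = {"eids": entity_ids, "qv": query_vector}
    return await graph.query_raw(MENTIONED_IN_CYPHER, params)


def test_mentioned_in_ranks_chunks_by_cosine():
    graph = RecordingGraph()
    asyncio.run(fetch_mentioned_in_chunks(graph, ["FalkorDB"], [0.1, 0.2]))
    cypher, params = graph.calls[0]
    assert "vec.cosineDistance" in cypher
    assert "ORDER BY" in cypher
    assert params["qv"] == [0.1, 0.2]


test_mentioned_in_ranks_chunks_by_cosine()
```

Asserting on the emitted query text keeps the test independent of a running graph, at the cost of not checking the ranking a live database would return.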

Test plan

  • Unit tests pass (pytest tests/test_multi_path_retrieval.py — 45/45)
  • Broader retrieval suite passes (pytest -k 'retrieval or multi_path or chunk' — 161/161 + 1 skip)
  • Live end-to-end repro on FalkorDB docs widget: failing query now answers correctly with cited install commands
  • Reviewer to sanity-check perf on graphs with very large MENTIONED_IN fan-out (ordering costs O(N log N) in the number N of chunks linked to an entity, and N is bounded by ingestion characteristics)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Improvements

    • Chunk retrieval now ranks each entity's MENTIONED_IN chunks by cosine distance to the query instead of selecting an arbitrary three, so the most relevant chunks are returned first.
  • Tests

    • Added regression test to verify chunk ranking behavior functions correctly.

Path C in retrieve_chunks used `COLLECT(c)[..3]` with no ORDER BY, so
hub entities (which can be MENTIONED_IN hundreds of chunks) returned
an arbitrary 3 — almost never including the chunks most relevant to
the current query.

Add an ORDER BY on `vec.cosineDistance(c.embedding, query_vector)`
before the COLLECT so per-entity chunk selection is query-aware.

Refs #258

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented May 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5200f7f4-edf2-4ba0-bff8-5e9a240c81b8

📥 Commits

Reviewing files that changed from the base of the PR and between 12fcca2 and 1db6a75.

📒 Files selected for processing (2)
  • graphrag_sdk/src/graphrag_sdk/retrieval/strategies/chunk_retrieval.py
  • graphrag_sdk/tests/test_multi_path_retrieval.py

Walkthrough

The PR updates the MENTIONED_IN chunk retrieval path to rank candidate chunks by the cosine distance between each chunk embedding and the query embedding, replacing arbitrary collection-based selection with the distance-sorted top 3 per entity. A regression test verifies that the updated Cypher includes the cosine distance computation and the query vector parameter.

Changes

MENTIONED_IN Chunk Ranking by Cosine Distance

| Layer / File(s) | Summary |
|---|---|
| **Cosine distance-based chunk ranking in Cypher**<br>`graphrag_sdk/src/graphrag_sdk/retrieval/strategies/chunk_retrieval.py` | The MENTIONED_IN retrieval Cypher is updated to filter non-null chunk embeddings, compute vec.cosineDistance per chunk, order by entity and ascending distance, and return the top 3 closest chunks per entity; the query vector is passed via query_raw parameters. |
| **Regression test for cosine distance ranking**<br>`graphrag_sdk/tests/test_multi_path_retrieval.py` | A new async test intercepts generated Cypher for the direct MENTIONED_IN chunk path, verifies presence of vec.cosineDistance and ORDER BY, and confirms the qv query vector parameter is included, ensuring ranking uses cosine distance rather than arbitrary order. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A query hops through embeddings near,
Cosine distance ranks chunks clear—
Top three per entity, sorted by glow,
Relevance blooms in the vector flow!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: ranking MENTIONED_IN chunks by cosine distance in the MultiPath retrieval strategy's Path C. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@galshubeli galshubeli merged commit a174629 into main May 18, 2026
13 of 14 checks passed
@galshubeli galshubeli deleted the fix/multipath-hub-entity-cosine-rank branch May 18, 2026 12:08