Skip to content

docs(retrieval): recommend SentenceTokenCapChunking for CypherFirst#253

Merged
galshubeli merged 1 commit into
feat/cypher-aggregation-accuracyfrom
docs/cypher-first-recommend-sentence-chunker
May 17, 2026
Merged

docs(retrieval): recommend SentenceTokenCapChunking for CypherFirst#253
galshubeli merged 1 commit into
feat/cypher-aggregation-accuracyfrom
docs/cypher-first-recommend-sentence-chunker

Conversation

@galshubeli
Copy link
Copy Markdown
Collaborator

@galshubeli galshubeli commented May 14, 2026

Summary

Adds a "Recommended ingestion config" section to CypherFirstAggregationStrategy's docstring pointing users at SentenceTokenCapChunking. Cosmetic / discoverability — the strategy itself works the same.

Why

The default chunker (FixedSizeChunking) splits on a character window with no awareness of sentence or word boundaries, so entity names get truncated at chunk boundaries ("Wayne Enterprises""Wayne En"). The stub entities never merge with their full forms during resolution, so cypher counts come back inflated.

On the internal 7-question aggregation benchmark, switching the benchmark's ingestion config from FixedSizeChunking(chunk_size=1000) to SentenceTokenCapChunking(max_tokens=512, overlap_sentences=2) moved the score from 6/7 (intermittent) to 7/7 (stable across three runs).

Stacking

This PR is stacked on #255 (feat/cypher-aggregation-accuracy). After #255 merges, GitHub will auto-retarget this to main and the diff stays a 33-line docstring change.

Related: #254 makes SentenceTokenCapChunking the SDK's default chunker — that's a behavior change for all callers, kept separate from this docs-only PR.

Test plan

  • All existing tests pass on the parent branch
  • No new tests needed (docstring only)

Adds a "Recommended ingestion config" section to
CypherFirstAggregationStrategy's docstring.

The default chunker (FixedSizeChunking) splits on a character window
with no sentence/paragraph awareness, so entity names get truncated at
chunk boundaries — "Wayne Enterprises" becomes "Wayne En" in one chunk
and "terprises..." in the next. The resulting stub entities never merge
with their full forms during resolution, so cypher counts come back
inflated and "which X" lists pick up phantoms.

On the internal 7-question aggregation benchmark, switching from
FixedSizeChunking(chunk_size=1000) to
SentenceTokenCapChunking(max_tokens=512, overlap_sentences=2) moved the
score from 6/7 (intermittent) to 7/7 (stable across three runs). The
post-ingest graph state went from 11-14 organization nodes including
"Glo" / "Initech System" / "Wayne En" stubs to exactly 10 clean orgs,
and from 66-80 Person nodes (with Carla / Carla Okafor duplicates) to
exactly 56 distinct persons.

The short-form duplicate fix is a side benefit: sentence-aware chunks
follow natural prose boundaries with overlap, so a person's first
mention almost always lands in the same chunk as their later short-form
references — per-chunk FastCoref then has the antecedent it needs.

This docstring is the smallest useful change. A separate PR will propose
changing the SDK default chunker; that's a larger conversation since it
affects every existing caller.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 14, 2026

📝 Walkthrough

Walkthrough

This PR introduces CypherFirstAggregationStrategy, a new retrieval strategy that routes aggregation-style questions deterministically through database queries. It enhances Cypher generation, result assembly, system-prompt handling, and adds the complete strategy with supporting utilities and tests.

Changes

Cypher-First Aggregation Strategy

Layer / File(s) Summary
Package exports and public API
graphrag_sdk/src/graphrag_sdk/__init__.py, graphrag_sdk/src/graphrag_sdk/retrieval/strategies/__init__.py
New strategy and phrase-extractor interfaces exported at package façade level for public consumption.
System prompt authority rule and detection
graphrag_sdk/src/graphrag_sdk/api/main.py
Conditional cypher-authority instruction (_CYPHER_AUTH_RULE) detects authoritative cypher-derived content in retriever results and injects rule into system prompt during completion to prefer deterministic quantitative answers.
Cypher query generation and validation enhancements
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.py
Pure-aggregation detection gates conditional LIMIT injection; unsupported namespaced functions (apoc.*, gds.*, db.*) are rejected during validation; schema prompts updated with BOTH/intersection guidance; zero-row fallback retries with widened typed labels (except for pure aggregations); metadata tracked for fallback status and row counts.
Result assembly and metadata flow
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/result_assembly.py, graphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.py
Cypher result cap increased to 100 rows with explicit truncation messaging; result metadata merged to expose fallback behavior and truncation flags; MultiPathRetrieval wired to pass cypher metadata through orchestration layer.
CypherFirstAggregationStrategy core implementation
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.py
Intent classification (numeric\_math, aggregation, rag) routes queries to four sub-paths: numeric extraction and Python computation, shared-property hybrid via fuzzy intersection, multi-candidate Cypher table with row-count scoring, or RAG fallback. Includes intent detectors, phrase extractors (pluggable interface with default regex-based English impl), fuzzy matching, markdown table formatting, numeric coercion, path metadata tagging, and negation-existential "No" handling.
Comprehensive test coverage
graphrag_sdk/tests/test_cypher_first.py, graphrag_sdk/tests/test_cypher_generation.py, graphrag_sdk/tests/test_facade.py
Unit tests for intent detection, shape predicates, extraction, fuzzy intersection, formatting; end-to-end routing scenarios with mocks; Cypher validation/sanitization coverage including LIMIT injection and label-widening fallback; system-prompt authority rule detection.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • Naseem77

Poem

🐰 A strategy hops through the query tree,
Finding aggregates deterministically,
Cypher paths bloom, fuzzy intersections dance,
While fallbacks waltz if numbers don't prance. ✨📊

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 19.17% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ⚠️ Warning The title describes a documentation update recommending SentenceTokenCapChunking for CypherFirst, but the raw_summary shows this PR implements a major Cypher-first aggregation retrieval strategy (~1,200 lines), extensive bug fixes, and helper functions—far beyond a simple documentation change. Update the title to accurately reflect the primary change: 'feat(retrieval): add CypherFirstAggregationStrategy with intelligent query routing' or similar, since the documentation recommendation is incidental to the feature implementation.
✅ Passed checks (3 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch docs/cypher-first-recommend-sentence-chunker

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a “Cypher-first” aggregation-focused retrieval strategy and extends existing Cypher retrieval plumbing so deterministic graph-query results can be surfaced (and treated as authoritative) in the final LLM prompt when applicable.

Changes:

  • Add CypherFirstAggregationStrategy with intent routing (RAG vs aggregation vs numeric-math) plus helpers (table formatting, fuzzy intersect, phrase extractors).
  • Enhance text-to-Cypher validation/sanitization (row-limit behavior, reject dotted namespaces, typed-label widening fallback) and propagate cypher execution metadata through MultiPathRetrieval → assembled results.
  • Conditionally append a cypher-authority rule to the system prompt only when authoritative cypher results are present; add/expand tests for these behaviors.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
graphrag_sdk/tests/test_facade.py Adds tests ensuring the cypher-authority prompt rule is injected only when authoritative cypher context is present.
graphrag_sdk/tests/test_cypher_generation.py Expands validator/sanitizer tests; updates execute-cypher retrieval tests for new metadata + fallback behavior.
graphrag_sdk/tests/test_cypher_first.py New tests covering cypher-first helper functions and mocked end-to-end routing behavior.
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/result_assembly.py Renames/expands cypher results section, raises cypher context cap, and merges cypher diagnostics into result metadata.
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.py Updates cypher execution integration to accept and pass through cypher metadata into assembled results.
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.py Adds pure-aggregation detection for LIMIT injection, dotted-namespace rejection, and typed-label widening retry with diagnostics.
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.py New cypher-first aggregation strategy implementation (intent detection, hybrid paths, numeric math path, table formatting, phrase extraction).
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/init.py Exposes the new cypher-first strategy and extractor interfaces from the strategies package.
graphrag_sdk/src/graphrag_sdk/api/main.py Adds conditional system-prompt addendum injection based on presence of authoritative cypher results.
graphrag_sdk/src/graphrag_sdk/init.py Exports the new cypher-first strategy and phrase extractor types at the top-level package.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +506 to +510
cypher = None
else:
cypher = _sanitize_cypher(cypher)
result = await self._s._graph.query_raw(cypher)
for row in (result.result_set or []):
Comment on lines +1035 to +1039
if intent == "rag":
# Non-aggregation questions don't benefit from any of the
# cypher-first mechanics — hand straight to the fallback.
return await self._rag_path.maybe_handle(query, ctx)

r"\bboth\s+\w+(?:\s+\S+)*?\s+and\s+\w+\b",
r"\bbetween\s+\w+\s+and\s+\w+\b",
r"\bexactly\s+\d+\b", r"\bat\s+least\s+\d+\b",
r"\b(?:does|do|has|have|is|are)\s+(?:[A-Z]\w*\s+){1,4}(?:have|has|contain|own|run|host)\b",
Comment on lines +584 to +588
batch_cypher = (
"MATCH (o:Organization)<-[:RELATES]-(p:Person) "
"OPTIONAL MATCH (p)-[:MENTIONED_IN]->(c:Chunk) "
"RETURN o.name AS org, p.name AS person, "
" p.description AS desc, collect(DISTINCT c.text) AS chunks"
Comment on lines +44 to +48
# Default row cap auto-injected when the LLM's query lacks a LIMIT.
# Pure aggregations (count/sum/avg over no group-by) return one row, so
# they skip injection entirely; group-by lists need enough rows that a 10-org
# breakdown isn't truncated.
_DEFAULT_ROW_LIMIT = 100
Comment on lines +95 to +99
# Optional addendum appended to the system prompt only when the retriever
# produced an "Authoritative Graph Query Results" section (currently emitted
# by ``CypherFirstAggregationStrategy`` and by ``MultiPathRetrieval`` with
# ``enable_cypher=True``). Kept out of the base prompts so callers who never
# use cypher retrieval don't pay the prompt-token cost or risk subtle
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.py`:
- Around line 509-514: The loop over result.result_set currently appends every
numeric-looking cell per row, which can overcount; change the logic in the block
after result = await self._s._graph.query_raw(cypher) so that for each row you
scan cells and append only the single intended numeric value (e.g., the first
non-None result of _coerce_number(cell)) instead of every numeric cell—update
the nested loops that call _coerce_number and push into values to break after
the first numeric is found (or otherwise select by expected column index/name)
to ensure one numeric aggregation per row.
- Around line 541-544: The current debug string construction in cypher_first.py
that builds f"source_values ({len(values)} rows): {', '.join(... for v in
values)}" can produce huge payloads; modify the formatter used in the
function/method that contains variables op, ans_str, values, cypher (the block
building that f-string) to cap how many values are serialized (e.g., limit to
first N rows or first M characters), and when truncated append a clear indicator
like "… (truncated X rows)" so the log includes the count but not all data;
ensure the same truncated representation is used wherever source_values is
rendered to avoid oversized retrieval contexts.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eeba0db7-0789-4b0e-a209-87fe58171da6

📥 Commits

Reviewing files that changed from the base of the PR and between a90ad18 and 37e233a.

📒 Files selected for processing (10)
  • graphrag_sdk/src/graphrag_sdk/__init__.py
  • graphrag_sdk/src/graphrag_sdk/api/main.py
  • graphrag_sdk/src/graphrag_sdk/retrieval/strategies/__init__.py
  • graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.py
  • graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.py
  • graphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.py
  • graphrag_sdk/src/graphrag_sdk/retrieval/strategies/result_assembly.py
  • graphrag_sdk/tests/test_cypher_first.py
  • graphrag_sdk/tests/test_cypher_generation.py
  • graphrag_sdk/tests/test_facade.py

Comment on lines +509 to +514
result = await self._s._graph.query_raw(cypher)
for row in (result.result_set or []):
for cell in row:
v = _coerce_number(cell)
if v is not None:
values.append(v)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restrict numeric aggregation to one intended value per row.

This currently aggregates every numeric-looking cell in each row, which can silently skew results when the generated Cypher returns extra columns.

💡 Proposed fix
-                for row in (result.result_set or []):
-                    for cell in row:
-                        v = _coerce_number(cell)
-                        if v is not None:
-                            values.append(v)
+                for row in result.result_set:
+                    if not row:
+                        continue
+                    # Numeric path expects one value per row; use the first column only.
+                    v = _coerce_number(row[0])
+                    if v is not None:
+                        values.append(v)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.py` around
lines 509 - 514, The loop over result.result_set currently appends every
numeric-looking cell per row, which can overcount; change the logic in the block
after result = await self._s._graph.query_raw(cypher) so that for each row you
scan cells and append only the single intended numeric value (e.g., the first
non-None result of _coerce_number(cell)) instead of every numeric cell—update
the nested loops that call _coerce_number and push into values to break after
the first numeric is found (or otherwise select by expected column index/name)
to ensure one numeric aggregation per row.

Comment on lines +541 to +544
f"computed {op} = {ans_str}\n"
f"source_values ({len(values)} rows): "
f"{', '.join(str(int(v)) if v == int(v) else f'{v:.2f}' for v in values)}\n"
f"cypher: {cypher}"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Cap echoed source_values to prevent oversized retrieval context.

Serializing all extracted values into one line can produce very large context payloads on high-row queries and hurt completion reliability.

💡 Proposed fix
-        body = (
-            f"computed {op} = {ans_str}\n"
-            f"source_values ({len(values)} rows): "
-            f"{', '.join(str(int(v)) if v == int(v) else f'{v:.2f}' for v in values)}\n"
-            f"cypher: {cypher}"
-        )
+        preview_cap = 50
+        preview = values[:preview_cap]
+        preview_suffix = (
+            f", ... (+{len(values) - preview_cap} more)" if len(values) > preview_cap else ""
+        )
+        body = (
+            f"computed {op} = {ans_str}\n"
+            f"source_values ({len(values)} rows): "
+            f"{', '.join(str(int(v)) if v == int(v) else f'{v:.2f}' for v in preview)}"
+            f"{preview_suffix}\n"
+            f"cypher: {cypher}"
+        )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.py` around
lines 541 - 544, The current debug string construction in cypher_first.py that
builds f"source_values ({len(values)} rows): {', '.join(... for v in values)}"
can produce huge payloads; modify the formatter used in the function/method that
contains variables op, ans_str, values, cypher (the block building that
f-string) to cap how many values are serialized (e.g., limit to first N rows or
first M characters), and when truncated append a clear indicator like "…
(truncated X rows)" so the log includes the count but not all data; ensure the
same truncated representation is used wherever source_values is rendered to
avoid oversized retrieval contexts.

@galshubeli galshubeli changed the title Docs/cypher first recommend sentence chunker docs(retrieval): recommend SentenceTokenCapChunking for CypherFirst May 14, 2026
@galshubeli galshubeli changed the base branch from main to feat/cypher-aggregation-accuracy May 14, 2026 14:17
@galshubeli galshubeli merged commit ef42a09 into feat/cypher-aggregation-accuracy May 17, 2026
9 checks passed
@galshubeli galshubeli deleted the docs/cypher-first-recommend-sentence-chunker branch May 17, 2026 14:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants