241 issue expand document loader coverage by drr00t · Pull Request #262 · FalkorDB/GraphRAG-SDK

drr00t · 2026-05-19T00:27:11Z

Summary

This feature extends the loader interface using Docling to add support for DOCX, XLSX, PPTX, CSV, HTML, and XHTML.

Changes

DoclingBaseLoader: Base class wrapping the Docling document model.

Test Plan

All existing tests pass (pytest tests/ -q)
New tests added for new functionality (if applicable)
Lint passes (ruff check src/)

Notes

Currently, the GraphRAG_SDK loader interface maps one loader to one extension. Since Docling supports multiple extensions, this PR updates the current loader implementation paradigm to support multi-format loaders.
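
The move from one loader per extension to multi-format loaders can be pictured roughly as below. This is only a sketch under stated assumptions: `MultiFormatLoader`, `build_registry`, and `pick_loader` are hypothetical names for illustration and are not taken from the SDK.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class MultiFormatLoader:
    """Hypothetical loader that declares every extension it can handle."""
    name: str
    extensions: tuple[str, ...]


def build_registry(loaders):
    """Map each lower-cased extension to the single loader that claims it."""
    registry = {}
    for loader in loaders:
        for ext in loader.extensions:
            key = ext.lower()
            if key in registry:
                raise ValueError(f"{key} is claimed by two loaders")
            registry[key] = loader
    return registry


def pick_loader(path, registry, fallback):
    """Return the loader registered for the file suffix, else the fallback."""
    return registry.get(Path(path).suffix.lower(), fallback)


# Illustrative loaders only; the real SDK classes may look different.
text = MultiFormatLoader("text", (".txt",))
office = MultiFormatLoader(
    "docling", (".docx", ".xlsx", ".pptx", ".csv", ".html", ".xhtml")
)
registry = build_registry([text, office])
```

With this shape, one Docling-backed loader can serve several suffixes while unknown suffixes still fall through to a plain-text fallback.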

Summary by CodeRabbit

Release Notes

New Features
- Added support for ingesting multiple document formats: Word documents (.docx), Excel spreadsheets (.xlsx), PowerPoint presentations (.pptx), HTML files, and CSV files.
- Updated default document chunking strategy to better preserve sentence boundaries and document structure.
Chores
- Added Docling as an optional dependency for document processing.

…te()

Changes the default chunker that ``GraphRAG.ingest()`` and ``GraphRAG.update()`` fall back to when the caller doesn't pass an explicit ``chunker=``. Was ``FixedSizeChunking()``; now ``SentenceTokenCapChunking()`` (sentence-aware, max_tokens=512, overlap_sentences=2, which are the strategy's own defaults).

Why
---

``FixedSizeChunking`` splits on a hard character window with no awareness of sentence, word, or paragraph boundaries. When the window cuts through an entity name, the per-chunk LLM extractor produces a stub entity for the fragment (``"Wayne Enterprises"`` → ``"Wayne En"`` in chunk N plus unparsable text in chunk N+1). These stubs never merge with their full forms during resolution because their embeddings differ enough that LLMVerifiedResolution scores them below the soft threshold. This silently inflates cypher counts and pollutes "which X" lists.

The strategy that surfaced this, ``CypherFirstAggregationStrategy``, was hitting a 6/7 ceiling on the internal aggregation benchmark, with one question failing because of these stubs. Switching to ``SentenceTokenCapChunking`` cleared the benchmark to 7/7 stable across three runs. The post-ingest graph state went from 11-14 organization nodes (including ``Glo`` / ``Initech System`` / ``Wayne En``) to exactly 10 clean orgs, and from 66-80 ``Person`` nodes (with ``Carla`` / ``Carla Okafor`` duplicates) to exactly 56 distinct persons, matching the corpus.

A side benefit: sentence-aware chunks with 2-sentence overlap almost always keep a person's first mention in the same chunk as their later short-form references, so per-chunk FastCoref now binds ``Carla → Carla Okafor`` reliably. That eliminates the short-form-duplicate class too, not just the truncation stubs.

Compatibility
-------------

``FixedSizeChunking`` remains exported and fully supported; callers who explicitly pass ``chunker=FixedSizeChunking()`` get unchanged behavior.

Existing tests (748 passed, 24 skipped) pass without modification: no test in the suite asserts on chunk count or content shape from the default chunker, so switching defaults doesn't break the suite. Callers who relied on the previous default and want to keep it should pass ``chunker=FixedSizeChunking()`` explicitly. The docstrings call out the new default and reference ``FixedSizeChunking`` as the opt-in character-window alternative.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
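
To make the truncation problem concrete, here is a small sketch contrasting a character window with sentence packing. It is not the SDK's implementation: `fixed_size_chunks` and `sentence_chunks` are hypothetical helpers, whitespace-separated words stand in for real tokens, and the regex sentence splitter is deliberately naive.

```python
import re

TEXT = (
    "Carla Okafor joined Wayne Enterprises in 2019. "
    "She leads the robotics group. "
    "Carla reports to the board."
)


def fixed_size_chunks(text, size):
    """Character-window split with no awareness of word or sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def _tokens(sentences):
    """Approximate token count: whitespace-separated words (an assumption)."""
    return sum(len(s.split()) for s in sentences)


def sentence_chunks(text, max_tokens=12, overlap_sentences=1):
    """Greedily pack whole sentences under a token cap, carrying overlap forward."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        if current and _tokens(current) + len(sentence.split()) > max_tokens:
            chunks.append(" ".join(current))
            # Start the next chunk with the trailing sentences as overlap.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            # Drop overlap that would push the next chunk over the cap.
            while current and _tokens(current) + len(sentence.split()) > max_tokens:
                current.pop(0)
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With a 22-character window the first chunk ends in "Wa", so no chunk contains the full organization name. The sentence-based version keeps each sentence whole and repeats the middle sentence in both chunks. A single sentence longer than the cap would still exceed it in this sketch.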

coderabbitai · 2026-05-19T00:27:23Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


Walkthrough

This PR extends GraphRAG SDK with docling-based loaders for multiple document formats (DOCX, XLSX, PPTX, HTML, CSV), updates packaging dependencies, shifts the default ingestion chunker from FixedSizeChunking to SentenceTokenCapChunking, and integrates auto-detection routing in the main API with comprehensive test coverage.

Changes

Docling-based document loaders and API updates

Layer / File(s)	Summary
Dependencies and DoclingBaseLoader base implementation `graphrag_sdk/pyproject.toml`, `graphrag_sdk/src/graphrag_sdk/ingestion/loaders/docling_base.py`	Adds `docling>=2.91.0` as optional dependency; implements `DoclingBaseLoader(LoaderStrategy)` with async `load()` delegating to `_load_sync()` for document extraction. Validates file existence, handles missing docling gracefully, converts via `DocumentConverter`, maps docling labels to GraphRAG element types, maintains breadcrumb stacks based on header hierarchy, and returns `DocumentOutput` with assembled text, elements, and file metadata.
Format-specific loader implementations `graphrag_sdk/src/graphrag_sdk/ingestion/loaders/docx_loader.py`, `xlsx_loader.py`, `pptx_loader.py`, `html_loader.py`, `csv_loader.py`	Introduces `DocxLoader`, `XlsxLoader`, `PptxLoader`, `HtmlLoader`, and `CsvLoader` as lightweight `DoclingBaseLoader` subclasses, each defining an `extension_name` property to identify their supported file format.
Loader package exports and main API integration `graphrag_sdk/src/graphrag_sdk/ingestion/loaders/__init__.py`, `graphrag_sdk/src/graphrag_sdk/api/main.py`	Re-exports all five format loaders from loaders package; imports loaders and updates main.py with `SentenceTokenCapChunking`. Implements `_default_loader_for()` to match file extensions (`.docx`, `.xlsx`, `.pptx`, `.html`/`.xhtml`, `.csv`) to appropriate loaders with `TextLoader()` fallback. Updates `_ingest_single()`, `update()`, and docstrings to default to `SentenceTokenCapChunking` instead of `FixedSizeChunking`.
DoclingBaseLoader test suite `graphrag_sdk/tests/test_docling_loaders.py`	Adds `MockDocxLoader` and `TestDoclingBaseLoader` with seven async tests validating: missing docling dependency wrapping, docling label-to-element mapping, breadcrumb construction from hierarchy, missing file detection, conversion error wrapping, markdown export fallback, and specialized element types for lists/tables/code.
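
The walkthrough mentions breadcrumb stacks maintained from the header hierarchy. One common way to do that is a level-keyed stack, sketched below. The element tuple shape `(kind, level, text)` and the function name `attach_breadcrumbs` are assumptions for illustration; the actual `DoclingBaseLoader` code is not shown on this page and may differ.

```python
def attach_breadcrumbs(elements):
    """Annotate each element with the titles of its enclosing headers.

    `elements` is a list of (kind, level, text) tuples; `level` is only
    meaningful when kind == "header" (1 = top level). Hypothetical shape.
    """
    stack = []  # (level, title) pairs for the currently open headers
    annotated = []
    for kind, level, text in elements:
        if kind == "header":
            # A new header closes every open header at the same or deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, text))
        annotated.append(
            {"kind": kind, "text": text, "breadcrumbs": [t for _, t in stack]}
        )
    return annotated


sample = [
    ("header", 1, "Intro"),
    ("paragraph", 0, "p1"),
    ("header", 2, "Scope"),
    ("paragraph", 0, "p2"),
    ("header", 1, "Methods"),
    ("paragraph", 0, "p3"),
]
out = attach_breadcrumbs(sample)
```

In this sketch a header appears in its own breadcrumb trail; whether the real loader does the same is not stated in the review.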

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

FalkorDB/GraphRAG-SDK#254: Updates default ingestion chunker from FixedSizeChunking to SentenceTokenCapChunking in the same file (main.py).

Poem

🐰 Docling arrives to read the written word,
In DOCX, XLSX, where text lay stirred,
With breadcrumbs traced through headers tall,
The SDK ingests them all!
From CSV to HTML, no doc's unheard. 📚✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 59.09% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Expand document loader coverage' accurately reflects the main objectives: adding support for multiple document formats (DOCX, XLSX, PPTX, CSV, HTML) via Docling integration and expanding the loader interface.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@graphrag_sdk/pyproject.toml`:
- Around line 59-67: The extras specification is inconsistent: the standalone
"docling" extra pins docling>=2.91.0 while the "all" extras list contains
docling>=2.0.0; update the "all" extras entry to require the same minimum
(docling>=2.91.0) so installing graphrag-sdk[all] cannot pull an older
incompatible docling version—edit the "docling" entry inside the all array in
pyproject.toml to match the docling extra's minimum version.

In `@graphrag_sdk/src/graphrag_sdk/api/main.py`:
- Around line 329-333: Update the stale loader-default docstring that currently
reads "Loader: auto-detected from file extension (PDF or text)" so it reflects
the new extension routing; locate the docstring containing that exact phrase in
graphrag_sdk/api/main.py (the help/usage text shown in the diff) and replace it
with a concise description like "Loader: auto-detected from file extension (PDF,
DOCX, XLSX, PPTX, HTML/XHTML, CSV, or plain text)" so the user-facing docs list
the supported formats.

In `@graphrag_sdk/tests/test_docling_loaders.py`:
- Around line 24-27: The test is breaking the import system by having the
patched builtins.__import__ return None for non-target imports; change the patch
side_effect to delegate to the real importer for all names except the one you
want to simulate failing (i.e., capture the original_import =
builtins.__import__ and in the side_effect for the patch used around loader.load
call, raise ImportError("module not found") when name ==
"docling.document_converter" and otherwise return original_import(name, *args,
**kwargs)); keep the pytest.raises assertion around await loader.load(str(file),
ctx) unchanged and reference the patched builtins.__import__ side_effect that
delegates for everything except "docling.document_converter".


📥 Commits

Reviewing files that changed from the base of the PR and between a174629 and 9bfbab4.

📒 Files selected for processing (10)

graphrag_sdk/pyproject.toml
graphrag_sdk/src/graphrag_sdk/api/main.py
graphrag_sdk/src/graphrag_sdk/ingestion/loaders/__init__.py
graphrag_sdk/src/graphrag_sdk/ingestion/loaders/csv_loader.py
graphrag_sdk/src/graphrag_sdk/ingestion/loaders/docling_base.py
graphrag_sdk/src/graphrag_sdk/ingestion/loaders/docx_loader.py
graphrag_sdk/src/graphrag_sdk/ingestion/loaders/html_loader.py
graphrag_sdk/src/graphrag_sdk/ingestion/loaders/pptx_loader.py
graphrag_sdk/src/graphrag_sdk/ingestion/loaders/xlsx_loader.py
graphrag_sdk/tests/test_docling_loaders.py

coderabbitai · 2026-05-19T00:30:53Z

+docling = ["docling>=2.91.0"]
 all = [
    "openai>=1.0,<2.0",
    "anthropic>=0.20,<1.0",
    "cohere>=5.0",
    "sentence-transformers>=2.0",
    "pypdf>=6.9.2",
    "litellm>=1.83.0,<2.0",
+    "docling>=2.0.0",


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="graphrag_sdk/pyproject.toml"
echo "Docling constraints found in optional dependencies:"
rg -n 'docling>=([0-9]+\.){1,2}[0-9]+' "$FILE"
echo
echo "Expected verification result:"
echo "- The dedicated 'docling' extra and the 'all' extra should use the same minimum version."

Repository: FalkorDB/GraphRAG-SDK

Length of output: 295

Align docling minimum version across extras.

Line 59 requires docling>=2.91.0, but Line 67 allows docling>=2.0.0 in all. Installing graphrag-sdk[all] can pull an older docling than the dedicated docling extra, which breaks compatibility for the new loaders.

Proposed fix

 all = [
@@
-    "docling>=2.0.0",
+    "docling>=2.91.0",
 ]


Suggested change

 docling = ["docling>=2.91.0"]
 all = [
     "openai>=1.0,<2.0",
     "anthropic>=0.20,<1.0",
     "cohere>=5.0",
     "sentence-transformers>=2.0",
     "pypdf>=6.9.2",
     "litellm>=1.83.0,<2.0",
-    "docling>=2.0.0",
+    "docling>=2.91.0",
 ]
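
A mismatch like this can be caught mechanically. The sketch below scans an extras mapping (the shape of `[project.optional-dependencies]` once parsed) and reports packages that appear with different specifiers. `inconsistent_pins` is a hypothetical helper, and the regex is a simplification that does not implement the full dependency-specifier grammar.

```python
import re
from collections import defaultdict

# Splits "name<spec>" into a package name and the rest (simplified on purpose).
_REQ = re.compile(r"^\s*([A-Za-z0-9._-]+)\s*(.*)$")


def inconsistent_pins(extras):
    """Return {package: {specifier: [extras...]}} for packages pinned differently."""
    seen = defaultdict(dict)
    for extra, requirements in extras.items():
        for requirement in requirements:
            name, spec = _REQ.match(requirement).groups()
            seen[name.lower()].setdefault(spec.strip(), []).append(extra)
    return {pkg: specs for pkg, specs in seen.items() if len(specs) > 1}


before = {
    "docling": ["docling>=2.91.0"],
    "all": ["pypdf>=6.9.2", "docling>=2.0.0"],
}
after = {
    "docling": ["docling>=2.91.0"],
    "all": ["pypdf>=6.9.2", "docling>=2.91.0"],
}
```

Run against the `before` mapping this flags `docling`; against `after` it returns an empty dict.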


coderabbitai · 2026-05-19T00:30:54Z

        - Loader: auto-detected from file extension (PDF or text)
-        - Chunker: FixedSizeChunking(chunk_size=1000)
+        - Chunker: SentenceTokenCapChunking(max_tokens=512, overlap_sentences=2)
+          — sentence-aware, never splits entity names at chunk boundaries.
+          Override with ``chunker=FixedSizeChunking(...)`` if you need
+          character-window chunking.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the loader-default docstring to match the new extension routing.

Line 329 still says “PDF or text”, but the default loader now supports DOCX/XLSX/PPTX/HTML/XHTML/CSV as well. This user-facing doc is now stale.

✏️ Proposed docstring fix

-        - Loader: auto-detected from file extension (PDF or text)
+        - Loader: auto-detected from file extension
+          (PDF, Markdown, DOCX, XLSX, PPTX, HTML/XHTML, CSV, or text fallback)


coderabbitai · 2026-05-19T00:30:54Z

+        with patch("builtins.__import__", side_effect=lambda name, *args, **kwargs:
+                   (exec('raise ImportError("module not found")') if name == "docling.document_converter" else None)):
+            with pytest.raises(LoaderError, match=r"DOCX parsing requires 'docling'"):
+                await loader.load(str(file), ctx)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n -C3 'patch\("builtins\.__import__"|side_effect=lambda name|else None' graphrag_sdk/tests/test_docling_loaders.py

Repository: FalkorDB/GraphRAG-SDK

Length of output: 515

🏁 Script executed:

# Check for other similar import mocking patterns in the test file
rg -n 'patch.*__import__|side_effect.*lambda' graphrag_sdk/tests/test_docling_loaders.py

Repository: FalkorDB/GraphRAG-SDK

Length of output: 152

🏁 Script executed:

# Find the loader implementation to understand what imports happen during load()
fd -name '*loader*' -type f graphrag_sdk/ | head -20

Repository: FalkorDB/GraphRAG-SDK

Length of output: 296

🏁 Script executed:

# Check what docling_base_loader.load() actually does
rg -n 'def load|async def load' graphrag_sdk/ -A 15 | head -100

Repository: FalkorDB/GraphRAG-SDK

Length of output: 9229

Mocking builtins.__import__ with a lambda that returns None for non-target imports breaks the import system.

The current pattern at lines 24-25 causes any unrelated import during loader.load(...) to fail because builtins.__import__ must return a module object or delegate to the real importer—not None. This can cause test flakiness.

Recommended fix

-        with patch("builtins.__import__", side_effect=lambda name, *args, **kwargs:
-                   (exec('raise ImportError("module not found")') if name == "docling.document_converter" else None)):
+        real_import = __import__
+
+        def _import(name, *args, **kwargs):
+            if name == "docling.document_converter":
+                raise ImportError("module not found")
+            return real_import(name, *args, **kwargs)
+
+        with patch("builtins.__import__", side_effect=_import):
             with pytest.raises(LoaderError, match=r"DOCX parsing requires 'docling'"):
                 await loader.load(str(file), ctx)


Suggested change

-        with patch("builtins.__import__", side_effect=lambda name, *args, **kwargs:
-                   (exec('raise ImportError("module not found")') if name == "docling.document_converter" else None)):
-            with pytest.raises(LoaderError, match=r"DOCX parsing requires 'docling'"):
-                await loader.load(str(file), ctx)
+        real_import = __import__
+
+        def _import(name, *args, **kwargs):
+            if name == "docling.document_converter":
+                raise ImportError("module not found")
+            return real_import(name, *args, **kwargs)
+
+        with patch("builtins.__import__", side_effect=_import):
+            with pytest.raises(LoaderError, match=r"DOCX parsing requires 'docling'"):
+                await loader.load(str(file), ctx)
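
The delegating pattern can be tried in isolation, without the SDK or Docling installed. In this sketch the standard-library module `colorsys` stands in for the optional dependency, and `make_blocking_import` and `try_imports` are hypothetical helper names.

```python
import builtins
from unittest.mock import patch


def make_blocking_import(blocked_name):
    """Build a side_effect that fails one module and delegates everything else."""
    real_import = builtins.__import__  # captured before the patch is applied

    def _import(name, *args, **kwargs):
        if name == blocked_name:
            raise ImportError(f"simulated missing module: {name}")
        return real_import(name, *args, **kwargs)

    return _import


def try_imports(blocked_name):
    """Import the blocked module and an unrelated one while the patch is active."""
    results = {}
    with patch("builtins.__import__", side_effect=make_blocking_import(blocked_name)):
        try:
            import colorsys  # noqa: F401  (stand-in for the optional dependency)
            results["blocked"] = "imported"
        except ImportError:
            results["blocked"] = "ImportError"
        import json  # an unrelated import must keep working

        results["unrelated"] = json.dumps({"ok": True})
    return results
```

Because the side effect returns the real module for every other name, code under test can still import whatever it needs, which is the property the original lambda (returning `None`) did not have.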


Path C in retrieve_chunks used `COLLECT(c)[..3]` with no ORDER BY, so hub entities (which can be MENTIONED_IN hundreds of chunks) returned an arbitrary 3, almost never including the chunks most relevant to the current query. Add an ORDER BY on `vec.cosineDistance(c.embedding, query_vector)` before the COLLECT so per-entity chunk selection is query-aware.

Refs FalkorDB#258

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
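
The effect of ordering before collecting can be illustrated outside the database. The sketch below is plain Python, not the query from the commit: `top_chunks_per_entity` is a hypothetical helper, and it assumes non-zero embedding vectors.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_chunks_per_entity(mentions, query_vector, k=3):
    """For each entity keep the k chunk ids closest to the query vector.

    `mentions` maps an entity name to a list of (chunk_id, embedding) pairs.
    Sorting before slicing mirrors "ORDER BY, then COLLECT(...)[..k]".
    """
    return {
        entity: [
            chunk_id
            for chunk_id, _ in sorted(
                chunks, key=lambda c: cosine_distance(c[1], query_vector)
            )[:k]
        ]
        for entity, chunks in mentions.items()
    }


mentions = {
    "Acme": [("c1", [0, 1]), ("c2", [1, 0]), ("c3", [1, 1]), ("c4", [-1, 0])]
}
picked = top_chunks_per_entity(mentions, [1, 0], k=2)
```

Without the sort, the slice would simply take the first two stored chunks (`c1`, `c2`) regardless of the query.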

drr00t and others added 3 commits May 17, 2026 17:10

feat: initial mult-extension loader

31e927e

test(loader): more tests for doclin-base loaders

9bfbab4

drr00t marked this pull request as draft May 19, 2026 00:27

coderabbitai Bot reviewed May 19, 2026


drr00t and others added 4 commits May 18, 2026 21:48

fix(issues): from coderabbitai review

857f870

fix(conflict): need to be updated

4af2f8c

Merge branch 'main' into 241-issue-expand-document-loader-coverage

4f7d37c

galshubeli linked an issue May 19, 2026 that may be closed by this pull request

Expand document loader coverage (HTML, Markdown, DOCX, CSV, JSON, URL, S3, Image) #241

Open

drr00t wants to merge 7 commits into FalkorDB:main from drr00t:241-issue-expand-document-loader-coverage
