Problem
When ingesting markdown files served by static-site generators (Jekyll, Hugo, MkDocs, Docusaurus), the source .md contains template directives — Liquid tags {% ... %}, expressions {{ ... }}, YAML front matter — that the renderer strips before serving HTML. The SDK ingests them as-is, so they enter the chunker and get embedded alongside actual content.
These directives are semantic noise, but they dominate the surface form of short chunks. Retrieval on heavily templated pages suffers as a result: in the case documented under Evidence, the relevant chunk ranked #35 before cleaning and #4 after.
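To make "dominate the surface form" concrete, here is a rough sketch that measures what share of a chunk's characters belong to Liquid directives. The `template_ratio` helper and the sample chunk are illustrative assumptions modeled on the capture/include pattern described in this issue; they are not part of the SDK and the sample is not the actual file content.

```python
import re

# Matches Liquid tags ({% ... %}) and output expressions ({{ ... }}).
DIRECTIVE = re.compile(r"\{%.*?%\}|\{\{.*?\}\}", re.DOTALL)


def template_ratio(chunk: str) -> float:
    """Return the fraction of characters in `chunk` that sit inside Liquid directives."""
    if not chunk:
        return 0.0
    directive_chars = sum(len(m.group(0)) for m in DIRECTIVE.finditer(chunk))
    return directive_chars / len(chunk)


# Illustrative chunk shaped like a templated install snippet (hypothetical sample).
chunk = (
    "{% capture pypi_0 %}\n"
    "pip install falkordb\n"
    "{% endcapture %}\n"
    "{% include code_tabs.html %}\n"
)
ratio = template_ratio(chunk)
print(f"directive share: {ratio:.0%}")
```

For a chunk like this, well over half of the embedded text is templating rather than the install command itself.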
Evidence
GraphRAG-SDK powers the FalkorDB docs widget (docs.falkordb.com). The docs site is built with Jekyll. getting-started/index.md wraps every install command in {% capture pypi_0 %} pip install falkordb {% endcapture %} … {% include code_tabs.html %} blocks.
Raw vector search (db.idx.vector.queryNodes, top-50) for "how to install falkordb?":
| | install-doc chunk rank | cosine |
|---|---|---|
| Before preprocessing | #35 | 0.397 |
| After stripping Jekyll | #4 | 0.433 |
End-to-end widget answers also flipped from "This isn't covered in the FalkorDB docs" → actual install instructions for "I need to install anything?" and "how to install falkordb?".
What we did locally
~10-line cleaner applied before rag.ingest(...):
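    import re

    def strip_jekyll(text: str) -> str:
        # Drop YAML front matter delimited by an opening and a closing '---' line.
        if text.startswith("---\n"):
            end = text.find("\n---\n", 4)
            if end != -1:
                text = text[end + 5:]
        # Remove Liquid tags, including the whitespace-control forms {%- ... -%}.
        text = re.sub(r"\{%-?.*?-?%\}", "", text, flags=re.DOTALL)
        # Remove Liquid output expressions {{ ... }}.
        text = re.sub(r"\{\{.*?\}\}", "", text, flags=re.DOTALL)
        # Trim trailing whitespace and collapse runs of blank lines.
        text = re.sub(r"[ \t]+\n", "\n", text)
        text = re.sub(r"\n{3,}", "\n\n", text)
        return text.strip()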
Content inside Liquid tag pairs (e.g. the command between {% capture %} / {% endcapture %}) is kept — only the directives are removed.
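As a quick check of that behavior, the sketch below restates the cleaner so it runs on its own and applies it to a made-up page. The sample text is an assumption shaped like the capture/include pattern described above, not the real getting-started/index.md.

```python
import re


def strip_jekyll(text: str) -> str:
    # Same logic as the cleaner above, restated so this example is self-contained.
    if text.startswith("---\n"):
        end = text.find("\n---\n", 4)
        if end != -1:
            text = text[end + 5:]
    text = re.sub(r"\{%-?.*?-?%\}", "", text, flags=re.DOTALL)
    text = re.sub(r"\{\{.*?\}\}", "", text, flags=re.DOTALL)
    text = re.sub(r"[ \t]+\n", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


# Hypothetical page: front matter, a heading, and a captured install command.
sample = (
    "---\n"
    "title: Getting Started\n"
    "---\n"
    "## Install\n"
    "\n"
    "{% capture pypi_0 %}\n"
    "pip install falkordb\n"
    "{% endcapture %}\n"
    "{% include code_tabs.html %}\n"
)

cleaned = strip_jekyll(sample)
print(cleaned)
```

The front matter and the three directives disappear, while the heading and the `pip install falkordb` line between the capture tags survive.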
Suggested fix
Either:
1. Preprocessor hook on ingestion: rag.ingest(..., preprocess=fn) so consumers can clean content before chunking without subclassing strategies.
2. Built-in MarkdownLoader with strip_templating=True (Jekyll/Liquid first; extend later to Hugo {{< >}}, MDX, Docusaurus admonitions).
Option 2 fits the loader-coverage theme of #241 and is more discoverable.
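One possible shape for the two options combined, purely as a sketch: `SketchMarkdownLoader`, its `strip_templating` flag, its `preprocess` argument, and the `strip_liquid` helper are all hypothetical names for discussion, not existing GraphRAG-SDK APIs. The class only returns cleaned text; how that would plug into the chunker is left open.

```python
import re
from pathlib import Path
from typing import Callable, Optional

Preprocessor = Callable[[str], str]


def strip_liquid(text: str) -> str:
    """Hypothetical helper: remove Liquid tags and expressions, keep enclosed content."""
    text = re.sub(r"\{%-?.*?-?%\}", "", text, flags=re.DOTALL)
    text = re.sub(r"\{\{.*?\}\}", "", text, flags=re.DOTALL)
    return text.strip()


class SketchMarkdownLoader:
    """Hypothetical loader illustrating both suggestions; not an SDK class."""

    def __init__(
        self,
        strip_templating: bool = False,
        preprocess: Optional[Preprocessor] = None,
    ) -> None:
        self.strip_templating = strip_templating
        self.preprocess = preprocess

    def load_text(self, raw: str) -> str:
        text = raw
        # Option 2: built-in template stripping behind a flag.
        if self.strip_templating:
            text = strip_liquid(text)
        # Option 1: user-supplied hook applied before chunking.
        if self.preprocess is not None:
            text = self.preprocess(text)
        return text

    def load(self, path: str) -> str:
        return self.load_text(Path(path).read_text(encoding="utf-8"))


loader = SketchMarkdownLoader(strip_templating=True)
print(loader.load_text("{% capture x %}\npip install falkordb\n{% endcapture %}"))
```

Keeping the flag off by default would leave existing ingestion behavior unchanged, and the hook gives consumers an escape hatch for generators the built-in stripping does not yet cover.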