Loader: strip static-site-generator templating (Jekyll/Hugo/etc.) before embedding chunks #258

@galshubeli

Problem

When ingesting the markdown sources of sites built with static-site generators (Jekyll, Hugo, MkDocs, Docusaurus), the .md files contain template directives (Liquid tags {% ... %}, expressions {{ ... }}, YAML front matter) that the renderer resolves or strips before serving HTML. The SDK ingests the files as-is, so the directives enter the chunker and get embedded alongside the actual content.

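These directives are semantic noise, but they dominate the surface form of short chunks, so the embedding of a short chunk reflects the templating more than the content. Retrieval recall on heavily templated pages suffers as a result: in the measurement under Evidence, the relevant chunk ranked #35 before cleaning and #4 after.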
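To make "dominate the surface form" concrete, here is a minimal sketch that measures what fraction of a chunk's characters belong to Liquid directives. The helper name `templating_ratio` is hypothetical and not part of the SDK; the regex mirrors the one used in the local cleaner further down, and the sample chunk is the install snippet quoted under Evidence.

```python
import re

# Liquid tags ({% ... %}, including {%- ... -%}) and expressions ({{ ... }}).
DIRECTIVE = re.compile(r"\{%-?.*?-?%\}|\{\{.*?\}\}", re.DOTALL)


def templating_ratio(chunk: str) -> float:
    """Return the share of characters in `chunk` that sit inside Liquid directives."""
    if not chunk:
        return 0.0
    directive_chars = sum(len(m.group(0)) for m in DIRECTIVE.finditer(chunk))
    return directive_chars / len(chunk)


chunk = (
    "{% capture pypi_0 %} pip install falkordb "
    "{% endcapture %}{% include code_tabs.html %}"
)
ratio = templating_ratio(chunk)
print(f"{ratio:.2f}")
```

For this snippet roughly three quarters of the characters are directives, which is the kind of chunk where the embedding is likely to drift away from the actual command.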

Evidence

GraphRAG-SDK powers the FalkorDB docs widget (docs.falkordb.com). The docs site is built with Jekyll. getting-started/index.md wraps every install command in {% capture pypi_0 %} pip install falkordb {% endcapture %}{% include code_tabs.html %} blocks.

Raw vector search (db.idx.vector.queryNodes, top-50) for "how to install falkordb?":

| | install-doc chunk rank | cosine |
|---|---|---|
| Before preprocessing | #35 | 0.397 |
| After stripping Jekyll | #4 | 0.433 |
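For anyone repeating the measurement on their own pages, the rank column can be computed from the raw top-k results with a small helper. This is a sketch under assumptions: `rank_of` is a hypothetical name, results are assumed to arrive as `(chunk_id, score)` pairs, and the ids and scores in the usage lines are placeholders, not the numbers from the table.

```python
from typing import Iterable, Optional, Tuple


def rank_of(
    results: Iterable[Tuple[str, float]],
    target_id: str,
    higher_is_better: bool = True,
) -> Optional[int]:
    """Return the 1-based rank of `target_id` among scored results, or None if absent.

    Set higher_is_better=False if the index reports a distance
    rather than a similarity.
    """
    ordered = sorted(results, key=lambda pair: pair[1], reverse=higher_is_better)
    for position, (chunk_id, _score) in enumerate(ordered, start=1):
        if chunk_id == target_id:
            return position
    return None


# Placeholder data for illustration only.
toy_results = [("chunk-a", 0.41), ("install-doc", 0.43), ("chunk-b", 0.39)]
print(rank_of(toy_results, "install-doc"))
```

Running the same query before and after cleaning, and comparing the two ranks, gives the before/after comparison shown in the table.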

End-to-end widget answers also flipped: for the queries "I need to install anything?" and "how to install falkordb?", the widget went from "This isn't covered in the FalkorDB docs" to returning the actual install instructions.

What we did locally

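A ~10-line cleaner applied to each file's text before rag.ingest(...):

import re

def strip_jekyll(text: str) -> str:
    # Drop leading YAML front matter (--- ... ---).
    if text.startswith("---\n"):
        # Search from index 3 so empty front matter ("---\n---\n") is also matched.
        end = text.find("\n---\n", 3)
        if end != -1:
            text = text[end + 5:]
    # Remove Liquid tags ({% ... %}, {%- ... -%}) and expressions ({{ ... }}).
    text = re.sub(r"\{%-?.*?-?%\}", "", text, flags=re.DOTALL)
    text = re.sub(r"\{\{.*?\}\}",   "", text, flags=re.DOTALL)
    # Tidy up: trailing whitespace, then runs of 3+ newlines left by removed tags.
    text = re.sub(r"[ \t]+\n", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()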

Content inside Liquid tag pairs (e.g. the command between {% capture %} / {% endcapture %}) is kept — only the directives are removed.
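A quick check of that behaviour on the getting-started snippet quoted under Evidence. This sketch applies only the tag-removal regex from the cleaner, which is enough to show that the command between the tags survives:

```python
import re

# Same tag pattern as in the cleaner: {% ... %} with optional whitespace-control dashes.
LIQUID_TAG = re.compile(r"\{%-?.*?-?%\}", re.DOTALL)

snippet = (
    "{% capture pypi_0 %} pip install falkordb "
    "{% endcapture %}{% include code_tabs.html %}"
)
cleaned = LIQUID_TAG.sub("", snippet).strip()
print(cleaned)  # pip install falkordb
```

The non-greedy `.*?` matters here: a greedy match would run from the first `{%` to the last `%}` and delete the install command along with the tags.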

Suggested fix

Either:

  1. Preprocessor hook on ingestion: rag.ingest(..., preprocess=fn), so consumers can clean content before chunking without subclassing strategies.
  2. Built-in MarkdownLoader with strip_templating=True (Jekyll/Liquid first; extend later to Hugo {{< >}}, MDX, Docusaurus admonitions).
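As a rough illustration of option 1, the hook could amount to an optional callable applied to the text before it reaches the chunker. The sketch below is not the SDK's code: `ingest_with_preprocess` and the `ingest_fn` stand-in are hypothetical, and the real `rag.ingest` signature may differ.

```python
from typing import Any, Callable, Optional


def ingest_with_preprocess(
    ingest_fn: Callable[[str], Any],
    text: str,
    preprocess: Optional[Callable[[str], str]] = None,
) -> Any:
    """Apply an optional user-supplied cleaner, then hand the text to ingestion.

    `ingest_fn` stands in for whatever the SDK does after loading the text
    (chunking, embedding, graph writes); it is an assumption of this sketch.
    """
    if preprocess is not None:
        text = preprocess(text)
    return ingest_fn(text)


# Usage with a recording stub in place of the real ingestion pipeline.
seen = []
ingest_with_preprocess(seen.append, "  pip install falkordb  ", preprocess=str.strip)
print(seen)  # ['pip install falkordb']
```

Defaulting `preprocess` to None keeps existing callers unaffected, which is the main appeal of the hook over a new loader class.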

Option 2 fits the loader-coverage theme of #241 and is more discoverable.
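A minimal sketch of what option 2 might look like, limited to Jekyll/Liquid. The names `MarkdownLoader` and `strip_templating` come from the proposal above; everything else (the `load_text` and `load` methods, the dataclass shape) is assumed for illustration and does not reflect an existing SDK interface.

```python
import re
from dataclasses import dataclass
from pathlib import Path

# Leading YAML front matter, including the empty "---\n---\n" case.
_FRONT_MATTER = re.compile(r"\A---\n(?:.*?\n)?---\n", re.DOTALL)
# Liquid tags and expressions.
_LIQUID = re.compile(r"\{%-?.*?-?%\}|\{\{.*?\}\}", re.DOTALL)


@dataclass
class MarkdownLoader:
    """Hypothetical loader: returns markdown text, optionally without templating."""

    strip_templating: bool = False

    def load_text(self, raw: str) -> str:
        if not self.strip_templating:
            return raw
        text = _FRONT_MATTER.sub("", raw, count=1)
        text = _LIQUID.sub("", text)
        text = re.sub(r"[ \t]+\n", "\n", text)   # trailing whitespace
        text = re.sub(r"\n{3,}", "\n\n", text)   # gaps left by removed tags
        return text.strip()

    def load(self, path: str) -> str:
        return self.load_text(Path(path).read_text(encoding="utf-8"))


doc = (
    "---\ntitle: Install\n---\n"
    "{% capture pypi_0 %}\npip install falkordb\n{% endcapture %}\n"
    "{% include code_tabs.html %}\n"
)
print(MarkdownLoader(strip_templating=True).load_text(doc))  # pip install falkordb
```

Keeping the flag off by default would leave current ingestion behaviour unchanged; Hugo shortcodes, MDX, and Docusaurus admonitions would need their own patterns and are not handled here.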
