Loader: strip static-site-generator templating (Jekyll/Hugo/etc.) before embedding chunks #258

## Problem

When ingesting markdown files that are the source for static-site generators (Jekyll, Hugo, MkDocs, Docusaurus), the `.md` contains template directives (Liquid tags `{% ... %}`, expressions `{{ ... }}`, YAML front matter) that the generator consumes at build time and that never appear in the served HTML. The SDK ingests them as-is, so they enter the chunker and get embedded alongside actual content.

These directives are semantic noise, but they dominate the surface form of short chunks, which drags down retrieval recall on heavily templated pages: in the example below, the relevant chunk ranked #35 with the directives in place and #4 once they were stripped.

## Evidence

GraphRAG-SDK powers the FalkorDB docs widget (docs.falkordb.com). The docs site is built with Jekyll. `getting-started/index.md` wraps every install command in `{% capture pypi_0 %} pip install falkordb {% endcapture %}` … `{% include code_tabs.html %}` blocks.

Raw vector search (`db.idx.vector.queryNodes`, top-50) for `"how to install falkordb?"`:

| Ingested text | Install-doc chunk rank | Cosine similarity |
|---|---|---|
| Before preprocessing | **#35** | 0.397 |
| After stripping Jekyll | **#4** | 0.433 |

End-to-end widget answers for `"I need to install anything?"` and `"how to install falkordb?"` also flipped from `"This isn't covered in the FalkorDB docs"` to the actual install instructions.

## What we did locally

A small cleaner applied to each file's text before `rag.ingest(...)`:

```python
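import re


def strip_jekyll(text: str) -> str:
    # Drop YAML front matter: a leading "---" line through the next "---" line.
    if text.startswith("---\n"):
        # Search from index 3 so empty front matter ("---\n---\n") also matches.
        end = text.find("\n---\n", 3)
        if end != -1:
            text = text[end + 5:]
    # Remove Liquid tags ({% ... %}, {%- ... -%}) and expressions ({{ ... }}).
    text = re.sub(r"\{%-?.*?-?%\}", "", text, flags=re.DOTALL)
    text = re.sub(r"\{\{.*?\}\}", "", text, flags=re.DOTALL)
    # Tidy the whitespace left behind by the removed directives.
    text = re.sub(r"[ \t]+\n", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()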
```

Content inside Liquid tag *pairs* (e.g. the command between `{% capture %}` / `{% endcapture %}`) is kept — only the directives are removed.
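
A quick check of that behaviour on a fragment modelled on the `getting-started/index.md` pattern described above. The sample string is made up for illustration (it is not the real docs page), and a version of the cleaner is repeated so the snippet runs on its own:

```python
import re


def strip_jekyll(text: str) -> str:
    # Same approach as the cleaner above, repeated so this snippet is runnable.
    if text.startswith("---\n"):
        end = text.find("\n---\n", 3)
        if end != -1:
            text = text[end + 5:]
    text = re.sub(r"\{%-?.*?-?%\}", "", text, flags=re.DOTALL)
    text = re.sub(r"\{\{.*?\}\}", "", text, flags=re.DOTALL)
    text = re.sub(r"[ \t]+\n", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


# Illustrative input only: front matter, a capture pair, and an include.
sample = (
    "---\n"
    "title: Getting Started\n"
    "---\n"
    "## Install\n"
    "\n"
    "{% capture pypi_0 %}\n"
    "pip install falkordb\n"
    "{% endcapture %}\n"
    "{% include code_tabs.html %}\n"
)

cleaned = strip_jekyll(sample)
print(cleaned)
# ## Install
#
# pip install falkordb
```

The front matter, the `capture` pair, and the `include` are gone, while the heading and the install command survive as plain text for the chunker.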

## Suggested fix

Either:
1. **Preprocessor hook on ingestion** — `rag.ingest(..., preprocess=fn)` so consumers can clean content before chunking without subclassing strategies.
2. **Built-in `MarkdownLoader`** with `strip_templating=True` (Jekyll/Liquid first; extend later to Hugo `{{< >}}`, MDX, Docusaurus admonitions).

Option 2 fits the loader-coverage theme of #241 and is more discoverable.
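
To make the two options concrete, here is a rough, standalone sketch of how they might combine. Everything in it is hypothetical: `MarkdownLoader`, `strip_templating`, `preprocess`, `load_text`, and `strip_liquid` are illustrative names, not existing GraphRAG-SDK API, and the real integration point would be wherever the SDK reads file text before chunking.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical sketch: none of these names exist in GraphRAG-SDK today.
_FRONT_MATTER = re.compile(r"\A---\n(?:.*?\n)?---\n", re.DOTALL)
_LIQUID = re.compile(r"\{%-?.*?-?%\}|\{\{.*?\}\}", re.DOTALL)


def strip_liquid(text: str) -> str:
    """Drop YAML front matter and Liquid directives, keep the wrapped content."""
    text = _FRONT_MATTER.sub("", text, count=1)
    text = _LIQUID.sub("", text)
    text = re.sub(r"[ \t]+\n", "\n", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()


@dataclass
class MarkdownLoader:
    # Option 2: built-in flag, off by default so existing behaviour is unchanged.
    strip_templating: bool = False
    # Option 1: arbitrary user hook, applied after the built-in stripping.
    preprocess: Optional[Callable[[str], str]] = None

    def load_text(self, raw: str) -> str:
        """Return the text that would be handed to the chunker."""
        text = strip_liquid(raw) if self.strip_templating else raw
        if self.preprocess is not None:
            text = self.preprocess(text)
        return text


# Usage: built-in stripping plus a trivial user hook (lower-casing).
loader = MarkdownLoader(strip_templating=True, preprocess=str.lower)
result = loader.load_text("{% capture c %}\nPip Install FalkorDB\n{% endcapture %}\n")
print(result)  # pip install falkordb
```

Keeping the flag off by default would leave current ingestion untouched, and the hook leaves room for generator-specific cleanup (Hugo shortcodes, MDX) without new loader options for each.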
