Index any documentation locally and use it as a Claude Code skill.
Crawl a docs site, pick the topics you want, and get a local skill that Claude navigates directly — no searching, no guessing.
No extra API calls. No latency. Index your own docs in minutes.
Claude Code can access docs three ways: WebFetch fetches pages directly from the web, MCP servers (like Context7 or docs-mcp-server) expose a search API over pre-indexed docs, and skills (this project) give Claude a local file-based index it can navigate directly.
| WebFetch | MCP Doc Servers | Skills (this project) |
|---|---|---|
| Direct fetch — Claude fetches a documentation URL and reads the page content. | Search API — Claude sends a query to the MCP server, which returns matching results from pre-indexed docs. | Local index — Claude reads SKILL.md, which lists every available page, then reads the target file directly. |
Trade-off: One-time crawl (5-30 min) to generate the skill. Re-crawl when docs update — monthly is usually enough.
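The "local index" lookup is simple enough to sketch. The snippet below shows the general idea — read an index that maps pages to sub-topics, then resolve a question straight to a file. The SKILL.md layout and page names here are illustrative, not the exact format this plugin generates:

```python
# Sketch of navigating a local file-based doc index: the index lists every
# page with its sub-topics, so a lookup is a direct scan, not a search API
# call. Index format and topics below are made up for illustration.
import re

SKILL_MD = """\
- pages/hooks.md: useState, useEffect, custom hooks
- pages/routing.md: route definitions, nested routes, loaders
- pages/forms.md: controlled inputs, validation
"""

def find_page(index_text, topic):
    """Return the pages/*.md path whose sub-topic list mentions `topic`."""
    for line in index_text.splitlines():
        m = re.match(r"- (pages/\S+\.md): (.+)", line)
        if m and topic.lower() in m.group(2).lower():
            return m.group(1)
    return None

print(find_page(SKILL_MD, "useEffect"))   # pages/hooks.md
print(find_page(SKILL_MD, "validation"))  # pages/forms.md
```

Because the whole index fits in one file read, the lookup adds no network round-trip — which is where the latency advantage over a search API comes from.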
Add the marketplace (one-time)
claude /plugin marketplace add https://github.com/eneko-codes/knowledge
Then install the doc-indexer plugin:
claude /plugin install doc-indexer@knowledge
That's it. You can now index any documentation site.
Tell Claude:
Index the React docs
or with a specific URL:
Index the documentation at https://docs.sqlc.dev/en/stable/ for sqlc
If you don't provide a URL, Claude searches for the official docs and confirms with you before crawling.
Claude runs a 7-step pipeline:
- Crawl — visits every page on the docs site using a stealth Chromium browser
- Extract — converts each page to structured markdown with code blocks preserved
- Summarize — shows you what was found, grouped by topic
- You choose — you pick which topics to include from a numbered list
- Filter — Claude reviews each page and removes noise (blog posts, archive listings, empty pages). You approve the final list before proceeding
- Build — assembles the filtered content into a skill with a flat pages/ directory and a rich SKILL.md index listing sub-topics per file
- Validate — checks structural integrity, then re-visits every source page and compares it against the generated markdown (title, headings, code blocks, content length). Mismatches are flagged with screenshots
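The crawl step is, at heart, a breadth-first traversal bounded by the same constraints crawl.py exposes as flags (--same-path-prefix, --max-depth, --exclude-pattern). Here is a minimal sketch over an in-memory link graph — the real script renders pages with Chromium and checkpoints progress, none of which is shown:

```python
# Minimal BFS crawl sketch: follow links breadth-first from the root,
# optionally restricted to the root URL's path prefix, up to a max depth,
# skipping URLs that match exclude patterns. A dict stands in for the web.
import re
from collections import deque

LINKS = {  # toy site: page -> outgoing links
    "https://docs.example.com/guide/": [
        "https://docs.example.com/guide/install",
        "https://docs.example.com/guide/api",
        "https://docs.example.com/blog/news",
    ],
    "https://docs.example.com/guide/install": [],
    "https://docs.example.com/guide/api": ["https://docs.example.com/guide/"],
    "https://docs.example.com/blog/news": [],
}

def bfs_crawl(root, same_path_prefix=True, max_depth=10, exclude_patterns=()):
    seen, queue, order = {root}, deque([(root, 0)]), []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue
        for link in LINKS.get(url, ()):
            if link in seen:
                continue
            if same_path_prefix and not link.startswith(root):
                continue
            if any(re.search(p, link) for p in exclude_patterns):
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return order

pages = bfs_crawl("https://docs.example.com/guide/",
                  exclude_patterns=[r"/blog/"])
print(pages)  # root first, then the two /guide/ subpages; /blog/ is skipped
```

Marking a URL as seen when it is enqueued (not when it is visited) is what keeps a cyclic link graph — like the api page linking back to the root above — from looping forever.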
Example interaction for a large library:
Extracted 287 pages from Laravel 12 documentation.
Topics detected:
1. Eloquent ORM (42 pages)
2. Routing (18 pages)
3. Authentication (24 pages)
4. Blade Templates (15 pages)
5. Validation (12 pages)
6. Queues & Jobs (16 pages)
...
Which topics do you want to include? (e.g., "1, 2, 3, 5" or "all")
You type 1, 2, 3, 5 and Claude builds a focused skill with just those topics — no bloat.
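The selection step boils down to a small parsing rule — either "all" or a 1-indexed list of topic numbers. A sketch, using the Laravel topic names from the example above purely as sample data:

```python
# Sketch of parsing the topic-selection reply: "1, 2, 3, 5" picks those
# numbered topics, "all" keeps everything. Topic names are sample data.
topics = ["Eloquent ORM", "Routing", "Authentication",
          "Blade Templates", "Validation", "Queues & Jobs"]

def select_topics(reply, topics):
    reply = reply.strip().lower()
    if reply == "all":
        return list(topics)
    indices = [int(tok) for tok in reply.replace(",", " ").split()]
    return [topics[i - 1] for i in indices]  # the menu is 1-indexed

print(select_topics("1, 2, 3, 5", topics))
# ['Eloquent ORM', 'Routing', 'Authentication', 'Validation']
```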
Choose where to save the generated skill:
| Scope | Output directory | Who gets the docs | Use when |
|---|---|---|---|
| project | <project>/.claude/skills/<name>-docs/ | Whole team (committed to git) | The library is used by the project |
| user | ~/.claude/skills/<name>-docs/ | Just you, all projects | General-purpose library you use everywhere |
Skills placed in .claude/skills/ directories are auto-discovered by Claude Code — no registration or installation is needed. Just restart your session.
Multiple versions of the same library coexist side by side. For example, indexing Laravel 11 and Laravel 12 produces separate skills: laravel-11-docs and laravel-12-docs. Claude picks the right version based on context, or you can ask about a specific one.
Tip
Add a CLAUDE.md instruction so Claude always uses the docs skill.
Whether you index docs at user scope or project scope, add a line to your project's CLAUDE.md telling Claude to use it:
When working with React, use the react-docs skill to look up documentation.
Without this, Claude may rely on training data instead of the indexed docs. The instruction ensures Claude reaches for the skill first — giving you accurate, version-specific answers every time.
We also recommend disabling other documentation MCP servers you might be using, as they can compete with the docs skill and add unnecessary tool call overhead.
Prerequisites
doc-indexer requires Python 3.8+, Node.js 18+, and downloads a Chromium browser (~200MB) for crawling.
macOS — Python comes pre-installed on macOS 12.3+.
brew install python@3.12 # Homebrew
Linux
# Ubuntu / Debian
sudo apt update && sudo apt install python3 python3-venv python3-pip
# Fedora / RHEL
sudo dnf install python3 python3-pip
# Arch
sudo pacman -S python python-pip
Windows
winget install Python.Python.3.12 # winget (recommended)
choco install python --version=3.12 # Chocolatey
scoop install python # Scoop
Playwright system libraries (Linux only)
sudo npx playwright install-deps chromium
First-time setup
Creates a Python venv, installs Node.js dependencies, and downloads Chromium:
macOS / Linux
cd ~/.claude/plugins/cache/knowledge/*/plugins/doc-indexer/skills/doc-indexer/scripts
bash setup.sh
Windows (PowerShell)
cd $env:USERPROFILE\.claude\plugins\cache\knowledge\*\plugins\doc-indexer\skills\doc-indexer\scripts
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
playwright install chromium
npm install
How the pipeline works
┌─────────────────────────────────────────────────────────────────────┐
│ DOC-INDEXER PIPELINE │
└─────────────────────────────────────────────────────────────────────┘
┌──────────┐
│ setup.sh │ One-time: create .venv, install Python deps,
└────┬─────┘ download Chromium, npm install Defuddle
│
▼
┌─────────────────┐ Pre-crawl site analysis
│ recon.py │ Inputs: root URL
│ (Step 1b) │ Process: raw HTTP fetch → Playwright render → compare
│ │ → check llms.txt, sitemap.xml → classify site
│ │ Outputs: /tmp/<lib>-recon.json (rendering, discovery
│ │ method, suggested flags, URL patterns)
└────────┬────────┘
│ recon report informs crawl parameters
▼
┌─────────────────┐ BFS traversal with Playwright+stealth
│ crawl.py │ Inputs: root URL, --same-path-prefix, --max-depth,
│ (Step 2) │ --exclude-pattern
│ │ Process: launch Chromium → visit pages → extract links
│ │ → save rendered HTML → checkpoint every 20 pages
│ │ Outputs: /tmp/<lib>-sitemap.json (metadata)
│ │ /tmp/<lib>-html/*.html (rendered pages)
└────────┬────────┘
│ sitemap.json + saved HTML files
▼
┌─────────────────┐ Per-page content extraction
│ extract.py │ Inputs: sitemap.json, /tmp/<lib>-html/
│ (Step 3) │ Process: for each HTML file:
│ │ │ ┌──────────────────────────┐
│ ├───────►│───────│ defuddle_extract.mjs │ Node.js subprocess
│ │ │ │ (Defuddle → markdown) │ per HTML file
│ │ │ └──────────────────────────┘
│ │ → parse markdown: code blocks, headings, sigs
│ │ Outputs: /tmp/<lib>-extracted/*.json (one per page)
└────────┬────────┘
│ extracted JSON files
▼
┌─────────────────┐ Human-in-the-loop curation
│ Claude Agent │ Process: read extracted JSONs → group by topic
│ (Step 4) │ → present to user → user picks topics
│ │ → quality filter (KEEP/SKIP) → user confirms
│ │ → delete skipped JSON files
│ │ Output: filtered /tmp/<lib>-extracted/ (only kept pages)
└────────┬────────┘
│ filtered extracted JSONs
▼
┌─────────────────┐ Assemble skill from filtered content
│ build_plugin.py │ Inputs: library name, extracted dir, version, source URL
│ (Step 5) │ Process: load JSONs → generate pages/*.md from template
│ │ → generate SKILL.md (index + file listing)
│ │ Outputs: <output-dir>/SKILL.md
│ │ <output-dir>/pages/*.md
│ │ Templates used:
│ │ ├── SKILL_template.md
│ │ └── section_template.md
└────────┬────────┘
│ built skill directory
▼
┌─────────────────┐ Structural integrity checks
│ validate.py │ Inputs: skill dir, --extracted-dir
│ (Step 6) │ Checks: SKILL.md frontmatter ✓
│ │ link resolution (all paths exist) ✓
│ │ no empty files ✓
│ │ page count matches extracted ✓
│ │ section coverage ≥ 90% ✓
│ │ signature coverage ≥ 80% ✓
│ │ Output: exit 0 (pass) or exit 1 (fail) + report
└────────┬────────┘
│
▼
┌─────────────────┐ Live accuracy verification
│ verify.py │ Inputs: skill dir, --screenshot-dir
│ (Step 6b) │ Process: for each pages/*.md file:
│ │ → extract source URL from "> Source:" line
│ │ → re-visit live page with Playwright+stealth
│ │ → re-extract via Defuddle for baseline
│ │ → compare signals (title, headings, code, length)
│ │ → screenshot mismatched pages
│ │ Output: exit 0 (pass) or exit 1 (mismatches) + report
└────────┬────────┘
│
▼
┌──────────┐
│ FINALIZE │ Skill is auto-discovered from .claude/skills/
│ (Step 7) │ cleanup.sh removes /tmp/ files
│ │ teardown.sh optionally removes .venv + Chromium
└──────────┘
crawl.py — Stealth Chromium browser with anti-fingerprint patches. BFS-crawls from the root URL and saves the HTML to disk. Each page is visited only once — extract.py reuses the saved HTML.
extract.py — Extracts main content from saved HTML using Defuddle for algorithmic content detection with code block standardization (language detection, line number removal, toolbar cleanup). No browser needed — works fully offline. Resumable — skips already-extracted pages on re-run.
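One of the standardization passes mentioned above — stripping the line numbers that some doc sites render into code blocks — can be sketched in a few lines. The real extract.py relies on Defuddle and does considerably more (language detection, toolbar cleanup); this shows only the line-number case:

```python
# Sketch of line-number removal during code-block standardization. Strips
# leading numbers only when every non-empty line starts with one, so real
# numeric code (e.g. a list of ints) is left untouched.
import re

def strip_line_numbers(code):
    lines = code.splitlines()
    if all(re.match(r"\s*\d+\s", ln) for ln in lines if ln.strip()):
        lines = [re.sub(r"^\s*\d+\s?", "", ln) for ln in lines]
    return "\n".join(lines)

numbered = "1 def greet():\n2     return 'hi'"
print(strip_line_numbers(numbered))   # line numbers removed
print(strip_line_numbers("x = [1, 2]"))  # unchanged: not a numbered listing
```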
Claude review — Reads extracted content, groups by topic, asks user which topics to keep. Then filters out noise: blog posts, archive listings, empty pages, duplicates. User approves the final list.
build_plugin.py — Writes all pages into a flat pages/ directory. Generates SKILL.md index with H2 sub-topic descriptions per file. Outputs the skill directory (SKILL.md + pages/) directly to the specified --output-dir.
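The index-generation idea — one line per page, annotated with that page's H2 headings as sub-topics — can be sketched like this. The line format and sample pages are illustrative, not the exact templates build_plugin.py uses:

```python
# Sketch of building a SKILL.md index: for each page, collect its H2
# headings as sub-topic descriptions and emit one index line. The output
# format is illustrative, not the plugin's exact template.
import re

pages = {
    "pages/routing.md": "# Routing\n## Basic Routes\ntext\n## Route Parameters\ntext",
    "pages/validation.md": "# Validation\n## Rules\ntext\n## Custom Messages\ntext",
}

def build_index(pages):
    lines = []
    for path in sorted(pages):
        h2s = re.findall(r"^## (.+)$", pages[path], flags=re.M)
        lines.append(f"- {path}: {', '.join(h2s)}")
    return "\n".join(lines)

print(build_index(pages))
```

Listing sub-topics in the index is what lets Claude pick the right file from SKILL.md alone, without opening every page first.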
validate.py — Structural checks: SKILL.md frontmatter, file paths resolve, no empty files.
verify.py — Accuracy check: re-visits every source URL with Playwright, compares title, heading count, code block count, and content length against the generated markdown. Full-page screenshots of mismatched pages.
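The comparison verify.py performs can be approximated with cheap "signals" derived from each markdown document. A sketch — the exact signals and tolerances in the real script may differ:

```python
# Sketch of signal-based verification: derive title, heading count,
# code-block count, and content length from two markdown strings and report
# which signals diverge. The 20% length tolerance is an assumption.
import re

FENCE = "`" * 3  # a literal triple backtick, built programmatically so it
                 # does not terminate this example's own markdown fence

def signals(md):
    title = re.search(r"^# (.+)$", md, flags=re.M)
    return {
        "title": title.group(1) if title else None,
        "headings": len(re.findall(r"^#{1,6} ", md, flags=re.M)),
        "code_blocks": md.count(FENCE) // 2,
        "length": len(md),
    }

def compare(stored_md, live_md, length_tolerance=0.2):
    stored, live = signals(stored_md), signals(live_md)
    mismatched = [k for k in ("title", "headings", "code_blocks")
                  if stored[k] != live[k]]
    if live["length"] and abs(stored["length"] - live["length"]) / live["length"] > length_tolerance:
        mismatched.append("length")
    return mismatched

page = "# Hooks\n## useState\n" + FENCE + "js\ncode\n" + FENCE
print(compare(page, page))         # [] -> all signals match
print(compare(page, "# Forms\n"))  # every signal diverges
```

Comparing signals instead of raw text keeps the check robust to cosmetic differences (whitespace, ad rotation) while still catching the changes that matter: renamed pages, added sections, removed code examples.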
CLI Reference
recon.py — Analyze a documentation site before crawling
python3 recon.py <root-url> [options]
--output FILE Output report path (default: /tmp/recon.json)
--timeout SECS Max total runtime in seconds (default: 30)
crawl.py — Discover documentation pages via BFS crawl
python3 crawl.py <root-url> [options]
python3 crawl.py --from-urls <file> [options]
--output FILE Output sitemap path (default: sitemap.json)
--max-depth N Max link-follow depth from root (default: 10)
--max-pages N Stop after N pages, 0 = unlimited (default: 0)
--delay SECS Base delay between requests (default: 0.5)
--same-path-prefix Only follow links under the root URL's path
--exclude-pattern REGEX Skip URLs matching regex (repeatable)
--from-urls FILE Fetch URLs from file instead of BFS crawling
extract.py — Convert saved HTML to structured markdown
python3 extract.py <sitemap.json> [options]
--output DIR Output directory for JSON files (default: extracted/)
--force Re-extract even if output file already exists
build_plugin.py — Assemble extracted content into a documentation skill
python3 build_plugin.py <library-name> <extracted-dir> [options]
--version LABEL Documentation version label (default: latest)
--source-url URL Original documentation URL
--output-dir DIR Skill output directory (required)
validate.py — Check skill structural integrity
python3 validate.py <skill-dir> [options]
--extracted-dir DIR Cross-reference against filtered extracted JSON files
verify.py — Compare generated content against live source pages
python3 verify.py <skill-dir> [options]
--delay SECS Base delay between requests (default: 0.5)
--screenshot-dir DIR Save full-page screenshots of mismatched pages
cleanup.sh — Remove temporary files after successful indexing
bash cleanup.sh <library-name>
teardown.sh — Remove venv, node_modules, and Chromium (~300MB)
bash teardown.sh
Troubleshooting
Playwright fails to install Chromium — Ensure internet access and ~200MB disk space.
Crawl gets blocked (403/429) — Increase the delay: python3 crawl.py <url> --delay 3.0
Empty markdown extraction — Defuddle failed to extract content. This is rare but can happen with very unusual HTML structures. Open an issue with the URL.
MIT License