Turn your local folder into a structured, queryable knowledge base — powered by AI agents.
Most document management tools either lock you into a cloud service or require a vector database. local-knowledge-base takes a different approach: it gives your AI agent (Claude Code, Codex, OpenClaw, etc.) the ability to ingest documents, maintain a navigable index, and answer questions — all from plain files on your machine.
| Feature | What it does | Why it matters |
|---|---|---|
| Document Conversion | Converts DOCX, DOC, PDF, PPTX, PPT → Markdown | Preserves structure (tables, lists, images), not just raw text |
| Scanned PDF Detection | Identifies scanned PDFs and routes to OCR | No more silent garbage output from image-based pages |
| Excel Smart Routing | Analyzes spreadsheet complexity before processing | Simple tables get fast Pandas parsing; complex reports get semantic HTML |
| 5-Level Q&A Chain | FAQ → README navigation → targeted reading → keyword search → BADCASE logging | Fast answers first, full-text search as last resort |
| KB Initialization & Migration | Sets up directory structure, moves existing KBs | One command to start; non-destructive migration |
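The 5-level Q&A chain in the table can be read as an ordered fallback: each level either answers or hands the question to the next one, and a question nobody answers is logged as a BADCASE. The sketch below is only an illustration of that idea. The `answer` function, the tier names, and the stand-in lambdas are hypothetical and are not taken from the Skill's scripts.

```python
from typing import Callable, Optional

# A tier takes the question and returns an answer, or None to fall through.
Tier = Callable[[str], Optional[str]]


def answer(question, tiers, badcases):
    """Try each tier in order; record the question as a BADCASE if none answers."""
    for name, tier in tiers:
        result = tier(question)
        if result is not None:
            return name, result
    badcases.append(question)
    return "badcase", None


# Hypothetical stand-ins for the first four levels of the chain.
faq = {"what is the vacation policy?": "20 days per year."}
tiers = [
    ("faq", lambda q: faq.get(q.lower())),
    ("readme_navigation", lambda q: None),
    ("targeted_reading", lambda q: None),
    ("keyword_search",
     lambda q: "Found in docs/onboarding.md" if "onboarding" in q.lower() else None),
]

badcases = []
print(answer("What is the vacation policy?", tiers, badcases))
print(answer("Where is the onboarding checklist?", tiers, badcases))
print(answer("Who maintains the build server?", tiers, badcases))
print(badcases)
```

Ordering the tiers from cheapest to most expensive is what lets common questions return before any full-text search runs.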
┌─────────────────────────────────────────────┐
│ Your AI Agent (host app) │
└──────────────────┬──────────────────────────┘
│ installs & invokes
▼
┌──────────────────────────────────────────────────────────┐
│ local-knowledge-base Skill │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌────────────────────┐ │
│ │ Ingestion │ │ Q&A Engine │ │ KB Management │ │
│ │ │ │ │ │ │ │
│ │ DOCX ───┐│ │ FAQ ────────┐│ │ Init / Migrate / │ │
│ │ PDF ───┤│ │ README nav ─┤│ │ Config │ │
│ │ PPTX ───┤│ │ File read ──┤│ └────────────────────┘ │
│ │ Excel ──┘│ │ Grep ───────┤│ │
│ └──────────┘ │ BADCASE ────┘│ │
│ └──────────────┘ │
└──────────────────────────────────────────────────────────┘
│ reads & writes
▼
┌──────────────────────────┐
│ ~/your-knowledge-base │
│ │
│ README.md (index) │
│ FAQ.md (Q&A pairs) │
│ BADCASE.md (gaps) │
│ docs/ (content) │
└──────────────────────────┘
The Skill is a plugin for AI agents, not a standalone CLI app. It teaches your agent how to manage a knowledge base through structured workflows and Python scripts. Think of it as giving your agent a new professional capability.
# Clone the repository
git clone https://github.com/Harryoung/local-knowledge-base.git
# The Skill is the local-knowledge-base/ subdirectory.
# Point your AI client to this folder as a local Skill.
Works with any client that supports the Skill format: Claude Code, Codex, OpenClaw, and others.
Once installed, just talk to your agent naturally:
- "Set up a knowledge base at ~/work/kb"
- "Ingest this PDF into the knowledge base"
- "What does our onboarding doc say about vacation policy?"
- "Move the knowledge base to a new folder"
The Skill handles format conversion, conflict detection, index maintenance, and retrieval automatically.
| Format | Method | Notes |
|---|---|---|
| DOCX | Pandoc | Full structure preservation |
| DOC | LibreOffice → Pandoc | Converts to DOCX first |
| PDF (digital) | PyMuPDF4LLM | Fast, high-fidelity extraction |
| PDF (scanned) | Detected → paddleocr-doc-parsing OCR routing | Returns needs_ocr: true; OCR depends on https://clawhub.ai/Bobholamovic/paddleocr-doc-parsing |
| PPTX | pptx2md | Preserves slide structure and speaker notes |
| PPT | LibreOffice → pptx2md | Converts to PPTX first |
| Excel | Complexity analyzer | Routes to Pandas (simple) or HTML semantic mode (complex) |
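The routing in this table could be sketched roughly as follows. This is a sketch under stated assumptions, not the Skill's actual code: the `route` and `looks_scanned` functions, the text-density thresholds, the `.xlsx`/`.xls` extensions, and the use of merged cells as the complexity signal are all hypothetical. Only the method names and the `needs_ocr` flag come from the table.

```python
from pathlib import Path

# Extension -> conversion method, as listed in the table above.
METHODS = {
    ".docx": "pandoc",
    ".doc": "libreoffice -> pandoc",
    ".pptx": "pptx2md",
    ".ppt": "libreoffice -> pptx2md",
}


def looks_scanned(page_texts, min_chars=50, max_sparse_ratio=0.5):
    """Assumed heuristic: a PDF is 'scanned' when most pages yield almost no text."""
    if not page_texts:
        return True  # no extractable text at all
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars)
    return sparse / len(page_texts) > max_sparse_ratio


def route(filename, page_texts=None, has_merged_cells=False):
    """Return a dict describing how a file would be handled."""
    ext = Path(filename).suffix.lower()
    if ext == ".pdf":
        if looks_scanned(page_texts or []):
            return {"method": "paddleocr-doc-parsing", "needs_ocr": True}
        return {"method": "pymupdf4llm", "needs_ocr": False}
    if ext in (".xlsx", ".xls"):
        # Assumed signal: merged cells mark a "complex" report.
        mode = "html-semantic" if has_merged_cells else "pandas"
        return {"method": mode, "needs_ocr": False}
    if ext in METHODS:
        return {"method": METHODS[ext], "needs_ocr": False}
    raise ValueError(f"Unsupported format: {ext!r}")


print(route("handbook.docx"))
print(route("scan.pdf", page_texts=["", " ", "x"]))
print(route("budget.xlsx", has_merged_cells=True))
```

Returning a flag instead of attempting OCR inline keeps the decision visible to the agent, which can then hand the file to the OCR dependency.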
The repo separates the Skill runtime from development files:
.
├── local-knowledge-base/ ← The Skill (what gets installed)
│ ├── SKILL.md Entry point & workflow definitions
│ ├── requirements.txt Runtime dependencies
│ ├── scripts/ Python scripts (convert, analyze, init)
│ ├── assets/ Templates
│ └── references/ Detailed workflow documentation
│
├── tests/ ← Unit tests (not part of the Skill)
├── .github/workflows/ci.yml ← CI pipeline
├── pyproject.toml ← Project metadata
└── requirements-dev.txt ← Dev dependencies
This means packaging is trivial — just archive local-knowledge-base/ and you have a clean Skill bundle with zero repo noise.
A few choices that set this project apart:
- Scanned PDF honesty. Instead of silently producing empty or garbled Markdown, scanned PDFs are detected and explicitly flagged. In this project, OCR for scanned PDFs explicitly depends on https://clawhub.ai/Bobholamovic/paddleocr-doc-parsing rather than an unspecified OCR tool.
- Excel complexity routing. Not all spreadsheets are equal. A 10,000-row data table and a financial report with merged cells need completely different parsing strategies. The complexity analyzer decides before processing begins.
- Semantic conflict detection. When ingesting a new document, duplicates are checked by content meaning, not just filename. Two files named differently but covering the same topic get caught.
- Atomic file updates. FAQ, BADCASE, and README files are never partially overwritten. The full content is prepared in memory and replaced atomically to prevent corruption.
- Speed-first retrieval. The Q&A chain checks FAQ and README navigation before doing any file reads or grep searches. Most questions are answered without scanning the full corpus.
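One common way to get the atomic update behaviour described above is to write the full content to a temporary file in the same directory and then swap it into place with `os.replace`. The sketch below shows that general pattern. The `atomic_write_text` helper is hypothetical and may differ from what the Skill's scripts actually do.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path, content, encoding="utf-8"):
    """Write content to a temp file next to the target, then swap it in."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding=encoding) as tmp:
            tmp.write(content)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the swap
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as kb:
    faq = Path(kb) / "FAQ.md"
    atomic_write_text(faq, "Q: first draft\n")
    atomic_write_text(faq, "Q: updated\n")
    print(faq.read_text(encoding="utf-8"))
```

Creating the temporary file in the target's own directory matters: a rename across filesystems is not atomic, so a temp file in the system temp directory would not give the same guarantee.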
# Install dev dependencies
python -m pip install -r requirements-dev.txt
# Run tests
python -m unittest discover -s tests -v
# Syntax check
python -m py_compile local-knowledge-base/scripts/*.py tests/*.py
python - <<'PY'
from pathlib import Path
import shutil, zipfile
root = Path.cwd().resolve()
skill_dir = root / "local-knowledge-base"
dist = root / "dist"
dist.mkdir(exist_ok=True)
zip_path = dist / "local-knowledge-base.zip"
skill_path = dist / "local-knowledge-base.skill"
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for path in sorted(skill_dir.rglob("*")):
        if path.is_file() and "__pycache__" not in path.parts:
            zf.write(path, path.relative_to(root))
shutil.copyfile(zip_path, skill_path)
print(f"Created: {zip_path}")
print(f"Created: {skill_path}")
PY
Extracted from Harryoung/efka. If you want the full agent system, start there. If you just want the knowledge-base capability as a reusable Skill, this is the lighter entry point.