8 changes: 4 additions & 4 deletions README.rst
@@ -47,17 +47,17 @@

🐲📚 **Documentation Search Engine** - An intelligent documentation search and
extraction tool that provides both a command-line interface for humans and an
MCP (Model Context Protocol) server for AI agents. Search across Sphinx and
MkDocs sites with fuzzy matching, extract clean markdown content, and integrate
seamlessly with AI development workflows.
MCP (Model Context Protocol) server for AI agents. Search across Sphinx,
MkDocs, Pydoctor, and Rustdoc sites with fuzzy matching, extract clean markdown
content, and integrate seamlessly with AI development workflows.


Key Features ⭐
===============================================================================

* 🔍 **Universal Search**: Fuzzy, exact, and regex search across documentation inventories and full content
* 🤖 **AI Agent Ready**: Built-in MCP server for seamless integration with Claude Code and other AI tools
* 📖 **Multi-Format Support**: Works with Sphinx (Furo, ReadTheDocs themes) and MkDocs (Material theme) sites
* 📖 **Multi-Format Support**: Works with Sphinx, MkDocs, Pydoctor, and Rustdoc sites
* 🚀 **High Performance**: In-memory caching with sub-second response times for repeated queries
* 🧹 **Clean Output**: High-quality HTML-to-Markdown conversion preserving code blocks and formatting
* 🎯 **Auto-Detection**: Automatically identifies documentation type without manual configuration
47 changes: 37 additions & 10 deletions documentation/architecture/openspec/project.md
@@ -1,31 +1,58 @@
# Project Context

## Purpose
[Describe your project's purpose and goals]
A Documentation Search Engine providing a CLI and MCP (Model Context Protocol) server for AI agents to search and extract documentation from Sphinx, MkDocs, Pydoctor, and Rustdoc sites. It aims to make technical documentation easily consumable by AI agents.

## Tech Stack
- [List your primary technologies]
- [e.g., TypeScript, React, Node.js]
- **Language**: Python 3.10+
- **Core Libraries**: `mcp`, `sphobjinv`, `beautifulsoup4`, `lxml`, `rapidfuzz`, `markdownify`, `rich`, `httpx`
- **CLI Framework**: `typer` (via `emcd-appcore`)
- **Build System**: `hatch`
- **Package Management**: `uv`
- **Testing**: `pytest`
- **Linting/Typing**: `ruff`, `isort`, `pyright`

## Project Conventions

### Filesystem Organization
The project follows a standard filesystem organization as detailed in [Filesystem Organization](../filesystem.rst).
- Source code is in `sources/`.
- Tests are in `tests/`.
- Documentation is in `documentation/`.
- The package uses a layered architecture (Interface, Business Logic, Processor, Infrastructure).

### Code Style
[Describe your code style preferences, formatting rules, and naming conventions]
- **Line Length**: 79 characters.
- **Indentation**: 4 spaces.
- **Imports**: Sorted by `isort`, separated by type (stdlib, third-party, first-party).
- **Typing**: `pyright` is used for static type checking. `mypy` is explicitly avoided.
- **Linting**: `ruff` is used for linting with a specific set of rules (Flake8, Pylint, etc.).
- **Import Hubs**: Uses `__/imports.py` for centralized imports within packages.

### Architecture Patterns
[Document your architectural decisions and patterns]
- **Layered Architecture**: Defined in `filesystem.rst` (Interface, Business Logic, Processor, Extension Management, Infrastructure).
- **Extension System**: Uses a plugin architecture for supporting additional documentation formats.

### Testing Strategy
[Explain your testing approach and requirements]
- **Framework**: `pytest`.
- **Naming**: Test files `test_*.py`, test functions `test_[0-9][0-9][0-9]_*`.
- **Coverage**: Tracked via `coverage`, with artifacts in `.auxiliary/artifacts/coverage-pytest`.
- **Doctests**: Executed via Sphinx.
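
As a quick illustration of the naming convention above, the following minimal sketch checks names against the two patterns, assuming they are read as shell-style globs. The helper `follows_convention` is hypothetical and is not part of the project.

```python
from fnmatch import fnmatchcase

# Patterns quoted from the convention above; treating them as
# shell-style globs is an assumption of this sketch.
FILE_PATTERN = "test_*.py"
FUNCTION_PATTERN = "test_[0-9][0-9][0-9]_*"


def follows_convention(file_name: str, function_name: str) -> bool:
    """Return True when both names match the documented patterns."""
    return (
        fnmatchcase(file_name, FILE_PATTERN)
        and fnmatchcase(function_name, FUNCTION_PATTERN)
    )
```

Under these assumptions, `follows_convention("test_cache.py", "test_010_hit")` is accepted, while a function named `test_hit` is rejected because it lacks the three-digit prefix.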

### Git Workflow
[Describe your branching strategy and commit conventions]
- **Changelog**: Uses `towncrier` to build `documentation/changelog.rst` from changelog fragments, which go in `.auxiliary/data/towncrier`.

## Domain Context
[Add domain-specific knowledge that AI assistants need to understand]
- **Sphinx**: Understands `objects.inv` inventory files.
- **MkDocs**: Understands `search_index.json`.
- **Pydoctor**: Understands Pydoctor inventory and structure formats.
- **Rustdoc**: Understands Rustdoc structure and search index.
- **MCP**: Implements Model Context Protocol for AI integration.
- **HTML to Markdown**: Converts documentation HTML to clean Markdown for AI consumption.
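
To make the Sphinx entry above concrete, here is a minimal standard-library sketch of reading a version 2 `objects.inv` file: four plain-text header lines followed by a zlib-compressed body with one object per line. The project lists `sphobjinv` as a core library and presumably relies on it; the function `parse_inventory` below is a hypothetical stand-in for illustration only.

```python
import re
import zlib

# One inventory entry per line: name, domain:role, priority, uri, dispname.
_ENTRY = re.compile(r"(.+?)\s+(\S+)\s+(-?\d+)\s+?(\S*)\s+(.*)")


def parse_inventory(data: bytes) -> list[dict[str, str]]:
    """Parse a version 2 Sphinx inventory into a list of entry dicts."""
    # Four header lines, then the compressed payload (which may itself
    # contain newline bytes, hence the bounded split).
    parts = data.split(b"\n", 4)
    if len(parts) != 5 or not parts[0].startswith(
        b"# Sphinx inventory version 2"
    ):
        raise ValueError("not a version 2 Sphinx inventory")
    payload = zlib.decompress(parts[4]).decode("utf-8")
    entries = []
    for line in payload.splitlines():
        match = _ENTRY.fullmatch(line.rstrip())
        if match is None:
            continue  # Skip lines that do not look like entries.
        name, domain_role, priority, uri, dispname = match.groups()
        domain, _, role = domain_role.partition(":")
        if uri.endswith("$"):
            uri = uri[:-1] + name  # "$" abbreviates the object name.
        if dispname == "-":
            dispname = name  # "-" means "same as the name".
        entries.append({
            "name": name, "domain": domain, "role": role,
            "priority": priority, "uri": uri, "dispname": dispname,
        })
    return entries
```

Because mkdocstrings and Pydoctor sites are described here as also publishing `objects.inv`, the same reader would plausibly apply to them, though that is an inference from this document rather than something the sketch verifies.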

## Important Constraints
[List any technical, business, or regulatory constraints]
- **Python Version**: Minimum 3.10.
- **License**: Apache-2.0.

## External Dependencies
[Document key external services, APIs, or systems]
- **Target Documentation Sites**: Requires internet access to fetch documentation from remote URLs.
44 changes: 35 additions & 9 deletions documentation/prd.rst
@@ -58,7 +58,7 @@ Goals and Objectives
===============================================================================

**Primary Objectives (Critical):**
1. **Unified Documentation Access**: Provide consistent interface for both Sphinx and MkDocs documentation sites
1. **Unified Documentation Access**: Provide consistent interface for Sphinx, MkDocs, Pydoctor, and Rustdoc documentation sites
2. **Advanced Search**: Enable fuzzy, exact, and regex-based search across documentation inventories and content
3. **MCP Integration**: Seamless integration with AI agents through Model Context Protocol
4. **Performance**: Fast response times with intelligent caching for frequently accessed documentation
@@ -71,7 +71,7 @@ Goals and Objectives

**Success Metrics:**
- Sub-second response times for cached inventory queries
- Support for 90%+ of popular Sphinx and MkDocs sites
- Support for 90%+ of popular Sphinx, MkDocs, Pydoctor, and Rustdoc sites
- Clean markdown output with preserved code formatting
- Successful integration with major MCP clients
- 90%+ test coverage with comprehensive edge case handling
@@ -137,7 +137,31 @@ Functional Requirements
- Handle mkdocstrings-specific content structure
- Filter out navigation and UI elements during extraction

**REQ-004: Search Functionality (Critical)**
**REQ-004: Pydoctor Documentation Processing (Critical)**
- **Priority**: Critical
- **Description**: Full support for Pydoctor documentation sites
- **User Story**: As a user, I want to search Pydoctor documentation so that I can access API documentation for Twisted and other Zope-stack projects
- **Acceptance Criteria**:

- Parse objects.inv files from Pydoctor sites
- Extract content from Pydoctor-generated HTML
- Convert HTML to Markdown with language-aware code blocks
- Handle Pydoctor-specific content structure
- Filter out navigation and UI elements during extraction

**REQ-005: Rustdoc Documentation Processing (Critical)**
- **Priority**: Critical
- **Description**: Full support for Rustdoc documentation sites
- **User Story**: As a user, I want to search Rustdoc documentation so that I can access API documentation for Rust crates
- **Acceptance Criteria**:

- Parse search-index.js files from Rustdoc sites
- Extract content from Rustdoc-generated HTML
- Convert HTML to Markdown with language-aware code blocks
- Handle Rustdoc-specific content structure
- Filter out navigation and UI elements during extraction

**REQ-006: Search Functionality (Critical)**
- **Priority**: Critical
- **Description**: Multiple search modes with configurable behavior
- **User Story**: As a user, I want to search documentation using different matching strategies so that I can find relevant content efficiently
@@ -150,7 +174,7 @@ Functional Requirements
- Filtering by domain, role, and custom processor filters
- Configurable result limits and detail levels
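
The three matching strategies named in this requirement (fuzzy, exact, regex) could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the function `search` and its parameters are hypothetical, "exact" is taken to mean case-sensitive substring containment, and the standard-library `difflib` stands in for `rapidfuzz`.

```python
import difflib
import re


def search(
    names: list[str],
    term: str,
    mode: str = "fuzzy",
    threshold: float = 0.6,
    limit: int = 10,
) -> list[str]:
    """Return up to ``limit`` names matching ``term`` under ``mode``."""
    if mode == "exact":
        # Assumption: "exact" means case-sensitive substring match.
        matches = [name for name in names if term in name]
    elif mode == "regex":
        pattern = re.compile(term)
        matches = [name for name in names if pattern.search(name)]
    elif mode == "fuzzy":
        scored = [
            (
                difflib.SequenceMatcher(
                    None, term.lower(), name.lower()
                ).ratio(),
                name,
            )
            for name in names
        ]
        # Keep names at or above the similarity threshold, best first.
        scored = [pair for pair in scored if pair[0] >= threshold]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        matches = [name for _, name in scored]
    else:
        raise ValueError(f"unknown search mode: {mode!r}")
    return matches[:limit]
```

With this sketch, a misspelled query such as `"pathlib.Pth"` still ranks `pathlib.Path` first in fuzzy mode, whereas exact mode returns nothing for it.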

**REQ-005: Caching System (High)**
**REQ-007: Caching System (High)**
- **Priority**: High
- **Description**: Intelligent caching to improve performance and reduce network requests
- **User Story**: As a user, I want fast response times for repeated queries so that my workflow is not interrupted
@@ -162,7 +186,7 @@ Functional Requirements
- Cache hit/miss metrics for optimization
- Configurable cache settings
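
One possible shape for an in-memory cache with hit/miss metrics and a configurable lifetime is sketched below. The class `MetricsCache`, its `ttl_seconds` setting, and the time-based expiry policy are all assumptions made for illustration; the requirement itself does not specify an eviction strategy.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Hashable


@dataclass
class MetricsCache:
    """In-memory cache that counts hits and misses (hypothetical)."""

    ttl_seconds: float = 300.0  # Assumed setting: entry lifetime.
    clock: Callable[[], float] = time.monotonic  # Injectable for tests.
    hits: int = 0
    misses: int = 0
    _entries: dict = field(default_factory=dict)

    def get(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        """Return the cached value, calling ``loader`` on a miss."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            self.hits += 1
            return entry[1]
        # Missing or expired: reload and remember when it was stored.
        self.misses += 1
        value = loader()
        self._entries[key] = (now, value)
        return value
```

A caller would wrap the expensive fetch in `loader`, for example a function that downloads an inventory, and could read `hits` and `misses` afterwards to tune `ttl_seconds`.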

**REQ-006: CLI Interface (High)**
**REQ-008: CLI Interface (High)**
- **Priority**: High
- **Description**: Human-usable command-line interface for testing and standalone use
- **User Story**: As a developer, I want to test librovore functionality from the command line so that I can validate behavior and debug issues
@@ -174,19 +198,21 @@ Functional Requirements
- Support for all MCP server capabilities
- Configuration file support for frequent use cases

**REQ-007: Processor Detection (High)**
**REQ-009: Processor Detection (High)**
- **Priority**: High
- **Description**: Automatic detection of appropriate processor for given documentation site
- **User Story**: As a user, I want the system to automatically determine the correct processor so that I don't need to specify the documentation type
- **Acceptance Criteria**:

- Detect Sphinx sites by robots.txt and objects.inv presence
- Detect MkDocs sites with mkdocstrings by objects.inv and site structure
- Detect Pydoctor sites by objects.inv and site structure
- Detect Rustdoc sites by search-index.js and site structure
- Graceful fallback when detection is ambiguous
- Clear error messages when no suitable processor is found
- Confidence scoring for processor selection
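
The acceptance criteria above (confidence scoring, a fallback for ambiguous cases, and a clear error when nothing fits) can be sketched as a small selection step. Everything named here is hypothetical: the `Detection` record, the `select_processor` function, and the `minimum` cutoff are illustrative choices, and how each processor arrives at its confidence value is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Result reported by one processor's detection step."""

    processor: str
    confidence: float  # Assumed range: 0.0 (no match) to 1.0 (certain).


def select_processor(
    detections: list[Detection], minimum: float = 0.5
) -> Detection:
    """Pick the most confident detection at or above ``minimum``."""
    viable = [d for d in detections if d.confidence >= minimum]
    if not viable:
        tried = ", ".join(d.processor for d in detections) or "none"
        raise LookupError(
            f"no suitable processor found (tried: {tried})"
        )
    # On a tie, max() keeps the earliest entry, so the caller's
    # ordering acts as the fallback for ambiguous cases.
    return max(viable, key=lambda d: d.confidence)
```

This matters for the newly added formats because several of them share the same signal: Sphinx, mkdocstrings, and Pydoctor sites are all detected partly through `objects.inv`, so site structure has to break the tie.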

**REQ-008: Content Quality (Medium)**
**REQ-010: Content Quality (Medium)**
- **Priority**: Medium
- **Description**: High-quality content extraction and formatting
- **User Story**: As a user, I want extracted content to be clean and well-formatted so that it's easily readable and usable
@@ -198,7 +224,7 @@ Functional Requirements
- Convert HTML tables to Markdown tables
- Handle images and media references appropriately

**REQ-009: Error Handling (Medium)**
**REQ-011: Error Handling (Medium)**
- **Priority**: Medium
- **Description**: Robust error handling and user feedback
- **User Story**: As a user, I want clear error messages when something goes wrong so that I can understand and resolve issues
@@ -210,7 +236,7 @@ Functional Requirements
- Detailed logging for debugging purposes
- Recovery from temporary service unavailability

**REQ-010: Plugin Architecture Foundation (Low)**
**REQ-012: Plugin Architecture Foundation (Low)**
- **Priority**: Low
- **Description**: Extensible architecture for additional documentation processors
- **User Story**: As a tool developer, I want to extend the system with custom processors so that I can support additional documentation formats