diff --git a/README.rst b/README.rst
index fcdb26e..a91ec0d 100644
--- a/README.rst
+++ b/README.rst
@@ -47,9 +47,9 @@

 🐲📚 **Documentation Search Engine** - An intelligent documentation search and
 extraction tool that provides both a command-line interface for humans and an
-MCP (Model Context Protocol) server for AI agents. Search across Sphinx and
-MkDocs sites with fuzzy matching, extract clean markdown content, and integrate
-seamlessly with AI development workflows.
+MCP (Model Context Protocol) server for AI agents. Search across Sphinx,
+MkDocs, Pydoctor, and Rustdoc sites with fuzzy matching, extract clean markdown
+content, and integrate seamlessly with AI development workflows.


 Key Features ⭐
@@ -57,7 +57,7 @@ Key Features ⭐

 * 🔍 **Universal Search**: Fuzzy, exact, and regex search across documentation inventories and full content
 * 🤖 **AI Agent Ready**: Built-in MCP server for seamless integration with Claude Code and other AI tools
-* 📖 **Multi-Format Support**: Works with Sphinx (Furo, ReadTheDocs themes) and MkDocs (Material theme) sites
+* 📖 **Multi-Format Support**: Works with Sphinx, MkDocs, Pydoctor, and Rustdoc sites
 * 🚀 **High Performance**: In-memory caching with sub-second response times for repeated queries
 * 🧹 **Clean Output**: High-quality HTML-to-Markdown conversion preserving code blocks and formatting
 * 🎯 **Auto-Detection**: Automatically identifies documentation type without manual configuration
diff --git a/documentation/architecture/openspec/project.md b/documentation/architecture/openspec/project.md
index 3da5119..94681ff 100644
--- a/documentation/architecture/openspec/project.md
+++ b/documentation/architecture/openspec/project.md
@@ -1,31 +1,58 @@
 # Project Context

 ## Purpose
-[Describe your project's purpose and goals]
+A Documentation Search Engine providing a CLI and MCP (Model Context Protocol) server for AI agents to search and extract documentation from Sphinx, MkDocs, Pydoctor, and Rustdoc sites. It aims to make technical documentation easily consumable by AI agents.

 ## Tech Stack
-- [List your primary technologies]
-- [e.g., TypeScript, React, Node.js]
+- **Language**: Python 3.10+
+- **Core Libraries**: `mcp`, `sphobjinv`, `beautifulsoup4`, `lxml`, `rapidfuzz`, `markdownify`, `rich`, `httpx`
+- **CLI Framework**: `typer` (via `emcd-appcore`)
+- **Build System**: `hatch`
+- **Package Management**: `uv`
+- **Testing**: `pytest`
+- **Linting/Typing**: `ruff`, `isort`, `pyright`

 ## Project Conventions

+### Filesystem Organization
+The project follows a standard filesystem organization as detailed in [Filesystem Organization](../filesystem.rst).
+- Source code is in `sources/`.
+- Tests are in `tests/`.
+- Documentation is in `documentation/`.
+- The package uses a layered architecture (Interface, Business Logic, Processor, Infrastructure).
+
 ### Code Style
-[Describe your code style preferences, formatting rules, and naming conventions]
+- **Line Length**: 79 characters.
+- **Indentation**: 4 spaces.
+- **Imports**: Sorted by `isort`, separated by type (stdlib, third-party, first-party).
+- **Typing**: `pyright` is used for static type checking. `mypy` is explicitly avoided.
+- **Linting**: `ruff` is used for linting with a specific set of rules (Flake8, Pylint, etc.).
+- **Import Hubs**: Uses `__/imports.py` for centralized imports within packages.

 ### Architecture Patterns
-[Document your architectural decisions and patterns]
+- **Layered Architecture**: Defined in `filesystem.rst` (Interface, Business Logic, Processor, Extension Management, Infrastructure).
+- **Extension System**: Uses a plugin architecture for supporting additional documentation formats.

 ### Testing Strategy
-[Explain your testing approach and requirements]
+- **Framework**: `pytest`.
+- **Naming**: Test files `test_*.py`, test functions `test_[0-9][0-9][0-9]_*`.
+- **Coverage**: tracked via `coverage`, artifacts in `.auxiliary/artifacts/coverage-pytest`.
+- **Doctests**: Executed via Sphinx.

 ### Git Workflow
-[Describe your branching strategy and commit conventions]
+- **Changelog**: Uses `towncrier` for managing changelog fragments in `documentation/changelog.rst`. Fragments go in `.auxiliary/data/towncrier`.

 ## Domain Context
-[Add domain-specific knowledge that AI assistants need to understand]
+- **Sphinx**: Understands `objects.inv` inventory files.
+- **MkDocs**: Understands `search_index.json`.
+- **Pydoctor**: Understands Pydoctor inventory and structure formats.
+- **Rustdoc**: Understands Rustdoc structure and search index.
+- **MCP**: Implements Model Context Protocol for AI integration.
+- **HTML to Markdown**: Converts documentation HTML to clean Markdown for AI consumption.

 ## Important Constraints
-[List any technical, business, or regulatory constraints]
+- **Python Version**: Minimum 3.10.
+- **License**: Apache-2.0.

 ## External Dependencies
-[Document key external services, APIs, or systems]
+- **Target Documentation Sites**: Requires internet access to fetch documentation from remote URLs.
diff --git a/documentation/prd.rst b/documentation/prd.rst
index 197b83b..9fcb9a6 100644
--- a/documentation/prd.rst
+++ b/documentation/prd.rst
@@ -58,7 +58,7 @@ Goals and Objectives
 ===============================================================================

 **Primary Objectives (Critical):**
-1. **Unified Documentation Access**: Provide consistent interface for both Sphinx and MkDocs documentation sites
+1. **Unified Documentation Access**: Provide consistent interface for Sphinx, MkDocs, Pydoctor, and Rustdoc documentation sites
 2. **Advanced Search**: Enable fuzzy, exact, and regex-based search across documentation inventories and content
 3. **MCP Integration**: Seamless integration with AI agents through Model Context Protocol
 4. **Performance**: Fast response times with intelligent caching for frequently accessed documentation
@@ -71,7 +71,7 @@ Goals and Objectives

 **Success Metrics:**
 - Sub-second response times for cached inventory queries
-- Support for 90%+ of popular Sphinx and MkDocs sites
+- Support for 90%+ of popular Sphinx, MkDocs, Pydoctor, and Rustdoc sites
 - Clean markdown output with preserved code formatting
 - Successful integration with major MCP clients
 - 90%+ test coverage with comprehensive edge case handling
@@ -137,7 +137,31 @@ Functional Requirements
   - Handle mkdocstrings-specific content structure
   - Filter out navigation and UI elements during extraction

-**REQ-004: Search Functionality (Critical)**
+**REQ-004: Pydoctor Documentation Processing (Critical)**
+- **Priority**: Critical
+- **Description**: Full support for Pydoctor documentation sites
+- **User Story**: As a user, I want to search Pydoctor documentation so that I can access API documentation for Twisted and other Zope-stack projects
+- **Acceptance Criteria**:
+
+  - Parse objects.inv files from Pydoctor sites
+  - Extract content from Pydoctor-generated HTML
+  - Convert HTML to Markdown with language-aware code blocks
+  - Handle Pydoctor-specific content structure
+  - Filter out navigation and UI elements during extraction
+
+**REQ-005: Rustdoc Documentation Processing (Critical)**
+- **Priority**: Critical
+- **Description**: Full support for Rustdoc documentation sites
+- **User Story**: As a user, I want to search Rustdoc documentation so that I can access API documentation for Rust crates
+- **Acceptance Criteria**:
+
+  - Parse search-index.js files from Rustdoc sites
+  - Extract content from Rustdoc-generated HTML
+  - Convert HTML to Markdown with language-aware code blocks
+  - Handle Rustdoc-specific content structure
+  - Filter out navigation and UI elements during extraction
+
+**REQ-006: Search Functionality (Critical)**
 - **Priority**: Critical
 - **Description**: Multiple search modes with configurable behavior
 - **User Story**: As a user, I want to search documentation using different matching strategies so that I can find relevant content efficiently
@@ -150,7 +174,7 @@ Functional Requirements
   - Filtering by domain, role, and custom processor filters
   - Configurable result limits and detail levels

-**REQ-005: Caching System (High)**
+**REQ-007: Caching System (High)**
 - **Priority**: High
 - **Description**: Intelligent caching to improve performance and reduce network requests
 - **User Story**: As a user, I want fast response times for repeated queries so that my workflow is not interrupted
@@ -162,7 +186,7 @@ Functional Requirements
   - Cache hit/miss metrics for optimization
   - Configurable cache settings

-**REQ-006: CLI Interface (High)**
+**REQ-008: CLI Interface (High)**
 - **Priority**: High
 - **Description**: Human-usable command-line interface for testing and standalone use
 - **User Story**: As a developer, I want to test librovore functionality from the command line so that I can validate behavior and debug issues
@@ -174,7 +198,7 @@ Functional Requirements
   - Support for all MCP server capabilities
   - Configuration file support for frequent use cases

-**REQ-007: Processor Detection (High)**
+**REQ-009: Processor Detection (High)**
 - **Priority**: High
 - **Description**: Automatic detection of appropriate processor for given documentation site
 - **User Story**: As a user, I want the system to automatically determine the correct processor so that I don't need to specify the documentation type
@@ -182,11 +206,13 @@ Functional Requirements

   - Detect Sphinx sites by robots.txt and objects.inv presence
   - Detect MkDocs sites with mkdocstrings by objects.inv and site structure
+  - Detect Pydoctor sites by objects.inv and site structure
+  - Detect Rustdoc sites by search-index.js and site structure
   - Graceful fallback when detection is ambiguous
   - Clear error messages when no suitable processor is found
   - Confidence scoring for processor selection

-**REQ-008: Content Quality (Medium)**
+**REQ-010: Content Quality (Medium)**
 - **Priority**: Medium
 - **Description**: High-quality content extraction and formatting
 - **User Story**: As a user, I want extracted content to be clean and well-formatted so that it's easily readable and usable
@@ -198,7 +224,7 @@ Functional Requirements
   - Convert HTML tables to Markdown tables
   - Handle images and media references appropriately

-**REQ-009: Error Handling (Medium)**
+**REQ-011: Error Handling (Medium)**
 - **Priority**: Medium
 - **Description**: Robust error handling and user feedback
 - **User Story**: As a user, I want clear error messages when something goes wrong so that I can understand and resolve issues
@@ -210,7 +236,7 @@ Functional Requirements
   - Detailed logging for debugging purposes
   - Recovery from temporary service unavailability

-**REQ-010: Plugin Architecture Foundation (Low)**
+**REQ-012: Plugin Architecture Foundation (Low)**
 - **Priority**: Low
 - **Description**: Extensible architecture for additional documentation processors
 - **User Story**: As a tool developer, I want to extend the system with custom processors so that I can support additional documentation formats