DocForest is a Python library for intelligently chunking structured documents like Markdown and AsciiDoc. It organizes document content into a recursive, tree-like structure, ensuring that each chunk retains its full contextual path from its parent headings. This makes it an ideal tool for RAG (Retrieval-Augmented Generation) systems, semantic search, and other NLP tasks.
- Hierarchical Chunking: Splits documents based on heading levels, preserving the logical structure.
- Context Preservation: Each section's content is linked to all its parent headings, providing rich context.
- Flexible Output: Generates a structured "forest" or "tree" that is easy to traverse and process.
- Support for Multiple Formats: Handles structured document types such as Markdown and AsciiDoc.
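
To make the idea behind hierarchical chunking and context preservation concrete, here is a minimal, standalone sketch. It is not DocForest's implementation and does not use its API: the `Chunk` class and the `chunk_markdown` function are hypothetical names introduced only for illustration. It assumes Markdown with ATX-style (`#`) headings and does not treat fenced code blocks specially.

```python
import re
from dataclasses import dataclass, field

# Hypothetical illustration only: not DocForest's actual implementation or API.
# Matches ATX headings such as "## Install" (1 to 6 leading '#' characters).
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    path: list[str]                      # headings from the top level down to this section
    lines: list[str] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(self.lines).strip()


def chunk_markdown(content: str) -> list[Chunk]:
    """Split Markdown on headings, tagging each chunk with its full heading path."""
    stack: list[tuple[int, str]] = []        # (level, title) of the currently open headings
    chunks: list[Chunk] = [Chunk(path=[])]   # holds any text before the first heading
    for line in content.splitlines():
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            # A new heading closes every open heading at the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
            chunks.append(Chunk(path=[title for _, title in stack]))
        else:
            chunks[-1].lines.append(line)
    # Sections with no body text are dropped in this sketch.
    return [chunk for chunk in chunks if chunk.text]


sample = "# Guide\nIntro text.\n## Install\nRun the installer.\n# FAQ\nNone yet."
chunks = chunk_markdown(sample)
for chunk in chunks:
    print(" > ".join(chunk.path), "|", chunk.text)
```

Each chunk carries the list of headings above it, so a retriever can prepend that path (for example `Guide > Install`) to the chunk text before embedding it. A real chunker would also need to skip `#` lines inside code fences and support other formats such as AsciiDoc.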
Install docforest from PyPI:
pip install docforest

from docforest import DocForest, DocStyle
# Create a DocForest instance with the desired document style
forest = DocForest(style=DocStyle.MARKDOWN)
# Chunk a document by passing in its content
forest.chunk(content="content")

This project is licensed under the MIT License. See the LICENSE file for details.