willysk73/docforest

DocForest

DocForest is a Python library for intelligently chunking structured documents like Markdown and AsciiDoc. It organizes document content into a recursive, tree-like structure, ensuring that each chunk retains its full contextual path from its parent headings. This makes it an ideal tool for RAG (Retrieval-Augmented Generation) systems, semantic search, and other NLP tasks.


Features

  • Hierarchical Chunking: Splits documents based on heading levels, preserving the logical structure.
  • Context Preservation: Each section's content is linked to all its parent headings, providing rich context.
  • Flexible Output: Generates a structured "forest" or "tree" that is easy to traverse and process.
  • Support for Multiple Formats: Handles structured document formats such as Markdown and AsciiDoc.
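
The hierarchical chunking and context preservation described above can be illustrated with a small, self-contained sketch. This is not DocForest's implementation or API: the names Node, build_forest, and iter_chunks are hypothetical, and the sketch assumes ATX-style Markdown headings (lines starting with #). It ignores fenced code blocks and drops any text that appears before the first heading.

```python
import re
from dataclasses import dataclass, field

# Illustrative only: these names are hypothetical and are not DocForest's API.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Node:
    title: str
    level: int
    path: tuple = ()  # titles of all ancestor headings plus this one
    body: list = field(default_factory=list)
    children: list = field(default_factory=list)


def build_forest(markdown: str) -> list:
    """Parse ATX-style Markdown headings into a list of root nodes."""
    roots, stack = [], []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match is None:
            # Plain text belongs to the most recently opened section.
            if stack and line.strip():
                stack[-1].body.append(line.strip())
            continue
        level, title = len(match.group(1)), match.group(2)
        # Pop siblings and deeper sections until the parent is on top.
        while stack and stack[-1].level >= level:
            stack.pop()
        parent_path = stack[-1].path if stack else ()
        node = Node(title, level, parent_path + (title,))
        (stack[-1].children if stack else roots).append(node)
        stack.append(node)
    return roots


def iter_chunks(nodes):
    """Yield (heading path, text) pairs in document order."""
    for node in nodes:
        if node.body:
            yield " > ".join(node.path), " ".join(node.body)
        yield from iter_chunks(node.children)


doc = """# Guide
Intro text.
## Install
Run pip.
## Usage
### Basics
Call chunk.
"""

chunks = list(iter_chunks(build_forest(doc)))
for path, text in chunks:
    print(f"{path}: {text}")
```

Each chunk carries the titles of every heading above it, so a retrieval system can embed or display "Guide > Usage > Basics" together with the section text instead of the text alone.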

Installation

Install docforest from PyPI:

pip install docforest

Usage

from docforest import DocForest, DocStyle

# Create a DocForest instance with the desired document style
forest = DocForest(style=DocStyle.MARKDOWN)

# Chunk a document by passing its text as a string
markdown_text = "# Title\n\nSome text under the title."
forest.chunk(content=markdown_text)

License

This project is licensed under the MIT License. See the LICENSE file for details.
