JerzyCode/fine-chunker


fine-chunker 🚀

Semantic text chunking using a fine-tuned ModernBERT model with recursive splitting and support for extremely long documents.

This library divides your text into meaningful segments based on semantic boundaries rather than just character counts or newline characters. It uses a token-classification approach where the model predicts the ideal points to "cut" the text.

The library and the model are still in early development, so expect some rough edges.
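To make the token-classification idea concrete, here is a minimal sketch of how per-token boundary scores could be turned into chunks. It does not call the library or the model: the function name `cut_at_boundaries`, the word-level offsets, and the scores are all made up for illustration.

```python
from typing import List, Tuple


def cut_at_boundaries(
    text: str,
    offsets: List[Tuple[int, int]],
    probs: List[float],
    threshold: float = 0.5,
) -> List[str]:
    """Split text at every token whose boundary score meets the threshold."""
    # Character positions where a chunk starts; position 0 always starts one.
    starts = [0]
    for (tok_start, _tok_end), prob in zip(offsets, probs):
        if prob >= threshold and tok_start > starts[-1]:
            starts.append(tok_start)
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]


text = "Cats purr. Dogs bark."
# Hypothetical word-level tokens given as (start, end) character offsets.
offsets = [(0, 4), (5, 10), (11, 15), (16, 21)]
# Made-up scores: a high value means "a new chunk starts at this token".
probs = [0.9, 0.1, 0.8, 0.2]

print(cut_at_boundaries(text, offsets, probs))
```

Raising the threshold makes cuts rarer, which is why a higher threshold yields fewer, larger chunks.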

Key Features

  • Fine-tuned ModernBERT: Uses a fine-tuned ModernBERT encoder optimized to detect semantic boundaries. Model details are available at: jboksa/modbert-chunker-base
  • Recursive Splitting: Automatically drills down into large chunks with decreasing thresholds to ensure everything fits your target size while remaining semantically coherent.
  • Long Text Support: Uses a sliding window to process documents of any length (books, reports, etc.) without losing context.
  • Hugging Face Integration: Zero configuration required - models and tokenizers are fetched automatically from the Hub.
  • Hardware Agnostic: Runs smoothly on CUDA (GPU) or CPU.
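The sliding-window idea behind long text support can be sketched as follows. This is an illustration under assumptions, not the library's implementation: the helper `sliding_windows` and the fixed `overlap` value are hypothetical, and the library describes its windows as semantic, so it may choose window edges differently.

```python
from typing import List, Tuple


def sliding_windows(
    n_tokens: int, window: int = 8000, overlap: int = 500
) -> List[Tuple[int, int]]:
    """Return (start, end) token spans that cover a sequence of n_tokens."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    if n_tokens <= 0:
        return []
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            break
        # Step back by `overlap` so neighbouring windows share some context.
        start = end - overlap
    return spans


print(sliding_windows(20000))
```

Each span would be scored separately by the model, and the shared tokens give the model some context on both sides of a window edge.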

Installation

Basic Installation

To install the fine-chunker package, you can use pip:

pip install fine-chunker

Or using uv:

uv add fine-chunker

Optional Dependencies

Depending on your use case, you may want to install additional dependencies:

  1. With PyTorch (GPU support): If you plan to use PyTorch with GPU support, install the package with the torch extras:

    pip install fine-chunker[torch]
  2. With PyTorch (CPU-only): If you plan to use PyTorch but only need CPU support, install the package with the torch-cpu extras:

    pip install fine-chunker[torch-cpu]
  3. With ONNX Runtime: If you plan to use ONNX for inference, install the package with the onnx extras:

    pip install fine-chunker[onnx]
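To check which inference backend is present in the current environment, a small standard-library probe like the one below may help. It assumes the extras install the usual top-level modules `torch` and `onnxruntime`; the helper `available_backends` is hypothetical and not part of fine-chunker.

```python
from importlib.util import find_spec
from typing import Dict


def available_backends() -> Dict[str, bool]:
    """Report which optional backends can be imported, without importing them."""
    # Module names are assumptions about what the extras pull in.
    return {name: find_spec(name) is not None for name in ("torch", "onnxruntime")}


print(available_backends())
```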

Development Installation

If you want to contribute to the development of fine-chunker, you can install the package with development dependencies:

pip install fine-chunker[dev]

This will include tools for building, testing, and debugging the package.

Quick Start

from fine_chunker import Chunker

text = """
1 Introduction Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15]. Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states ht, as a function of the previous hidden state ht−1 and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains. Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network. In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
    """

chunker = Chunker.from_pretrained(device="cpu", use_onnx=True, max_chunk_size=850)
chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"\nChunk {chunk.index} | size={len(chunk.content)}")
    print(chunk.content)

Result:

Chunk 0 | size=431
1 Introduction Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].

Chunk 1 | size=759
Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time , they generate a sequence of hidden states ht , as a function of the previous hidden state ht −1 and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths , as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.

Chunk 2 | size=731
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks , allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network. In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

Advanced Usage

You can adjust the chunking behavior using several parameters:

chunker = Chunker.from_pretrained(
    device="cuda",
    threshold_start=0.5,  # Initial boundary threshold (higher = fewer chunks)
    threshold_step=0.1,   # How much to lower threshold when a chunk is too big
    max_chunk_size=1000,  # Target maximum characters per chunk
    min_chunk_size=350,   # Minimum characters (merges small fragments)
    max_depth=3           # How many times to try splitting a single big chunk
)
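As a rough guide to how these parameters interact, the sketch below lists the thresholds that would be tried for an oversized chunk. It assumes the threshold drops by `threshold_step` once per recursion level and never goes below zero; the helper `threshold_schedule` is hypothetical, and the library's exact schedule may differ.

```python
from typing import List


def threshold_schedule(
    threshold_start: float = 0.5, threshold_step: float = 0.1, max_depth: int = 3
) -> List[float]:
    """Thresholds for the first pass (depth 0) and each further split attempt."""
    return [
        # Rounding hides floating-point noise such as 0.19999999999999996.
        round(max(threshold_start - depth * threshold_step, 0.0), 10)
        for depth in range(max_depth + 1)
    ]


print(threshold_schedule())
```

With the values from the example above, a chunk that is still too large after the first pass would be re-split at progressively lower thresholds until it fits or `max_depth` is reached.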

How it Works

  1. Windowing: If the text is extremely long, it's divided into semantic windows of ~8000 tokens.
  2. Prediction: The ModernBERT model identifies "start of chunk" tokens.
  3. Recursive Refinement: If a resulting chunk is larger than max_chunk_size, the library re-scans just that fragment with a lower threshold (reduced by threshold_step on each pass, for at most max_depth passes).
  4. Stability Merge: Finally, fragments smaller than min_chunk_size are merged with their neighbors to maintain a consistent chunk size for your RAG or LLM application.
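Steps 2 to 4 can be sketched on toy data as shown below. This is not the library's code: the sentences and their boundary scores are invented, every helper name is hypothetical, and a real run would score tokens with the model instead of using hard-coded numbers.

```python
from typing import List, Tuple

# (sentence, made-up probability that this sentence starts a new chunk)
Scored = List[Tuple[str, float]]


def size(group: Scored) -> int:
    """Character length of the group once its sentences are joined."""
    return len(" ".join(sentence for sentence, _ in group))


def split(group: Scored, threshold: float) -> List[Scored]:
    """Start a new group at every item whose score meets the threshold."""
    groups = [[group[0]]]
    for item in group[1:]:
        if item[1] >= threshold:
            groups.append([item])
        else:
            groups[-1].append(item)
    return groups


def refine(group: Scored, threshold: float, step: float,
           max_size: int, depth_left: int) -> List[Scored]:
    """Re-split an oversized group with a lower threshold, at most depth_left times."""
    if size(group) <= max_size or depth_left == 0 or len(group) == 1:
        return [group]
    pieces: List[Scored] = []
    for piece in split(group, threshold - step):
        pieces.extend(refine(piece, threshold - step, step, max_size, depth_left - 1))
    return pieces


def merge_small(groups: List[Scored], min_size: int) -> List[Scored]:
    """Merge a group into its left neighbour when either of the two is too small."""
    merged: List[Scored] = []
    for group in groups:
        if merged and (size(group) < min_size or size(merged[-1]) < min_size):
            merged[-1] = merged[-1] + group
        else:
            merged.append(group)
    return merged


def chunk(scored: Scored, threshold_start: float = 0.5, threshold_step: float = 0.1,
          max_chunk_size: int = 40, min_chunk_size: int = 10,
          max_depth: int = 3) -> List[str]:
    groups: List[Scored] = []
    for group in split(scored, threshold_start):
        groups.extend(
            refine(group, threshold_start, threshold_step, max_chunk_size, max_depth)
        )
    return [" ".join(s for s, _ in g) for g in merge_small(groups, min_chunk_size)]


scored = [
    ("Cats purr when they are content.", 1.0),
    ("They also knead blankets.", 0.35),
    ("Dogs bark at strangers.", 0.8),
    ("Yes.", 0.9),
]
for piece in chunk(scored):
    print(piece)
```

In this toy run the first two sentences stay together at threshold 0.5, exceed the 40-character limit, and only separate once the threshold has dropped below 0.35, while the tiny "Yes." fragment is folded back into its neighbour by the merge step.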

Author

Developed by Jerzy Boksa.

Contact: devjerzy@gmail.com

Model hosted at: jboksa/modbert-chunker-base
