Semantic text chunking using a fine-tuned ModernBERT model with recursive splitting and support for extremely long documents.
This library divides your text into meaningful segments based on semantic boundaries rather than just character counts or newline characters. It uses a token-classification approach where the model predicts the ideal points to "cut" the text.
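To make the token-classification idea concrete, here is a minimal sketch of how per-token boundary scores could be turned into chunks. It does not call the library or the model: the function name `split_at_boundaries`, the `offsets` and `scores` inputs, and the hand-written values are all assumptions standing in for real model output.

```python
from typing import List, Tuple


def split_at_boundaries(
    text: str,
    offsets: List[Tuple[int, int]],
    scores: List[float],
    threshold: float = 0.5,
) -> List[str]:
    """Cut `text` before every token whose boundary score reaches `threshold`.

    `offsets` holds the (start, end) character span of each token and
    `scores` a hypothetical per-token "start of chunk" probability.
    """
    # A cut at position 0 would produce an empty first chunk, so skip it.
    cut_points = [
        start
        for (start, _end), score in zip(offsets, scores)
        if score >= threshold and start > 0
    ]
    chunks = []
    previous = 0
    for cut in cut_points:
        chunks.append(text[previous:cut].strip())
        previous = cut
    chunks.append(text[previous:].strip())
    return [chunk for chunk in chunks if chunk]


text = "Cats purr. Dogs bark. Stocks fell today."
# Hand-written word-level offsets and scores (illustrative, not model output).
offsets = [(0, 4), (5, 10), (11, 15), (16, 21), (22, 28), (29, 33), (34, 40)]
scores = [0.9, 0.1, 0.3, 0.1, 0.8, 0.1, 0.1]

chunks = split_at_boundaries(text, offsets, scores, threshold=0.5)
finer_chunks = split_at_boundaries(text, offsets, scores, threshold=0.3)
```

Lowering the threshold in this sketch admits weaker boundaries, so the same text yields more, smaller chunks.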
The library and model are still in early development, so expect some rough edges.
- Fine-tuned ModernBERT: Uses a ModernBERT encoder fine-tuned to detect semantic boundaries. More details about the models are available at: jboksa/modbert-chunker-base
- Recursive Splitting: Automatically drills down into large chunks with decreasing thresholds to ensure everything fits your target size while remaining semantically coherent.
- Long Text Support: Implements an intelligent sliding window system to process documents of any length (books, reports, etc.) without losing context.
- Hugging Face Integration: Zero configuration required - models and tokenizers are fetched automatically from the Hub.
- Hardware Agnostic: Runs smoothly on CUDA (GPU) or CPU.
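As a rough illustration of the long-text idea, the sketch below covers a token sequence with overlapping windows. It is a plain fixed-stride window, not the library's own windowing logic; the function name `sliding_windows` and the `overlap` value are assumptions made for this example.

```python
from typing import Iterator, Tuple


def sliding_windows(
    num_tokens: int, window_size: int = 8000, overlap: int = 500
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) token ranges that cover `num_tokens` with overlap."""
    if window_size <= 0 or not 0 <= overlap < window_size:
        raise ValueError("need window_size > 0 and 0 <= overlap < window_size")
    start = 0
    while start < num_tokens:
        end = min(start + window_size, num_tokens)
        yield start, end
        if end == num_tokens:
            break
        # Step back by `overlap` so neighbouring windows share some context.
        start = end - overlap


windows = list(sliding_windows(num_tokens=20000, window_size=8000, overlap=500))
```

Each window would be scored separately, and the overlap gives tokens near a window edge some surrounding context in at least one pass.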
Basic Installation
To install the fine-chunker package, you can use pip:
pip install fine-chunker
Or using uv:
uv add fine-chunker
Depending on your use case, you may want to install additional dependencies:
- With PyTorch (GPU support): If you plan to use PyTorch with GPU support, install the package with the torch extras:
  pip install fine-chunker[torch]
- With PyTorch (CPU-only): If you plan to use PyTorch but only need CPU support, install the package with the torch-cpu extras:
  pip install fine-chunker[torch-cpu]
- With ONNX Runtime: If you plan to use ONNX for inference, install the package with the onnx extras:
  pip install fine-chunker[onnx]
If you want to contribute to the development of fine-chunker, you can install the package with development dependencies:
pip install fine-chunker[dev]
This will include tools for building, testing, and debugging the package.
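After installing, it can help to check which optional backends are importable. The sketch below uses only the standard library; it assumes the torch and torch-cpu extras provide the `torch` module and the onnx extra provides `onnxruntime`, which is not confirmed here. The helper name `available_backends` is hypothetical.

```python
from importlib.util import find_spec


def available_backends() -> dict:
    """Report which optional inference backends can be imported.

    The module names are assumptions about what the extras install.
    """
    return {
        "torch": find_spec("torch") is not None,
        "onnxruntime": find_spec("onnxruntime") is not None,
    }


backends = available_backends()
print(backends)
```

The result could be used, for example, to decide whether to pass `use_onnx=True` when creating the chunker.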
from fine_chunker import Chunker
text = """
1 Introduction Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15]. Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states ht, as a function of the previous hidden state ht−1 and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains. Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network. In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
"""
chunker = Chunker.from_pretrained(device="cpu", use_onnx=True, max_chunk_size=850)
chunks = chunker.chunk(text)
for chunk in chunks:
    print(f"\nChunk {chunk.index} | size={len(chunk.content)}")
    print(chunk.content)

Result:
Chunk 0 | size=431
1 Introduction Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].
Chunk 1 | size=759
Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time , they generate a sequence of hidden states ht , as a function of the previous hidden state ht −1 and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths , as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.
Chunk 2 | size=731
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks , allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network. In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
You can fine-tune the chunking behavior using several parameters:
chunker = Chunker.from_pretrained(
    device="cuda",
    threshold_start=0.5,  # Starting sensitivity (higher = fewer chunks)
    threshold_step=0.1,   # How much to lower the threshold when a chunk is too big
    max_chunk_size=1000,  # Target maximum characters per chunk
    min_chunk_size=350,   # Minimum characters (merges small fragments)
    max_depth=3,          # How many times to try splitting a single big chunk
)

- Windowing: If the text is extremely long, it's divided into semantic windows of ~8000 tokens.
- Prediction: The ModernBERT model identifies "start of chunk" tokens.
- Recursive Refinement: If a resulting chunk is larger than max_chunk_size, the library re-scans just that fragment with a lower sensitivity threshold.
- Stability Merge: Finally, very small fragments are merged with their neighbors to maintain a consistent chunk size for your RAG or LLM application.
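The refinement and merge steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's implementation: `toy_scorer` is a hypothetical stand-in for the model, and `refine` and `merge_small` are made-up names whose parameters only loosely mirror `threshold_start`, `threshold_step`, `max_chunk_size`, `min_chunk_size`, and `max_depth`.

```python
from typing import Callable, List, Tuple

# Hypothetical scorer type: returns (char_position, score) cut candidates.
Scorer = Callable[[str], List[Tuple[int, float]]]


def toy_scorer(text: str) -> List[Tuple[int, float]]:
    """Stand-in for the model: strong score after blank lines, weak after sentences."""
    candidates = []
    for i in range(2, len(text)):
        if text[i - 2:i] == "\n\n":
            candidates.append((i, 0.9))
        elif text[i - 2:i] == ". ":
            candidates.append((i, 0.45))
    return candidates


def refine(text: str, scorer: Scorer, threshold: float, step: float,
           max_size: int, max_depth: int) -> List[str]:
    """Split oversized text, lowering the threshold on each re-scan."""
    if len(text) <= max_size or max_depth == 0:
        return [text]
    cuts = sorted(pos for pos, score in scorer(text) if score >= threshold)
    bounds = [0] + cuts + [len(text)]
    pieces = [text[a:b] for a, b in zip(bounds, bounds[1:])]
    result = []
    for piece in pieces:
        # Pieces that are still too large get re-scanned with a lower threshold.
        result.extend(refine(piece, scorer, threshold - step, step,
                             max_size, max_depth - 1))
    return result


def merge_small(chunks: List[str], min_size: int) -> List[str]:
    """Merge fragments shorter than `min_size` into a neighbouring chunk."""
    merged: List[str] = []
    for chunk in chunks:
        if merged and len(merged[-1]) < min_size:
            merged[-1] += chunk  # previous fragment was too small: absorb the next
        else:
            merged.append(chunk)
    if len(merged) > 1 and len(merged[-1]) < min_size:
        last = merged.pop()  # a small trailing fragment joins its predecessor
        merged[-1] += last
    return merged


text = "A" * 30 + ". " + "B" * 30 + ". " + "C" * 5 + ".\n\n" + "D" * 20 + "."
refined = refine(text, toy_scorer, threshold=0.5, step=0.1, max_size=40, max_depth=3)
merged = merge_small(refined, min_size=10)
```

In this toy run the first pass cuts only at the blank line, the oversized first piece is re-scanned at a lower threshold and cut at sentence ends, and the short fragment is then merged with its neighbor. Note that this simple merge can push a chunk past `max_size`; a real implementation would need to decide how to balance the two limits.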
Developed by Jerzy Boksa.
Contact: devjerzy@gmail.com
Model hosted at: jboksa/modbert-chunker-base