Bump kreuzberg from 4.2.9 to 4.4.2 #78

Open
dependabot[bot] wants to merge 1 commit into master from dependabot/uv/kreuzberg-4.4.2

dependabot bot commented on behalf of github on Mar 6, 2026

Bumps kreuzberg from 4.2.9 to 4.4.2.

Release notes

Sourced from kreuzberg's releases.

Release v4.4.2

Fixed

  • E2E element type assertions: Fixed element type field name in E2E generator templates for Python, TypeScript, WASM Deno, Elixir, Ruby, PHP, and C#
  • Ruby PDF annotation extraction: Fixed PdfAnnotation and PdfAnnotationBoundingBox autoload and bounding box field name mismatch
  • WASM OCR blocking event loop: OCR now runs in a worker thread, keeping the main thread responsive
  • JPEG 2000 OCR decode failure: Shared load_image_for_ocr() helper with hayro-jpeg2000/hayro-jbig2 decoders across all OCR backends
  • WASM PDF empty content: PDFium initialization now properly awaited during initWasm()

Added

  • OMML-to-LaTeX math conversion for DOCX: Mathematical equations converted to LaTeX notation
  • Plain text output paths for all extractors: DOCX, PPTX, ODT, FB2, DocBook, RTF, Jupyter produce clean plain text when requested
  • cells_to_text() shared utility: Tab-separated plain text table formatter

Changed

  • CLI includes all features: kreuzberg-cli now uses full feature set including archives

See CHANGELOG.md for full details.

v4.4.1

Added

  • OCR table inlining into markdown content (#421): When output_format = Markdown and OCR detects tables, the markdown pipe tables are now inlined into result.content at their correct vertical positions instead of only appearing in result.tables. Adds OcrTableBoundingBox to OcrTable for spatial positioning. Sets metadata.output_format = "markdown" to signal pre-formatted content and skip re-conversion.
  • OCR table bounding boxes: OCR-detected tables now include bounding box coordinates (pixel-level) computed from TSV word positions, propagated through all bindings as Table.bounding_box.
  • OCR table test images: Added balance sheet and financial table test images from issue #421 for integration testing.
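The bounding-box entry above says table boxes are computed from TSV word positions. A minimal Python sketch of that idea follows, assuming each word row carries the `left`, `top`, `width`, and `height` columns that Tesseract TSV output provides. The `WordBox` and `BoundingBox` types and the `table_bounding_box` function are hypothetical names for illustration, not kreuzberg's API.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class WordBox:
    """One word row from TSV output, in pixels (hypothetical type)."""
    left: int
    top: int
    width: int
    height: int


@dataclass(frozen=True)
class BoundingBox:
    """Corner coordinates in pixels (hypothetical type)."""
    x0: int
    y0: int
    x1: int
    y1: int


def table_bounding_box(words: Iterable[WordBox]) -> Optional[BoundingBox]:
    """Return the smallest box enclosing every word, or None if there are no words."""
    boxes = list(words)
    if not boxes:
        return None
    return BoundingBox(
        x0=min(w.left for w in boxes),
        y0=min(w.top for w in boxes),
        x1=max(w.left + w.width for w in boxes),
        y1=max(w.top + w.height for w in boxes),
    )
```

The union of word boxes is only one plausible way to derive a table box; the actual implementation may filter or pad the words differently.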

Fixed

  • OCR test_tsv_row_to_element used wrong Tesseract level: Test specified level: 4 (Line) but asserted Word. Fixed to level: 5 (correct Tesseract word level).
  • MSG recipients missing email addresses: The MSG extractor read PR_DISPLAY_TO which contains only display names (e.g. "John Jennings"), losing email addresses entirely. Now reads recipient substorages (__recip_version1.0_#XXXXXXXX) with PR_EMAIL_ADDRESS and PR_RECIPIENT_TYPE to produce full "Name" <email> output with correct To/CC/BCC separation.
  • MSG date missing or incorrect: Date was parsed from PR_TRANSPORT_MESSAGE_HEADERS which is absent in many MSG files. Now reads PR_CLIENT_SUBMIT_TIME FILETIME directly from the MAPI properties stream, with fallback to transport headers.
  • EML date mangled for non-standard formats: mail_parser parsed ISO 8601 dates (e.g. 2025-07-29T12:42:06.000Z) into garbled output (2000-00-20T00:00:00Z) and replaced invalid dates with 2000-00-00T00:00:00Z. Now extracts the raw Date: header text from the email bytes, preserving the original value.
  • EML/MSG attachments line pollutes text output: build_email_text_output() appended an Attachments: ... line that doesn't represent message content. Removed from text output; attachment names remain in metadata.
  • HTML script/style tags leak in email fallback: The regex-based HTML cleaner for email bodies used .*? which doesn't match across newlines, allowing multiline <script>/<style> content to leak into extracted text. Added (?s) flag for dotall matching.
  • SVG CData content leaks JavaScript/CSS: Event::CData handler in the XML extractor didn't check SVG mode, causing <script> and <style> CDATA blocks to appear in SVG text output.
  • RTF parser leaks metadata noise into text: The RTF extractor did not skip known destination groups (fonttbl, stylesheet, colortbl, info, themedata, etc.) or ignorable destinations ({\*\...}), causing ~17KB of font tables, color definitions, and internal metadata to appear in extracted text.
  • RTF \u control word mishandled: Control words like \ul (underline) and \uc1 were incorrectly interpreted as Unicode escapes (\u + numeric param), producing garbage characters instead of being treated as formatting commands.
  • RTF paragraph breaks collapsed to spaces: \par control words emitted a single space instead of newlines, causing all paragraphs to merge into a single line. Now correctly emits double newlines for paragraph separation.
  • RTF whitespace normalization destroys paragraph structure: normalize_whitespace() treated newlines as whitespace and collapsed them to spaces. Rewritten to preserve newlines while collapsing runs of spaces within lines.
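Two of the fixes above come down to small text-processing details that are easy to reproduce. The Python sketch below illustrates them with the standard `re` module: a `.*?` pattern that fails on multiline `<script>`/`<style>` blocks until the `(?s)` flag is added, and a whitespace normalizer that collapses spaces without touching newlines. The patterns and the `collapse_spaces_keep_newlines` helper are illustrative assumptions, not the code used in kreuzberg.

```python
import re

# Without (?s), "." does not match "\n", so a block spanning lines is missed.
NAIVE_TAGS = re.compile(r"<(script|style)\b[^>]*>.*?</\1>", re.IGNORECASE)
# With (?s) (dotall), "." also matches newlines and the whole block is removed.
DOTALL_TAGS = re.compile(r"(?s)<(script|style)\b[^>]*>.*?</\1>", re.IGNORECASE)


def collapse_spaces_keep_newlines(text: str) -> str:
    """Collapse runs of spaces/tabs inside each line; leave line breaks intact."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    return "\n".join(lines)


html = "<p>Hello</p>\n<script>\nalert('x');\n</script>\n<p>World</p>"
naive_result = NAIVE_TAGS.sub("", html)    # script block survives
dotall_result = DOTALL_TAGS.sub("", html)  # script block removed
```

Here `naive_result` still contains the `alert('x');` line, while `dotall_result` does not.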

v4.4.0

Added

  • R language bindings -- Added kreuzberg R package via extendr with full extraction API (sync/async, batch, bytes), typed error conditions, S3 result class with accessors, config discovery, OCR/chunking configuration, plugin system, and 32 documentation snippets.
  • PHP async extraction: Non-blocking extraction via DeferredResult pattern with Tokio thread pool. Includes extractFileAsync(), extractBytesAsync(), batchExtractFilesAsync(), batchExtractBytesAsync() across OOP, procedural, and static APIs. Framework bridges for Amp v3+ (AmpBridge) and ReactPHP (ReactBridge).
  • C FFI distribution: Official C shared library (libkreuzberg) with cbindgen-generated header, cmake packaging (find_package(kreuzberg)), pkg-config support, and prebuilt binaries for Linux x86_64/aarch64, macOS arm64, and Windows x86_64. Includes full API reference documentation and test coverage.
  • Go FFI bindings: Go package (packages/go/v4) consuming the C FFI shared library with prebuilt binaries published as GitHub release assets for all four platforms.
  • C as 13th e2e test language: The e2e-generator now produces C test files exercising the FFI API, with 15 passing test cases.

... (truncated)

Changelog

Sourced from kreuzberg's changelog.

[4.4.2]

Fixed

  • E2E element type assertions: Fixed element type field name in E2E generator templates for Python, TypeScript, WASM Deno, Elixir, Ruby, PHP, and C#. Each binding uses different casing conventions (Python: dict key element_type, TypeScript/Node: elementType via NAPI camelCase, Elixir: atom-to-string conversion, C#: JSON serialization for snake_case wire value).
  • Ruby PDF annotation extraction: Fixed PdfAnnotation and PdfAnnotationBoundingBox classes not being registered in the autoload list, causing NameError when extracting PDF annotations. Also fixed bounding box field name mismatch between Rust output (x0/y0/x1/y1) and Ruby struct (left/top/right/bottom).
  • Ruby cyclomatic complexity: Refactored build_annotation_bbox in result.rb to extract repeated field lookup pattern, reducing cyclomatic complexity below threshold.
  • WASM OCR blocking event loop: The ocrRecognize() function in the WASM package was running synchronously on the main thread, blocking the Node.js event loop during image decoding and Tesseract OCR processing. This prevented timeouts and other async operations from firing while OCR was in progress. OCR now runs in a worker thread (Node.js worker_threads / browser Web Worker), keeping the main thread responsive.
  • JPEG 2000 OCR decode failure: JPEG 2000 images (jp2, jpx, jpm, mj2) and JBIG2 images failed with "The image format could not be determined" during PaddleOCR and WASM OCR because these code paths used the standard image crate which doesn't support JPEG 2000. A shared load_image_for_ocr() helper now detects JP2/J2K/JBIG2 formats by magic bytes and uses hayro-jpeg2000/hayro-jbig2 decoders across all OCR backends. The ocr-wasm feature now includes these decoders (pure Rust, WASM-compatible).
  • WASM PDF empty content: initWasm() fired off PDFium initialization asynchronously without awaiting it, causing a race condition where PDF extraction could start before PDFium was ready, returning empty content. PDFium initialization is now properly awaited during initWasm().
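The JPEG 2000 entry above mentions detecting JP2/J2K/JBIG2 input by magic bytes before choosing a decoder. As a rough Python illustration of that kind of sniffing, the sketch below checks the well-known leading signatures of each format. The `sniff_ocr_image_format` function and its return labels are hypothetical; this is not the `load_image_for_ocr()` helper itself.

```python
from typing import Optional

# JPEG 2000 family files (jp2, jpx, jpm, mj2) start with this signature box.
JP2_SIGNATURE = bytes.fromhex("0000000c6a5020200d0a870a")
# A raw JPEG 2000 codestream starts with the SOC marker followed by SIZ.
J2K_CODESTREAM = bytes.fromhex("ff4fff51")
# Standalone JBIG2 files start with this 8-byte ID string.
JBIG2_SIGNATURE = bytes.fromhex("974a42320d0a1a0a")


def sniff_ocr_image_format(data: bytes) -> Optional[str]:
    """Return a format label for inputs needing a special decoder, else None."""
    if data.startswith(JP2_SIGNATURE):
        return "jp2"
    if data.startswith(J2K_CODESTREAM):
        return "j2k"
    if data.startswith(JBIG2_SIGNATURE):
        return "jbig2"
    return None  # fall back to the general-purpose image decoder
```

A caller would route `"jp2"`/`"j2k"`/`"jbig2"` results to the dedicated decoders and everything else to the default image path.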

Added

  • OMML-to-LaTeX math conversion for DOCX: Mathematical equations in DOCX files (Office Math Markup Language) are now converted to LaTeX notation instead of being rendered as concatenated Unicode text. Supports superscripts, subscripts, fractions (\frac), radicals (\sqrt), n-ary operators (\sum, \int), delimiters, function names, accents, equation arrays, limits, bars, border boxes, matrices, and pre-sub-superscripts. Display math uses $$...$$ and inline math uses $...$ in markdown output. Plain text output includes raw LaTeX without delimiters.

  • Plain text output paths for all extractors: When OutputFormat::Plain or OutputFormat::Structured is requested, DOCX, PPTX, ODT, FB2, DocBook, RTF, and Jupyter extractors now produce clean plain text without markdown syntax (#, **, |, ![](image), - , etc.). Previously these extractors always emitted markdown regardless of the requested output format.

    • DOCX: Document::to_plain_text() skips heading prefixes, inline formatting markers, image placeholders, and renders footnotes/endnotes as id: text instead of [^id]: text.
    • PPTX: ContentBuilder respects plain mode — skips # title prefix, image markers, list markers, and uses Notes: instead of ### Notes:.
    • ODT: Heading prefixes (# ), list markers (- ), and pipe-delimited tables conditionally omitted for plain text.
    • FB2/FictionBook: Inline markers (*, **, `, ~~), heading prefixes, and cite prefixes skipped for plain text.
    • DocBook: Section title prefixes, code fences, list markers, blockquote prefixes, bold figure captions, and pipe tables all conditionally omitted.
    • RTF: Table output in result string uses tab separation instead of pipe-delimited markdown. Image ![image](...) markers omitted for plain text.
    • Jupyter: Skips text/markdown and text/html output types in plain mode, preferring text/plain.
  • cells_to_text() shared utility: Tab-separated plain text table formatter alongside existing cells_to_markdown(). Used by DOCX, PPTX, ODT, RTF, and DocBook extractors for plain text table rendering.
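To make the difference between the two table formatters named above concrete, here is a small Python sketch of a tab-separated formatter next to a pipe-table formatter. Both functions are hypothetical re-creations for illustration (the `_sketch` suffix is deliberate); the real `cells_to_text()` and `cells_to_markdown()` may handle escaping, padding, and empty tables differently.

```python
from typing import List


def cells_to_text_sketch(cells: List[List[str]]) -> str:
    """Plain text: one row per line, cells separated by tabs."""
    return "\n".join("\t".join(row) for row in cells)


def cells_to_markdown_sketch(cells: List[List[str]]) -> str:
    """Markdown pipe table: the first row is treated as the header."""
    if not cells:
        return ""
    header, *body = cells
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


cells = [["Item", "Qty"], ["Apples", "3"]]
plain = cells_to_text_sketch(cells)
markdown = cells_to_markdown_sketch(cells)
```

The plain variant carries no markdown syntax at all, which is the property the plain text output paths rely on.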

Changed

  • CLI includes all features: kreuzberg-cli now depends on kreuzberg with the full feature set instead of a separate cli subset. The cli feature group has been removed from kreuzberg. This ensures the CLI supports all formats including archives (7z, tar, gz, zip).

Fixed

  • Alpine/musl CLI Docker image: Fixed "Dynamic loading not supported" error when running kreuzberg-cli in Alpine containers. The CLI binary is now dynamically linked against musl libc, enabling runtime library loading for PDF processing.
  • R package Windows installation: Improved Python detection in configure script for Windows environments (added py launcher and RETICULATE_PYTHON support). Symlink extraction errors during source package installation are now handled gracefully.
  • PHP 8.5 precompiled extension binaries: Added PHP 8.5 support alongside existing PHP 8.4 in CI and release workflows.
  • OCR DPI normalization: The normalize_image_dpi() preprocessing logic is now integrated into the OCR pipeline. Images are normalized to the configured target DPI before being passed to Tesseract, and the calculated DPI is set via set_source_resolution(). This eliminates the "Estimating resolution as ..." warning and improves OCR accuracy for images with non-standard DPI.
  • HTML metadata extraction with Plain output: Fixed HTML metadata (headers, links, images, structured data) not being collected when using OutputFormat::Plain (the default). The underlying library's plain text fast path skips metadata extraction; kreuzberg now uses Markdown format internally for metadata collection and converts to plain text separately.
  • PPTX text run spacing: Adjacent text runs within paragraphs are now joined with smart spacing instead of being concatenated directly ("HelloWorld" → "Hello World").
  • CSV Shift-JIS/cp932 encoding detection: encoding_rs is now a non-optional dependency. CSV files with Shift-JIS encoding are correctly decoded instead of producing mojibake. Fallback encoding detection tries common encodings (Shift-JIS, cp932, windows-1252, iso-8859-1, gb18030, big5).
  • EML multipart body extraction: All text/html body parts are now extracted by iterating over all indices instead of only index 0. Nested message/rfc822 parts in multipart/digest are recursively extracted.
  • EPUB media tag leakage: <video>, <audio>, <source>, <track>, <object>, <embed>, <iframe> tags no longer leak into extracted text. Added <br> → newline and <hr> → newline handling.
  • FB2 poem extraction: Added support for <poem>, <stanza>, and <v> (verse) elements. Previously poetry content was silently dropped.
  • FB2 Unicode sub/superscript: Characters inside <sup> and <sub> are converted to Unicode equivalents. Added strikethrough support, horizontal rules for <empty-line>, and footnote extraction from notes body.
  • ODT StarMath-to-Unicode conversion: Mathematical formulas in ODT files are now converted to Unicode equivalents (Greek letters, operators, super/subscripts) instead of raw StarMath syntax.
  • BibTeX output format: Output now uses @type{key, field = {value}} format matching standard BibTeX conventions.
  • LaTeX display math: \[...\] display math environments are converted to $...$ format.
  • RST directive preservation: Field lists, directive markers, and .. code-block:: directives are preserved in extracted text.
  • RTF table cell separators: Plain mode now uses pipe delimiters for table cells instead of tabs.
  • Typst extraction improvements: Layout directives stripped, headings output as plain text, tables extracted with column-aware layout, links output as display text only.
  • DOCX field codes refined: Field instructions (between begin and separate) are now skipped while field results (between separate and end) are preserved. Previously all content between field begin/end was dropped, losing visible text like "Figure 1:" and page numbers.

... (truncated)
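The CSV entry above describes fallback encoding detection that tries a list of common encodings. A naive Python sketch of such a fallback chain is shown below, using the encodings in the order the changelog lists them. Trying UTF-8 first and using strict decoding as the acceptance test are assumptions of this sketch; kreuzberg's own detection is built on `encoding_rs` and may rely on different heuristics.

```python
from typing import Tuple

# Python codec names for the encodings listed in the changelog entry.
FALLBACK_ENCODINGS = ("shift_jis", "cp932", "cp1252", "iso-8859-1", "gb18030", "big5")


def decode_with_fallback(data: bytes) -> Tuple[str, str]:
    """Return (text, encoding_name) using the first encoding that decodes cleanly."""
    try:
        # Assumption: valid UTF-8 is accepted before any legacy encoding is tried.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    for name in FALLBACK_ENCODINGS:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    # Not reached in practice: iso-8859-1 maps every byte, so the loop always returns.
    return data.decode("utf-8", errors="replace"), "utf-8"


raw = "名前,値\n".encode("shift_jis")
text, encoding = decode_with_fallback(raw)
```

Because `iso-8859-1` accepts any byte sequence, the entries after it are never reached in a strict first-success chain like this one, which is one reason real detectors usually score candidates instead of stopping at the first clean decode.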

Commits
  • 828d42e fix: restore referenced test documents, cargo fmt, and uv.lock
  • f5e7c04 Merge pull request #433 from kreuzberg-dev/docs/rust-python-api-docs
  • ac2f769 feat: add extracted_keywords, quality_score, processing_warnings, and `...
  • 58db599 feat: Document new ExtractionResult fields, DocumentMetadata fields, `Ext...
  • 0c356b7 chore: clean up unused test documents
  • 34a7bf7 chore: updated test documents
  • d90493b release: v4.4.2
  • 01abbc7 fix: Ruby PDF annotation extraction and E2E element type assertions
  • 13871e0 fix: JPEG 2000 OCR decode, WASM PDFium init race, C# mj2 MIME
  • bb4d348 fix: WASM OCR blocking event loop + benchmark harness fixes
  • Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Bumps [kreuzberg](https://github.com/kreuzberg-dev/kreuzberg) from 4.2.9 to 4.4.2.
- [Release notes](https://github.com/kreuzberg-dev/kreuzberg/releases)
- [Changelog](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md)
- [Commits](kreuzberg-dev/kreuzberg@v4.2.9...v4.4.2)

---
updated-dependencies:
- dependency-name: kreuzberg
  dependency-version: 4.4.2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
dependabot bot added the dependencies (Pull requests that update a dependency file) and python (Pull requests that update python code) labels on Mar 6, 2026