feat(plugin-knowledge): migrate from pdfjs-dist to unpdf for universa…#46

Merged
0xbbjoker merged 1 commit into 1.x from feat/migrate-to-unpdf-parser
Nov 10, 2025

@0xbbjoker 0xbbjoker commented Nov 7, 2025

Replace pdfjs-dist with unpdf to enable PDF text extraction in browser,
serverless, and Node.js environments without worker configuration.

Changes:

  • Replaced pdfjs-dist (700KB+, complex setup) with unpdf (~50KB, zero config)
  • Added proper Buffer to Uint8Array conversion for unpdf compatibility
  • Updated package.json to use unpdf v1.4.0
  • Simplified PDF text extraction logic with cleaner error handling

Benefits:

  • ✅ Browser compatible (no worker files needed)
  • ✅ Serverless compatible (AWS Lambda, Cloudflare Workers, etc.)
  • ✅ Smaller bundle size (~90% reduction)
  • ✅ Simpler API and configuration
  • ✅ Works across Node.js, Bun, Deno, and Edge runtimes

Breaking changes: None

  • PDF documents still stored as base64 in database
  • Text fragments still extracted for search
  • All existing functionality preserved

Note

Replaces pdfjs-dist with unpdf for PDF text extraction and updates dependencies/version.

  • PDF extraction:
    • Replace pdfjs-dist with unpdf and refactor convertPdfToTextFromBuffer to use extractText, proper Uint8Array conversion, whitespace cleanup, and improved logging/error handling; remove pdfjs-specific imports/helpers.
  • Dependencies:
    • Add unpdf, remove pdfjs-dist, bump @elizaos/core to ^1.6.4, and bump package version to 1.5.14.

Written by Cursor Bugbot for commit 1d45e95.

Summary by CodeRabbit

  • Chores
    • Version bumped to 1.5.14
    • Updated core dependency to 1.6.4
    • Replaced PDF parsing library with improved alternative for better text extraction reliability and performance

coderabbitai bot commented Nov 7, 2025

Walkthrough

Package version incremented to 1.5.14. Dependency @elizaos/core updated to ^1.6.4. PDF parsing library switched from pdfjs-dist to unpdf. PDF text extraction logic refactored to use the new library's API with simplified text merging and whitespace normalization.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Dependency Updates (package.json) | Version bumped to 1.5.14. @elizaos/core updated from ^1.5.10 to ^1.6.4. pdfjs-dist (^5.2.133) removed. unpdf (^1.4.0) added. |
| PDF Parsing Refactor (src/utils.ts) | Replaced pdfjs-dist imports with unpdf's extractText function. convertPdfToTextFromBuffer rewritten to use unpdf for text extraction with page merging. Removed legacy isTextItem helper and per-page processing logic. Added text deduplication, whitespace normalization, and enhanced logging with page count. |

Sequence Diagram

sequenceDiagram
    participant Caller
    participant convertPdfToTextFromBuffer
    participant unpdf as unpdf.extractText
    participant TextProcessor

    Caller->>convertPdfToTextFromBuffer: PDF buffer
    convertPdfToTextFromBuffer->>unpdf: Extract text from all pages
    unpdf-->>convertPdfToTextFromBuffer: Merged text
    alt Text extracted successfully
        convertPdfToTextFromBuffer->>TextProcessor: Deduplicate & normalize whitespace
        TextProcessor-->>convertPdfToTextFromBuffer: Cleaned text
        convertPdfToTextFromBuffer-->>Caller: Return processed text
    else Empty or no text
        convertPdfToTextFromBuffer-->>Caller: Warning logged, return empty string
    end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Areas for attention:
    • Verify unpdf API usage is correct (extractText function behavior, page merging)
    • Confirm whitespace normalization preserves intended paragraph breaks
    • Check that error handling for empty/missing text is sufficient
    • Validate @elizaos/core version bump compatibility with existing code

Poem

🐰 The PDFs once parsed line by line,
Now unpdf makes extraction so fine!
No more complexity, just text that's clean—
The smoothest refactor we've ever seen! 📖✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: migrating from pdfjs-dist to unpdf for PDF parsing, which is the primary focus of the changeset across package.json and src/utils.ts. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 9b04019 and 1d45e95.

📒 Files selected for processing (2)
  • package.json (3 hunks)
  • src/utils.ts (3 hunks)
🔇 Additional comments (6)
src/utils.ts (4)

4-4: LGTM! Simplified import.

The switch to unpdf's extractText function is cleaner than the previous pdfjs-dist implementation, which required worker configuration.


114-116: LGTM! Documentation updated appropriately.

The JSDoc correctly reflects the migration to unpdf and highlights its universal compatibility across different JavaScript runtimes.


131-133: LGTM! Proper Buffer to Uint8Array conversion.

The conversion correctly creates a pure Uint8Array from the Buffer's underlying ArrayBuffer, accounting for byteOffset and byteLength. This is necessary because while Buffer extends Uint8Array, some libraries like unpdf may require a pure Uint8Array instance.
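A minimal sketch of the conversion described here, assuming a Node.js runtime. The helper name `toPlainUint8Array` is hypothetical and is not taken from the PR; the thread only describes the technique (a plain Uint8Array over the Buffer's underlying ArrayBuffer, bounded by byteOffset and byteLength).

```typescript
import { Buffer } from "node:buffer";

// Hypothetical helper, not the PR's actual code.
// Builds a plain Uint8Array view over the same bytes the Buffer exposes.
// byteOffset and byteLength matter because small Buffers are often slices
// of a larger shared pool, so buf.buffer can contain unrelated bytes.
function toPlainUint8Array(buf: Buffer): Uint8Array {
  return new Uint8Array(buf.buffer, buf.byteOffset, buf.byteLength);
}

const pdfHeader = Buffer.from("%PDF-1.7");
const bytes = toPlainUint8Array(pdfHeader);
console.log(Buffer.isBuffer(bytes), bytes.length);
```

The result is a view, not a copy, so it shares memory with the original Buffer. If the Buffer may be mutated or reused afterwards, `Uint8Array.from(buf)` would produce an independent copy instead.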


135-142: LGTM! Clean implementation of PDF extraction.

The unpdf integration is well-implemented:

  • Correct usage of extractText with mergePages: true
  • Appropriate handling of empty text extraction
  • Good logging for debugging (includes page count and text length)
  • Proper error handling with context preservation

Also applies to: 152-159
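The flow praised above (merged-page extraction, an empty-text guard, and error wrapping) might look roughly like the sketch below. To keep it dependency-free, the extractor is injected as a parameter; in the plugin it would presumably be unpdf's `extractText`, whose `{ totalPages, text }` result shape with `mergePages: true` is assumed here from unpdf's README. The function name `pdfBytesToText`, the `Extractor` type, and the log and error messages are all hypothetical.

```typescript
// Assumed result shape of unpdf's extractText when mergePages is true.
type MergedExtraction = { totalPages: number; text: string };

// Hypothetical type for the injected extractor (stands in for unpdf's extractText).
type Extractor = (
  data: Uint8Array,
  options: { mergePages: true }
) => Promise<MergedExtraction>;

// Hypothetical sketch, not the PR's actual convertPdfToTextFromBuffer.
async function pdfBytesToText(data: Uint8Array, extract: Extractor): Promise<string> {
  try {
    const result = await extract(data, { mergePages: true });

    // Scanned or image-only PDFs can yield no text: warn and return "".
    if (!result.text || result.text.trim().length === 0) {
      console.warn(`No text extracted from PDF (${result.totalPages} pages)`);
      return "";
    }

    console.debug(`Extracted ${result.text.length} chars from ${result.totalPages} pages`);
    return result.text;
  } catch (error) {
    // Keep the original message so callers can see why parsing failed.
    const message = error instanceof Error ? error.message : String(error);
    throw new Error(`Failed to convert PDF to text: ${message}`);
  }
}
```

Injecting the extractor also makes the empty-text and error paths easy to exercise in tests without a real PDF.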

package.json (2)

46-46: Dependency version confirmed.

Version 1.4.0 is the latest stable version of unpdf, and it supports the extractText function with the mergePages option. The dependency specification is appropriate.


4-4: Version updates align with migration strategy.

The patch version bump (1.5.13 → 1.5.14) and @elizaos/core minor version update (^1.5.10 → ^1.6.4) are appropriate for the pdfjs-dist to unpdf migration. All 16+ imported symbols from @elizaos/core across the codebase—including UUID, logger, IAgentRuntime, Memory, State, and others—are part of the documented v1.x public API. The minor version bump follows semantic versioning conventions for backward-compatible changes. Run type checking and integration tests to confirm functionality post-merge.

Comment on lines +145 to +150
const cleanedText = result.text
  .split('\n')
  .map((line: string) => line.trim())
  .filter((line: string) => line.length > 0)
  .join('\n')
  .replace(/\n{3,}/g, '\n\n'); // Max 2 consecutive newlines
⚠️ Potential issue | 🟠 Major

Fix text cleaning logic - it removes all paragraph breaks.

The current implementation has a logic error that defeats its stated purpose. The .filter((line: string) => line.length > 0) step removes ALL empty lines, making the subsequent regex /\n{3,}/g ineffective (it can never match after filtering). This results in losing all paragraph structure in the extracted text.

Apply this diff to preserve paragraph structure while still limiting excessive blank lines:

     // Clean up excessive whitespace while preserving paragraph structure
     const cleanedText = result.text
       .split('\n')
       .map((line: string) => line.trim())
-      .filter((line: string) => line.length > 0)
       .join('\n')
       .replace(/\n{3,}/g, '\n\n'); // Max 2 consecutive newlines

This will:

  • Trim leading/trailing whitespace from each line
  • Preserve empty lines (paragraph breaks)
  • Limit to a maximum of 1 blank line between paragraphs
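The difference between the two pipelines can be checked with a small, self-contained comparison. The function names and the sample string are illustrative only; the final whole-string `trim()` follows the last step of the agent prompt in this comment rather than the diff itself.

```typescript
// Pipeline as merged in the PR: filtering empty lines means the
// /\n{3,}/ replacement can never match afterwards.
function cleanWithFilter(text: string): string {
  return text
    .split("\n")
    .map((line: string) => line.trim())
    .filter((line: string) => line.length > 0)
    .join("\n")
    .replace(/\n{3,}/g, "\n\n");
}

// Suggested pipeline: keep empty lines, then collapse runs of 3+ newlines
// to exactly 2, leaving at most one blank line between paragraphs.
function cleanPreservingParagraphs(text: string): string {
  return text
    .split("\n")
    .map((line: string) => line.trim())
    .join("\n")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}

const sample = "Intro line one.  \n  Intro line two.\n\n\n\nSecond paragraph.\n";

console.log(JSON.stringify(cleanWithFilter(sample)));
// "Intro line one.\nIntro line two.\nSecond paragraph."

console.log(JSON.stringify(cleanPreservingParagraphs(sample)));
// "Intro line one.\nIntro line two.\n\nSecond paragraph."
```

With the filter in place, the paragraph boundary before "Second paragraph." disappears. Without it, the four consecutive newlines collapse to a single blank line.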
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- const cleanedText = result.text
-   .split('\n')
-   .map((line: string) => line.trim())
-   .filter((line: string) => line.length > 0)
-   .join('\n')
-   .replace(/\n{3,}/g, '\n\n'); // Max 2 consecutive newlines
+ const cleanedText = result.text
+   .split('\n')
+   .map((line: string) => line.trim())
+   .join('\n')
+   .replace(/\n{3,}/g, '\n\n'); // Max 2 consecutive newlines
🤖 Prompt for AI Agents
In src/utils.ts around lines 145 to 150, the current cleaning trims each line
then filters out empty lines which removes all paragraph breaks; instead remove
the .filter(...) so empty lines are preserved as paragraph separators, trim each
line, then collapse excessive blank lines by replacing runs of 3+ consecutive
newlines with exactly 2 (so at most one empty line between paragraphs), and
finally trim leading/trailing whitespace of the whole string.

@0xbbjoker 0xbbjoker merged commit f8062fa into 1.x Nov 10, 2025
2 checks passed