feat(plugin-knowledge): migrate from pdfjs-dist to unpdf for universa… by 0xbbjoker · Pull Request #46 · elizaos-plugins/plugin-knowledge

0xbbjoker · 2025-11-07T15:54:23Z

Replace pdfjs-dist with unpdf to enable PDF text extraction in browser,
serverless, and Node.js environments without worker configuration.

Changes:

- Replaced pdfjs-dist (700KB+, complex setup) with unpdf (~50KB, zero config)
- Added proper Buffer to Uint8Array conversion for unpdf compatibility
- Updated package.json to use unpdf v1.4.0
- Simplified PDF text extraction logic with cleaner error handling

Benefits:

✅ Browser compatible (no worker files needed)
✅ Serverless compatible (AWS Lambda, Cloudflare Workers, etc.)
✅ Smaller bundle size (~90% reduction)
✅ Simpler API and configuration
✅ Works across Node.js, Bun, Deno, and Edge runtimes

Breaking changes: None

- PDF documents are still stored as base64 in the database
- Text fragments are still extracted for search
- All existing functionality is preserved

Note

Replaces pdfjs-dist with unpdf for PDF text extraction and updates dependencies/version.

PDF extraction:
- Replace pdfjs-dist with unpdf and refactor convertPdfToTextFromBuffer to use extractText, proper Uint8Array conversion, whitespace cleanup, and improved logging/error handling; remove pdfjs-specific imports/helpers.
Dependencies:
- Add unpdf, remove pdfjs-dist, bump @elizaos/core to ^1.6.4, and bump package version to 1.5.14.

Written by Cursor Bugbot for commit 1d45e95.

Summary by CodeRabbit

Chores
- Version bumped to 1.5.14
- Updated core dependency to 1.6.4
- Replaced PDF parsing library with improved alternative for better text extraction reliability and performance

coderabbitai · 2025-11-07T15:54:39Z

Walkthrough

Package version incremented to 1.5.14. Dependency @elizaos/core updated to ^1.6.4. PDF parsing library switched from pdfjs-dist to unpdf. PDF text extraction logic refactored to use the new library's API with simplified text merging and whitespace normalization.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Dependency Updates `package.json` | Version bumped to 1.5.14. `@elizaos/core` updated from ^1.5.10 to ^1.6.4. pdfjs-dist (^5.2.133) removed. unpdf (^1.4.0) added. |
| PDF Parsing Refactor `src/utils.ts` | Replaced pdfjs-dist imports with unpdf's extractText function. convertPdfToTextFromBuffer rewritten to use unpdf for text extraction with page merging. Removed legacy isTextItem helper and per-page processing logic. Added text deduplication, whitespace normalization, and enhanced logging with page count. |

Sequence Diagram

sequenceDiagram
    participant Caller
    participant convertPdfToTextFromBuffer
    participant unpdf as unpdf.extractText
    participant TextProcessor

    Caller->>convertPdfToTextFromBuffer: PDF buffer
    convertPdfToTextFromBuffer->>unpdf: Extract text from all pages
    unpdf-->>convertPdfToTextFromBuffer: Merged text
    alt Text extracted successfully
        convertPdfToTextFromBuffer->>TextProcessor: Deduplicate & normalize whitespace
        TextProcessor-->>convertPdfToTextFromBuffer: Cleaned text
        convertPdfToTextFromBuffer-->>Caller: Return processed text
    else Empty or no text
        convertPdfToTextFromBuffer-->>Caller: Warning logged, return empty string
    end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Areas for attention:
- Verify unpdf API usage is correct (extractText function behavior, page merging)
- Confirm whitespace normalization preserves intended paragraph breaks
- Check that error handling for empty/missing text is sufficient
- Validate @elizaos/core version bump compatibility with existing code

Poem

🐰 The PDFs once parsed line by line,
Now unpdf makes extraction so fine!
No more complexity, just text that's clean—
The smoothest refactor we've ever seen! 📖✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: migrating from pdfjs-dist to unpdf for PDF parsing, which is the primary focus of the changeset across package.json and src/utils.ts. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |


coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 9b04019 and 1d45e95.

📒 Files selected for processing (2)

package.json (3 hunks)
src/utils.ts (3 hunks)

🔇 Additional comments (6)

src/utils.ts (4)

4-4: LGTM! Simplified import.

The switch to unpdf's extractText function is cleaner than the previous pdfjs-dist implementation, which required worker configuration.

114-116: LGTM! Documentation updated appropriately.

The JSDoc correctly reflects the migration to unpdf and highlights its universal compatibility across different JavaScript runtimes.

131-133: LGTM! Proper Buffer to Uint8Array conversion.

The conversion correctly creates a pure Uint8Array from the Buffer's underlying ArrayBuffer, accounting for byteOffset and byteLength. This is necessary because while Buffer extends Uint8Array, some libraries like unpdf may require a pure Uint8Array instance.
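
As a minimal sketch of the conversion described above (not the code from the PR itself), the snippet below re-wraps a view in a plain Uint8Array. The names `BufferLike` and `toPlainUint8Array` are hypothetical; `BufferLike` stands in for Node's Buffer so the example does not depend on Node typings.

```typescript
// Hypothetical stand-in for Node's Buffer: a Uint8Array subclass that
// views only part of a larger backing ArrayBuffer (as pooled Buffers do).
class BufferLike extends Uint8Array {}

// Re-wrap the same bytes in a plain Uint8Array without copying.
// byteOffset and byteLength are required because the view may cover
// only a slice of the backing ArrayBuffer.
function toPlainUint8Array(view: Uint8Array): Uint8Array {
  return new Uint8Array(view.buffer, view.byteOffset, view.byteLength);
}

// A 16-byte pool with the bytes "%PDF" stored at offset 4.
const pool = new ArrayBuffer(16);
new Uint8Array(pool).set([0x25, 0x50, 0x44, 0x46], 4);
const pdfBytes = new BufferLike(pool, 4, 4);

// Correct: exactly the 4 bytes the view covers.
const plain = toPlainUint8Array(pdfBytes);

// Naive: ignores offset and length, so it exposes the whole 16-byte pool.
const naive = new Uint8Array(pdfBytes.buffer);
```

Here `plain` covers only the four bytes of the original view, while `naive` spans the entire pool, which is the failure mode the offset and length arguments guard against.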

135-142: LGTM! Clean implementation of PDF extraction.

The unpdf integration is well-implemented:

- Correct usage of extractText with mergePages: true
- Appropriate handling of empty text extraction
- Good logging for debugging (includes page count and text length)
- Proper error handling with context preservation

Also applies to: 152-159
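
The flow praised in this comment can be sketched roughly as follows. This is an illustration under stated assumptions, not the plugin's actual convertPdfToTextFromBuffer: the `Extractor` type, the `pdfToText` function, and the `name` parameter are hypothetical, and the extractor is injected so the sketch runs without the unpdf package installed. In the plugin, that role would be played by unpdf's extractText called with `mergePages: true`.

```typescript
// Result shape assumed for an extraction with mergePages enabled:
// a page count plus a single merged string.
interface MergedExtraction {
  totalPages: number;
  text: string;
}

// Hypothetical injection point standing in for unpdf's extractText.
type Extractor = (
  data: Uint8Array,
  options: { mergePages: true }
) => Promise<MergedExtraction>;

async function pdfToText(
  data: Uint8Array,
  extract: Extractor,
  name = "document.pdf"
): Promise<string> {
  try {
    const { text, totalPages } = await extract(data, { mergePages: true });

    // Empty or whitespace-only result: warn and return an empty string.
    if (!text || text.trim().length === 0) {
      console.warn(`No text extracted from ${name} (${totalPages} pages)`);
      return "";
    }
    return text;
  } catch (error) {
    // Preserve the original failure reason while adding context.
    const message = error instanceof Error ? error.message : String(error);
    throw new Error(`Failed to convert PDF to text (${name}): ${message}`);
  }
}

// Fake extractors used only to exercise the three paths.
const okExtractor: Extractor = async () => ({ totalPages: 2, text: "Hello\nWorld" });
const emptyExtractor: Extractor = async () => ({ totalPages: 1, text: "   " });
const failingExtractor: Extractor = async () => {
  throw new Error("corrupt file");
};
```

Injecting the extractor is a choice made for this sketch only; it keeps the success, empty-text, and error paths testable without a real PDF.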

package.json (2)

46-46: Dependency version confirmed.

Version 1.4.0 is the latest stable version of unpdf, and it supports the extractText function with the mergePages option. The dependency specification is appropriate.

4-4: Version updates align with migration strategy.

The patch version bump (1.5.13 → 1.5.14) and @elizaos/core minor version update (^1.5.10 → ^1.6.4) are appropriate for the pdfjs-dist to unpdf migration. All 16+ imported symbols from @elizaos/core across the codebase—including UUID, logger, IAgentRuntime, Memory, State, and others—are part of the documented v1.x public API. The minor version bump follows semantic versioning conventions for backward-compatible changes. Run type checking and integration tests to confirm functionality post-merge.

coderabbitai · 2025-11-07T15:58:19Z

src/utils.ts

+    const cleanedText = result.text
+      .split('\n')
+      .map((line: string) => line.trim())
+      .filter((line: string) => line.length > 0)
+      .join('\n')
+      .replace(/\n{3,}/g, '\n\n'); // Max 2 consecutive newlines


⚠️ Potential issue | 🟠 Major

Fix text cleaning logic - it removes all paragraph breaks.

The current implementation has a logic error that defeats its stated purpose. The .filter((line: string) => line.length > 0) step removes ALL empty lines, making the subsequent regex /\n{3,}/g ineffective (it can never match after filtering). This results in losing all paragraph structure in the extracted text.

Apply this diff to preserve paragraph structure while still limiting excessive blank lines:

     // Clean up excessive whitespace while preserving paragraph structure
     const cleanedText = result.text
       .split('\n')
       .map((line: string) => line.trim())
-      .filter((line: string) => line.length > 0)
       .join('\n')
       .replace(/\n{3,}/g, '\n\n'); // Max 2 consecutive newlines

This will:

- Trim leading/trailing whitespace from each line
- Preserve empty lines (paragraph breaks)
- Limit to a maximum of 1 blank line between paragraphs
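
To make the difference concrete, here is a small side-by-side sketch of the two cleanup variants applied to the same input. The function names `cleanWithFilter` and `cleanPreservingParagraphs` and the sample string are made up for illustration; the chained calls mirror the snippet under review.

```typescript
// Variant from the PR: dropping empty lines removes every paragraph break,
// so the final regex can never match.
function cleanWithFilter(text: string): string {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .join("\n")
    .replace(/\n{3,}/g, "\n\n");
}

// Suggested variant: keep empty lines, then cap runs of newlines at two
// (at most one blank line between paragraphs).
function cleanPreservingParagraphs(text: string): string {
  return text
    .split("\n")
    .map((line) => line.trim())
    .join("\n")
    .replace(/\n{3,}/g, "\n\n");
}

// Two paragraphs separated by three blank lines, with stray spaces.
const sample = "  First paragraph.  \n\n\n\n  Second paragraph.\nStill second.  ";

const filtered = cleanWithFilter(sample);
// "First paragraph.\nSecond paragraph.\nStill second."

const preserved = cleanPreservingParagraphs(sample);
// "First paragraph.\n\nSecond paragraph.\nStill second."
```

With the filter in place the two paragraphs run together; without it, the single blank line between them survives.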

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    const cleanedText = result.text
-      .split('\n')
-      .map((line: string) => line.trim())
-      .filter((line: string) => line.length > 0)
-      .join('\n')
-      .replace(/\n{3,}/g, '\n\n'); // Max 2 consecutive newlines
+    const cleanedText = result.text
+      .split('\n')
+      .map((line: string) => line.trim())
+      .join('\n')
+      .replace(/\n{3,}/g, '\n\n'); // Max 2 consecutive newlines

🤖 Prompt for AI Agents

In src/utils.ts around lines 145 to 150, the current cleaning trims each line then filters out empty lines which removes all paragraph breaks; instead remove the .filter(...) so empty lines are preserved as paragraph separators, trim each line, then collapse excessive blank lines by replacing runs of 3+ consecutive newlines with exactly 2 (so at most one empty line between paragraphs), and finally trim leading/trailing whitespace of the whole string.

feat(plugin-knowledge): migrate from pdfjs-dist to unpdf for universa…

1d45e95

…l PDF parsing

coderabbitai bot reviewed Nov 7, 2025

0xbbjoker merged commit f8062fa into 1.x Nov 10, 2025
2 checks passed

0xbbjoker merged 1 commit into 1.x from feat/migrate-to-unpdf-parser
