feat(plugin-knowledge): migrate from pdfjs-dist to unpdf for universa…#46
feat(plugin-knowledge): migrate from pdfjs-dist to unpdf for universa…#46
Conversation
WalkthroughPackage version incremented to 1.5.14. Dependency Changes
Sequence DiagramsequenceDiagram
participant Caller
participant convertPdfToTextFromBuffer
participant unpdf as unpdf.extractText
participant TextProcessor
Caller->>convertPdfToTextFromBuffer: PDF buffer
convertPdfToTextFromBuffer->>unpdf: Extract text from all pages
unpdf-->>convertPdfToTextFromBuffer: Merged text
alt Text extracted successfully
convertPdfToTextFromBuffer->>TextProcessor: Deduplicate & normalize whitespace
TextProcessor-->>convertPdfToTextFromBuffer: Cleaned text
convertPdfToTextFromBuffer-->>Caller: Return processed text
else Empty or no text
convertPdfToTextFromBuffer-->>Caller: Warning logged, return empty string
end
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (2)
package.json(3 hunks)src/utils.ts(3 hunks)
🔇 Additional comments (6)
src/utils.ts (4)
4-4: LGTM! Simplified import.The switch to unpdf's
extractTextfunction is cleaner than the previous pdfjs-dist implementation, which required worker configuration.
114-116: LGTM! Documentation updated appropriately.The JSDoc correctly reflects the migration to unpdf and highlights its universal compatibility across different JavaScript runtimes.
131-133: LGTM! Proper Buffer to Uint8Array conversion.The conversion correctly creates a pure Uint8Array from the Buffer's underlying ArrayBuffer, accounting for
byteOffsetandbyteLength. This is necessary because while Buffer extends Uint8Array, some libraries like unpdf may require a pure Uint8Array instance.
135-142: LGTM! Clean implementation of PDF extraction.The unpdf integration is well-implemented:
- Correct usage of
extractTextwithmergePages: true- Appropriate handling of empty text extraction
- Good logging for debugging (includes page count and text length)
- Proper error handling with context preservation
Also applies to: 152-159
package.json (2)
46-46: Dependency version confirmed.Version 1.4.0 is the latest stable version of unpdf, and it supports the extractText function with the mergePages option. The dependency specification is appropriate.
4-4: Version updates align with migration strategy.The patch version bump (1.5.13 → 1.5.14) and
@elizaos/coreminor version update (^1.5.10 → ^1.6.4) are appropriate for the pdfjs-dist to unpdf migration. All 16+ imported symbols from@elizaos/coreacross the codebase—including UUID, logger, IAgentRuntime, Memory, State, and others—are part of the documented v1.x public API. The minor version bump follows semantic versioning conventions for backward-compatible changes. Run type checking and integration tests to confirm functionality post-merge.
| const cleanedText = result.text | ||
| .split('\n') | ||
| .map((line: string) => line.trim()) | ||
| .filter((line: string) => line.length > 0) | ||
| .join('\n') | ||
| .replace(/\n{3,}/g, '\n\n'); // Max 2 consecutive newlines |
There was a problem hiding this comment.
Fix text cleaning logic - it removes all paragraph breaks.
The current implementation has a logic error that defeats its stated purpose. The .filter((line: string) => line.length > 0) step removes ALL empty lines, making the subsequent regex /\n{3,}/g ineffective (it can never match after filtering). This results in losing all paragraph structure in the extracted text.
Apply this diff to preserve paragraph structure while still limiting excessive blank lines:
// Clean up excessive whitespace while preserving paragraph structure
const cleanedText = result.text
.split('\n')
.map((line: string) => line.trim())
- .filter((line: string) => line.length > 0)
.join('\n')
.replace(/\n{3,}/g, '\n\n'); // Max 2 consecutive newlinesThis will:
- Trim leading/trailing whitespace from each line
- Preserve empty lines (paragraph breaks)
- Limit to a maximum of 1 blank line between paragraphs
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const cleanedText = result.text | |
| .split('\n') | |
| .map((line: string) => line.trim()) | |
| .filter((line: string) => line.length > 0) | |
| .join('\n') | |
| .replace(/\n{3,}/g, '\n\n'); // Max 2 consecutive newlines | |
| const cleanedText = result.text | |
| .split('\n') | |
| .map((line: string) => line.trim()) | |
| .join('\n') | |
| .replace(/\n{3,}/g, '\n\n'); // Max 2 consecutive newlines |
🤖 Prompt for AI Agents
In src/utils.ts around lines 145 to 150, the current cleaning trims each line
then filters out empty lines which removes all paragraph breaks; instead remove
the .filter(...) so empty lines are preserved as paragraph separators, trim each
line, then collapse excessive blank lines by replacing runs of 3+ consecutive
newlines with exactly 2 (so at most one empty line between paragraphs), and
finally trim leading/trailing whitespace of the whole string.
Replace pdfjs-dist with unpdf to enable PDF text extraction in browser,
serverless, and Node.js environments without worker configuration.
Changes:
Benefits:
Breaking changes: None
Note
Replaces pdfjs-dist with unpdf for PDF text extraction and updates dependencies/version.
pdfjs-distwithunpdfand refactorconvertPdfToTextFromBufferto useextractText, properUint8Arrayconversion, whitespace cleanup, and improved logging/error handling; removepdfjs-specific imports/helpers.unpdf, removepdfjs-dist, bump@elizaos/coreto^1.6.4, and bump package version to1.5.14.Written by Cursor Bugbot for commit 1d45e95. This will update automatically on new commits. Configure here.
Summary by CodeRabbit