
[FEAT] integrate OCRmyPDF on document upload in Redis Queue jobs #115

Merged
JonnyTran merged 26 commits into develop from feat/ocrmypdf-on-upload on Aug 12, 2025

Conversation

@JonnyTran
Member

This pull request brings in OCRmyPDF support on file upload and enhances PDF preprocessing with configurable settings:

  1. added ocrmypdf (0258641)
  2. refactor: enhance PDF preprocessing with configurable settings and integrate OCRmyPDF (b4e15e6)
  3. feat: add margin analysis to PDF preprocessing with opencv-python (b7b8a1b)
  4. feat: enable PDF preprocessing analysis with new configuration options (25a9098)
  5. feat: update PDF preprocessing settings and add new document analysis schemas (ab7f39b)
  6. feat: introduce rotate pages threshold in PDF preprocessing settings and update Tesseract timeout description (3f84066)
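
As a rough illustration of how configurable settings like these could be passed to OCRmyPDF, here is a minimal sketch. `OCRSettings`, `to_ocrmypdf_kwargs`, and `run_ocr` are hypothetical names, not taken from this PR, and the values shown are illustrative only. The field names are chosen to match keyword arguments accepted by `ocrmypdf.ocr`.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class OCRSettings:
    """Hypothetical settings container (not the PR's actual schema).

    Field names mirror keyword arguments of ocrmypdf.ocr; values are
    illustrative, not the project's configured defaults.
    """

    skip_text: bool = True                # leave pages that already have text alone
    rotate_pages: bool = True             # try to correct page orientation
    rotate_pages_threshold: float = 14.0  # confidence required before rotating
    tesseract_timeout: float = 180.0      # seconds Tesseract may spend per page


def to_ocrmypdf_kwargs(settings: OCRSettings) -> dict:
    """Turn the settings object into keyword arguments for ocrmypdf.ocr."""
    return asdict(settings)


def run_ocr(input_pdf: str, output_pdf: str, settings: OCRSettings):
    """Run OCRmyPDF with the given settings (requires ocrmypdf to be installed)."""
    import ocrmypdf  # imported here so the sketch loads without the dependency

    return ocrmypdf.ocr(input_pdf, output_pdf, **to_ocrmypdf_kwargs(settings))
```

Keeping the settings in one frozen dataclass makes it straightforward to populate them from environment variables or a request schema before a queue job runs.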

Changelog

  • Configurable Tesseract OCR timeout and skipping of pages that already contain text.
  • Image margin detection via OpenCV.
  • New rotate page threshold for better orientation correction.
  • Schema updates for analysis settings.
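
To illustrate the idea behind image margin detection, here is a dependency-free sketch. The PR uses OpenCV for this; the code below is a plain-Python stand-in and does not reflect how `margin.py` is written. The function name `content_margins` and the `white_threshold` parameter are assumptions made for the example.

```python
def content_margins(gray, white_threshold=250):
    """Measure blank margins around the content of a grayscale page image.

    `gray` is a list of rows, each a list of pixel intensities (0-255).
    Pixels darker than `white_threshold` count as content.
    Returns a dict of margin widths in pixels, or None for a blank page.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0

    # Indices of rows and columns that contain at least one content pixel.
    rows = [y for y, row in enumerate(gray) if any(p < white_threshold for p in row)]
    cols = [x for x in range(width) if any(row[x] < white_threshold for row in gray)]

    if not rows:
        return None  # nothing darker than the threshold: treat as blank

    return {
        "top": rows[0],
        "bottom": height - 1 - rows[-1],
        "left": cols[0],
        "right": width - 1 - cols[-1],
    }
```

With OpenCV and NumPy, the same bounding-box computation would typically be done on a thresholded array rather than nested lists.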

Please review the new options and let me know if further adjustments are needed.

@jonnywireless jonnywireless marked this pull request as draft August 5, 2025 18:04
@jonnywireless jonnywireless changed the title feat: integrate OCRmyPDF on upload [FEAT] integrate OCRmyPDF on document upload in Redis Queue jobs Aug 5, 2025
@Extralit Extralit deleted a comment from codecov Bot Aug 6, 2025
@JonnyTran JonnyTran force-pushed the feat/ocrmypdf-on-upload branch from 5786a0a to 713c107 Compare August 8, 2025 18:51
…le fast web view optimization, large image skipping, and set number of `jobs` to 0
- Added `lazy-loader` package to manage heavy dependencies like `cv2`, `pdf2image`, and `ocrmypdf` for improved performance.
- Updated `pdm.lock` to reflect the new package addition and modified existing dependencies.
- Cleaned up import statements in `analysis.py` and `preprocessing.py` to utilize lazy loading, ensuring these libraries are only loaded when needed.
- Removed unnecessary try-except blocks for dependency availability checks, as all required packages are now included in the application.
- Updated `.env.dev` to enable PDF analysis and set quiet mode to false.
- Introduced `PDFMetadata` and `PDFProcessingResponse` models for structured metadata handling in `analysis.py`.
- Refactored `PDFPreprocessor` to utilize the new models and improve error handling during preprocessing.
- Adjusted type hints in `analysis.py` and `preprocessing.py` for better clarity and consistency.
- Updated job processing to reflect changes in the preprocessing method.
- Updated image handling in `_media.py`, `_hub.py`, and `_datasets.py` to utilize lazy loading for `PIL` and `datasets` modules.
- Adjusted type hints and checks to ensure compatibility with lazy-loaded imports.
- Enhanced error handling and type checking for image processing functions.
- Modified `DataframeData` schema to use an alias for schema definition.
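
The lazy-loading approach described in the commits above can be sketched with the standard library alone. The PR uses the `lazy-loader` package; the code below instead uses `importlib.util.LazyLoader` as a stand-in, and the helper name `lazy_import` is hypothetical.

```python
import importlib.util
import sys


def lazy_import(name):
    """Return module `name`, deferring execution of its code until first use."""
    if name in sys.modules:
        return sys.modules[name]  # already imported, nothing to defer

    spec = importlib.util.find_spec(name)
    if spec is None or spec.loader is None:
        raise ImportError(f"No module named {name!r}")

    # Wrap the real loader so the module body runs on first attribute access.
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)
    return module


# A small stdlib module stands in for a heavy dependency such as cv2.
colorsys = lazy_import("colorsys")
hsv = colorsys.rgb_to_hsv(1.0, 0.0, 0.0)  # first attribute access triggers the import
```

The trade-off is that an import failure surfaces at first use rather than at startup, which is one reason to keep heavy dependencies declared as required packages.
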
@JonnyTran JonnyTran self-assigned this Aug 8, 2025
@JonnyTran JonnyTran marked this pull request as ready for review August 8, 2025 23:36
@JonnyTran JonnyTran requested review from a team as code owners August 8, 2025 23:36
@codecov

codecov Bot commented Aug 8, 2025

Codecov Report

❌ Patch coverage is 29.78261% with 323 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...er/src/extralit_server/contexts/document/margin.py | 14.21% | 169 Missing ⚠️ |
| .../src/extralit_server/contexts/document/analysis.py | 0.00% | 83 Missing ⚠️ |
| ...extralit_server/contexts/document/preprocessing.py | 49.01% | 52 Missing ⚠️ |
| ...t-server/src/extralit_server/jobs/document_jobs.py | 28.57% | 5 Missing ⚠️ |
| extralit/src/extralit/datasets/_io/_hub.py | 42.85% | 4 Missing ⚠️ |
| extralit/src/extralit/records/_io/_datasets.py | 73.33% | 4 Missing ⚠️ |
| ...xtralit-server/src/extralit_server/contexts/hub.py | 81.81% | 2 Missing ⚠️ |
| extralit/src/extralit/_helpers/_media.py | 81.81% | 2 Missing ⚠️ |
| ...xtralit-server/src/extralit_server/cli/__main__.py | 0.00% | 1 Missing ⚠️ |
| extralit/src/extralit/records/_dataset_records.py | 66.66% | 1 Missing ⚠️ |

| Flag | Coverage Δ |
|---|---|
| extralit | 64.93% <71.05%> (-0.01%) ⬇️ |
| extralit-server | 81.13% <26.06%> (∅) |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| ...it_server/api/schemas/v1/document/preprocessing.py | 100.00% <100.00%> (ø) |
| ...rver/src/extralit_server/api/schemas/v1/imports.py | 100.00% <100.00%> (ø) |
| ...xtralit-server/src/extralit_server/cli/__init__.py | 100.00% <100.00%> (ø) |
| ...lit-server/src/extralit_server/contexts/imports.py | 60.52% <100.00%> (ø) |
| extralit/src/extralit/cli/documents/import_bib.py | 7.08% <100.00%> (+0.25%) ⬆️ |
| extralit/src/extralit/records/_io/__init__.py | 100.00% <ø> (ø) |
| ...xtralit-server/src/extralit_server/cli/__main__.py | 0.00% <0.00%> (ø) |
| extralit/src/extralit/records/_dataset_records.py | 91.39% <66.66%> (-0.50%) ⬇️ |
| ...xtralit-server/src/extralit_server/contexts/hub.py | 60.07% <81.81%> (ø) |
| extralit/src/extralit/_helpers/_media.py | 62.12% <81.81%> (+0.21%) ⬆️ |

... and 6 more

... and 177 files with indirect coverage changes

| Components | Coverage Δ |
|---|---|
| extralit | 64.93% <71.05%> (-0.01%) ⬇️ |
| extralit-server | 81.13% <26.06%> (∅) |
| extralit-frontend | 10.71% <ø> (+0.12%) ⬆️ |


@JonnyTran JonnyTran marked this pull request as draft August 8, 2025 23:56
- Renamed `upload_reference_documents_job` to `upload_and_preprocess_documents_job` for clarity in functionality.
- Updated references in `imports.py`, `analysis.py`, and test files to reflect the new job name.
- Improved logging consistency by standardizing logger usage across the `PDFAnalyzer` class.
- Introduced `PDFTextLayerDetector` class to analyze PDF files for existing text layers.
- Added methods for detecting text layers, checking OCR requirements, and retrieving pages needing OCR.
- Refactored existing code to improve clarity and functionality, including the use of dataclasses for structured results.
- Enhanced error handling for encrypted and invalid PDF files.
- Updated module documentation to reflect new functionality.
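
A minimal sketch of the text-layer check described above, assuming the text of each page has already been extracted by a PDF library. `TextLayerReport`, `analyze_text_layer`, and the `min_chars` cutoff are hypothetical and are not the PR's `PDFTextLayerDetector` API.

```python
from dataclasses import dataclass, field


@dataclass
class TextLayerReport:
    """Hypothetical result object for a text-layer scan."""

    total_pages: int = 0
    # default_factory gives each report its own list instead of a shared one.
    pages_needing_ocr: list[int] = field(default_factory=list)

    @property
    def needs_ocr(self) -> bool:
        return bool(self.pages_needing_ocr)


def analyze_text_layer(page_texts, min_chars=20):
    """Flag pages whose extracted text is too short to be a real text layer.

    `page_texts` holds one string per page; page numbers are 1-based.
    """
    report = TextLayerReport(total_pages=len(page_texts))
    for page_number, text in enumerate(page_texts, start=1):
        visible_chars = len("".join(text.split()))  # ignore whitespace
        if visible_chars < min_chars:
            report.pages_needing_ocr.append(page_number)
    return report
```

A caller could skip OCR entirely when `needs_ocr` is false, or pass only the flagged pages to the OCR step.
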
@JonnyTran JonnyTran marked this pull request as ready for review August 9, 2025 19:33
- Updated `PDFTextAnalysisResult` to use `field(default_factory=list)` for better default list handling.
- Enhanced OpenCV loading in `margin.py` to set CPU-only mode and added error handling for loading failures.
- Adjusted imports in `preprocessing.py` to correctly reference `PDFAnalyzer` from the margin module.
- Replaced the `PDFTextLayerDetector` class with `PDFOCRLayerDetector` to streamline OCR text layer detection using `pdfminer`.
- Introduced methods for checking font resources and analyzing character quality in PDFs.
- Removed unused `figures.py` and `tables.py` files to clean up the codebase.
- Enhanced error handling and logging for better debugging and user feedback.
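
One way to think about character quality analysis is to score how much of the extracted text consists of usable characters. The heuristic below is an assumption made for illustration, not the logic in `PDFOCRLayerDetector`; `char_quality` is a hypothetical name.

```python
import unicodedata


def char_quality(text):
    """Return the fraction of non-whitespace characters that look usable.

    Characters count as bad if they are the Unicode replacement character
    (U+FFFD) or belong to a "C" category (control, format, private use,
    unassigned), which often signals a broken or garbled text layer.
    Returns 0.0 for text with no visible characters.
    """
    visible = [c for c in text if not c.isspace()]
    if not visible:
        return 0.0
    bad = sum(
        1
        for c in visible
        if c == "\ufffd" or unicodedata.category(c).startswith("C")
    )
    return 1.0 - bad / len(visible)
```

A low score on a page that nominally has a text layer would suggest re-running OCR on that page rather than trusting the embedded text.
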
@JonnyTran JonnyTran merged commit 80cbaa3 into develop Aug 12, 2025
14 checks passed

3 participants