[FEAT] integrate OCRmyPDF on document upload in Redis Queue jobs by JonnyTran · Pull Request #115 · Extralit/extralit

JonnyTran · 2025-08-05T18:04:06Z

This pull request brings in OCRmyPDF support on file upload and enhances PDF preprocessing with configurable settings:

added ocrmypdf (0258641)
refactor: enhance PDF preprocessing with configurable settings and integrate OCRmyPDF (b4e15e6)
feat: add margin analysis to PDF preprocessing with opencv-python (b7b8a1b)
feat: enable PDF preprocessing analysis with new configuration options (25a9098)
feat: update PDF preprocessing settings and add new document analysis schemas (ab7f39b)
feat: introduce rotate pages threshold in PDF preprocessing settings and update Tesseract timeout description (3f84066)

Changelog

Configurable Tesseract OCR timeout and text skipping.
Image margin detection via OpenCV.
New rotate page threshold for better orientation correction.
Schema updates for analysis settings.
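
As a rough illustration of how options like these could be passed to OCRmyPDF, here is a minimal sketch. The `PDFPreprocessingSettings` class, its field names, its default values, and the `build_ocr_kwargs` / `run_ocr` helpers are all hypothetical and are not taken from this PR. Only the keyword names handed to `ocrmypdf.ocr` (`tesseract_timeout`, `skip_text`, `rotate_pages`, `rotate_pages_threshold`, `optimize`) are real parameters of that function.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class PDFPreprocessingSettings:
    """Hypothetical settings container; names and defaults are illustrative."""

    tesseract_timeout: float = 180.0      # seconds Tesseract may spend per page
    skip_text: bool = True                # leave pages that already have text alone
    rotate_pages: bool = True             # attempt automatic orientation correction
    rotate_pages_threshold: float = 14.0  # confidence needed before rotating a page
    optimize: int = 0                     # 0 turns off OCRmyPDF's optimization pass


def build_ocr_kwargs(settings: PDFPreprocessingSettings) -> dict[str, Any]:
    """Translate the settings object into keyword arguments for ocrmypdf.ocr()."""
    kwargs: dict[str, Any] = {
        "tesseract_timeout": settings.tesseract_timeout,
        "skip_text": settings.skip_text,
        "rotate_pages": settings.rotate_pages,
        "optimize": settings.optimize,
    }
    # The threshold only matters when page rotation is enabled.
    if settings.rotate_pages:
        kwargs["rotate_pages_threshold"] = settings.rotate_pages_threshold
    return kwargs


def run_ocr(input_path: str, output_path: str,
            settings: PDFPreprocessingSettings) -> Any:
    """Run OCRmyPDF with the given settings (requires ocrmypdf and Tesseract)."""
    import ocrmypdf  # imported here so the module loads without the dependency

    return ocrmypdf.ocr(input_path, output_path, **build_ocr_kwargs(settings))
```

Keeping the keyword construction separate from the `ocrmypdf.ocr` call means the mapping can be unit-tested without Tesseract installed.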

Please review the new options and let me know if further adjustments are needed.

- Added `lazy-loader` package to manage heavy dependencies like `cv2`, `pdf2image`, and `ocrmypdf` for improved performance.
- Updated `pdm.lock` to reflect the new package addition and modified existing dependencies.
- Cleaned up import statements in `analysis.py` and `preprocessing.py` to utilize lazy loading, ensuring these libraries are only loaded when needed.
- Removed unnecessary try-except blocks for dependency availability checks, as all required packages are now included in the application.
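
The commit above relies on the third-party `lazy-loader` package. To show the general idea without that dependency, the sketch below uses the standard library's `importlib.util.LazyLoader` instead; it is a swapped-in technique for illustration, not the code from this PR, and the `lazy_import` helper name is hypothetical.

```python
import importlib.util
import sys
from types import ModuleType


def lazy_import(name: str) -> ModuleType:
    """Return a module whose body runs only on first attribute access."""
    # Reuse the module if something has already imported it.
    if name in sys.modules:
        return sys.modules[name]

    spec = importlib.util.find_spec(name)
    if spec is None or spec.loader is None:
        raise ImportError(f"No module named {name!r}")

    # Wrap the real loader so execution is deferred until the module is used.
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # registers the module; does not run its body yet
    return module


# A heavy dependency such as cv2 would be declared the same way; a small
# stdlib module is used here so the example runs anywhere.
colorsys = lazy_import("colorsys")
hue, saturation, value = colorsys.rgb_to_hsv(1.0, 0.0, 0.0)  # triggers the load
```

With this pattern, module-level declarations stay cheap and the import cost is paid only by code paths that actually touch the library.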

- Updated `.env.dev` to enable PDF analysis and set quiet mode to false.
- Introduced `PDFMetadata` and `PDFProcessingResponse` models for structured metadata handling in `analysis.py`.
- Refactored `PDFPreprocessor` to utilize the new models and improve error handling during preprocessing.
- Adjusted type hints in `analysis.py` and `preprocessing.py` for better clarity and consistency.
- Updated job processing to reflect changes in the preprocessing method.

- Updated image handling in `_media.py`, `_hub.py`, and `_datasets.py` to utilize lazy loading for `PIL` and `datasets` modules.
- Adjusted type hints and checks to ensure compatibility with lazy-loaded imports.
- Enhanced error handling and type checking for image processing functions.
- Modified `DataframeData` schema to use an alias for schema definition.

codecov · 2025-08-08T23:38:45Z

Codecov Report

❌ Patch coverage is 29.78261% with 323 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...er/src/extralit_server/contexts/document/margin.py	14.21%	169 Missing ⚠️
.../src/extralit_server/contexts/document/analysis.py	0.00%	83 Missing ⚠️
...extralit_server/contexts/document/preprocessing.py	49.01%	52 Missing ⚠️
...t-server/src/extralit_server/jobs/document_jobs.py	28.57%	5 Missing ⚠️
extralit/src/extralit/datasets/_io/_hub.py	42.85%	4 Missing ⚠️
extralit/src/extralit/records/_io/_datasets.py	73.33%	4 Missing ⚠️
...xtralit-server/src/extralit_server/contexts/hub.py	81.81%	2 Missing ⚠️
extralit/src/extralit/_helpers/_media.py	81.81%	2 Missing ⚠️
...xtralit-server/src/extralit_server/cli/__main__.py	0.00%	1 Missing ⚠️
extralit/src/extralit/records/_dataset_records.py	66.66%	1 Missing ⚠️

Flag	Coverage Δ
extralit	`64.93% <71.05%> (-0.01%)`	⬇️
extralit-server	`81.13% <26.06%> (?)`

Files with missing lines	Coverage Δ
...it_server/api/schemas/v1/document/preprocessing.py	`100.00% <100.00%> (ø)`
...rver/src/extralit_server/api/schemas/v1/imports.py	`100.00% <100.00%> (ø)`
...xtralit-server/src/extralit_server/cli/__init__.py	`100.00% <100.00%> (ø)`
...lit-server/src/extralit_server/contexts/imports.py	`60.52% <100.00%> (ø)`
extralit/src/extralit/cli/documents/import_bib.py	`7.08% <100.00%> (+0.25%)`	⬆️
extralit/src/extralit/records/_io/__init__.py	`100.00% <ø> (ø)`
...xtralit-server/src/extralit_server/cli/__main__.py	`0.00% <0.00%> (ø)`
extralit/src/extralit/records/_dataset_records.py	`91.39% <66.66%> (-0.50%)`	⬇️
...xtralit-server/src/extralit_server/contexts/hub.py	`60.07% <81.81%> (ø)`
extralit/src/extralit/_helpers/_media.py	`62.12% <81.81%> (+0.21%)`	⬆️
... and 6 more

... and 177 files with indirect coverage changes

Components	Coverage Δ
extralit	`64.93% <71.05%> (-0.01%)`	⬇️
extralit-server	`81.13% <26.06%> (∅)`
extralit-frontend	`10.71% <ø> (+0.12%)`	⬆️

- Renamed `upload_reference_documents_job` to `upload_and_preprocess_documents_job` for clarity in functionality.
- Updated references in `imports.py`, `analysis.py`, and test files to reflect the new job name.
- Improved logging consistency by standardizing logger usage across the `PDFAnalyzer` class.

- Introduced `PDFTextLayerDetector` class to analyze PDF files for existing text layers.
- Added methods for detecting text layers, checking OCR requirements, and retrieving pages needing OCR.
- Refactored existing code to improve clarity and functionality, including the use of dataclasses for structured results.
- Enhanced error handling for encrypted and invalid PDF files.
- Updated module documentation to reflect new functionality.

- Updated `PDFTextAnalysisResult` to use `field(default_factory=list)` for better default list handling.
- Enhanced OpenCV loading in `margin.py` to set CPU-only mode and added error handling for loading failures.
- Adjusted imports in `preprocessing.py` to correctly reference `PDFAnalyzer` from the margin module.
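
For context on the `field(default_factory=list)` change: a dataclass rejects a bare `[]` default with a `ValueError` at class-definition time, and `default_factory` gives every instance its own list. The sketch below shows the pattern; the two fields are hypothetical, since the real fields of `PDFTextAnalysisResult` are not shown on this page.

```python
from dataclasses import dataclass, field


@dataclass
class PDFTextAnalysisResult:
    """Illustrative shape only; field names are assumptions."""

    has_text_layer: bool = False
    # Writing `pages_needing_ocr: list[int] = []` would raise ValueError,
    # because dataclasses refuse mutable defaults shared across instances.
    pages_needing_ocr: list[int] = field(default_factory=list)


first = PDFTextAnalysisResult()
second = PDFTextAnalysisResult()
first.pages_needing_ocr.append(3)  # only `first` is modified
```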

- Replaced the `PDFTextLayerDetector` class with `PDFOCRLayerDetector` to streamline OCR text layer detection using `pdfminer`.
- Introduced methods for checking font resources and analyzing character quality in PDFs.
- Removed unused `figures.py` and `tables.py` files to clean up the codebase.
- Enhanced error handling and logging for better debugging and user feedback.
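
One way a detector like this could decide which pages need OCR is sketched below. Everything here is an assumption for illustration: the `OCRLayerReport`, `char_quality`, `analyze_pages`, and `extract_page_texts` names, the thresholds, and the quality heuristic are hypothetical and not the PR's implementation, and the font-resource check mentioned above is not covered. The only library calls used are `pdfminer.high_level.extract_pages` and `pdfminer.layout.LTTextContainer` from pdfminer.six.

```python
from dataclasses import dataclass, field

COMMON_PUNCTUATION = set(".,;:!?()[]-'\"%/")


@dataclass
class OCRLayerReport:
    """Hypothetical result object listing zero-based pages that need OCR."""

    pages_needing_ocr: list[int] = field(default_factory=list)

    @property
    def needs_ocr(self) -> bool:
        return bool(self.pages_needing_ocr)


def char_quality(text: str) -> float:
    """Share of non-whitespace characters that are alphanumeric or punctuation."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    good = sum(1 for c in chars if c.isalnum() or c in COMMON_PUNCTUATION)
    return good / len(chars)


def analyze_pages(page_texts: list[str], min_chars: int = 20,
                  min_quality: float = 0.8) -> OCRLayerReport:
    """Flag pages whose embedded text is too short or looks like garbage."""
    report = OCRLayerReport()
    for index, text in enumerate(page_texts):
        visible = "".join(text.split())  # drop all whitespace before counting
        if len(visible) < min_chars or char_quality(text) < min_quality:
            report.pages_needing_ocr.append(index)
    return report


def extract_page_texts(pdf_path: str) -> list[str]:
    """Collect the embedded text of each page with pdfminer.six."""
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    texts = []
    for page_layout in extract_pages(pdf_path):
        texts.append("".join(
            element.get_text()
            for element in page_layout
            if isinstance(element, LTTextContainer)
        ))
    return texts
```

A caller would pass `extract_page_texts(path)` into `analyze_pages` and hand the flagged pages to the OCR step; the thresholds would need tuning against real documents.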

JonnyTran added 7 commits August 4, 2025 23:17

added ocrmypdf

0258641

refactor: enhance PDF preprocessing with configurable settings and integrate OCRmyPDF

b4e15e6

feat: add margin analysis to PDF preprocessing with opencv-python

b7b8a1b

feat: enable PDF preprocessing analysis with new configuration options

25a9098

feat: update PDF preprocessing settings and add new document analysis schemas

ab7f39b

feat: add new preprocessing options for Tesseract timeout and text skipping in PDF settings

1a67fbc

feat: introduce rotate pages threshold in PDF preprocessing settings and update Tesseract timeout description

3f84066

jonnywireless marked this pull request as draft August 5, 2025 18:04

jonnywireless assigned priyankeshh Aug 5, 2025

jonnywireless modified the milestones: OCR-2: Document Processing Pipeline & Integration, OCR-4: Evaluation & OCR Service Deployment Aug 5, 2025

jonnywireless changed the title ~~feat: integrate OCRmyPDF on upload~~ [FEAT] integrate OCRmyPDF on document upload in Redis Queue jobs Aug 5, 2025

Merge branch 'develop' into feat/ocrmypdf-on-upload

9889a44

Extralit deleted a comment from codecov Bot Aug 6, 2025

JonnyTran and others added 3 commits August 6, 2025 16:14

Merge branch 'develop' into feat/ocrmypdf-on-upload

42323ea

merge conflicts

b69e49e

initial local commit

713c107

JonnyTran force-pushed the feat/ocrmypdf-on-upload branch from 5786a0a to 713c107 Compare August 8, 2025 18:51

JonnyTran added 7 commits August 8, 2025 11:54

Modified database URL in .env.dev for better compatibility with user profiles

4940b2e

optimize ocrmypdf params by updating optimization level to 0, disable fast web view optimization, large image skipping, and set number of `jobs` to 0

0e59377

Merge branch 'develop' into feat/ocrmypdf-on-upload

85322a8

fix typechecking

1eacf58

JonnyTran self-assigned this Aug 8, 2025

JonnyTran marked this pull request as ready for review August 8, 2025 23:36

JonnyTran requested review from a team as code owners August 8, 2025 23:36

add lazy-loader

484efd8

JonnyTran marked this pull request as draft August 8, 2025 23:56

JonnyTran added 2 commits August 8, 2025 18:01

JonnyTran marked this pull request as ready for review August 9, 2025 19:33

JonnyTran added 5 commits August 9, 2025 16:34

Merge branch 'develop' into feat/ocrmypdf-on-upload

eb952da

chore: EXTRALIT_DATABASE_URL to use a relative path in .env.dev

a8d9ef0

chore: lazy import bibtexparser

edb2b0c

JonnyTran merged commit 80cbaa3 into develop Aug 12, 2025
14 checks passed

JonnyTran merged 26 commits into develop from feat/ocrmypdf-on-upload

3 participants
