[FEAT] integrate OCRmyPDF on document upload in Redis Queue jobs#115
Merged
…ipping in PDF settings
…and update Tesseract timeout description
5786a0a to 713c107
…le fast web view optimization, large image skipping, and set number of `jobs` to 0
- Added `lazy-loader` package to manage heavy dependencies like `cv2`, `pdf2image`, and `ocrmypdf` for improved performance.
- Updated `pdm.lock` to reflect the new package addition and modified existing dependencies.
- Cleaned up import statements in `analysis.py` and `preprocessing.py` to utilize lazy loading, ensuring these libraries are only loaded when needed.
- Removed unnecessary try-except blocks for dependency availability checks, as all required packages are now included in the application.
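The lazy-loading idea in this commit can be illustrated with a minimal sketch. It uses the standard library's `importlib.util.LazyLoader` as a stand-in rather than the `lazy-loader` package itself, and the helper name `lazy_import` is hypothetical. The point is only that a module's body is not executed until one of its attributes is first accessed.

```python
import importlib.util
import sys
from types import ModuleType


def lazy_import(name: str) -> ModuleType:
    """Return a module whose body runs only on first attribute access."""
    if name in sys.modules:
        # Already imported (lazily or not): reuse the existing module object.
        return sys.modules[name]
    spec = importlib.util.find_spec(name)
    if spec is None or spec.loader is None:
        raise ModuleNotFoundError(f"No module named {name!r}")
    # Wrap the real loader so that execution is deferred.
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # Sets up the deferral; does not run the module yet.
    return module


# `colorsys` is a lightweight stand-in for a heavy dependency such as cv2.
colorsys = lazy_import("colorsys")
# The module body executes here, on the first attribute access.
hue, lightness, saturation = colorsys.rgb_to_hls(1.0, 0.0, 0.0)
```

With this pattern, a worker that never touches the heavy module never pays its import cost, which is presumably the motivation for the change.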
- Updated `.env.dev` to enable PDF analysis and set quiet mode to false.
- Introduced `PDFMetadata` and `PDFProcessingResponse` models for structured metadata handling in `analysis.py`.
- Refactored `PDFPreprocessor` to utilize the new models and improve error handling during preprocessing.
- Adjusted type hints in `analysis.py` and `preprocessing.py` for better clarity and consistency.
- Updated job processing to reflect changes in the preprocessing method.
- Updated image handling in `_media.py`, `_hub.py`, and `_datasets.py` to utilize lazy loading for `PIL` and `datasets` modules.
- Adjusted type hints and checks to ensure compatibility with lazy-loaded imports.
- Enhanced error handling and type checking for image processing functions.
- Modified `DataframeData` schema to use an alias for schema definition.
- Renamed `upload_reference_documents_job` to `upload_and_preprocess_documents_job` for clarity in functionality.
- Updated references in `imports.py`, `analysis.py`, and test files to reflect the new job name.
- Improved logging consistency by standardizing logger usage across the `PDFAnalyzer` class.
- Introduced `PDFTextLayerDetector` class to analyze PDF files for existing text layers.
- Added methods for detecting text layers, checking OCR requirements, and retrieving pages needing OCR.
- Refactored existing code to improve clarity and functionality, including the use of dataclasses for structured results.
- Enhanced error handling for encrypted and invalid PDF files.
- Updated module documentation to reflect new functionality.
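As a rough illustration of the "pages needing OCR" check, the sketch below works on text that has already been extracted per page. The names `TextLayerReport`, `detect_text_layer`, and the `min_chars` threshold are all hypothetical and are not taken from the PR; the actual detector may use different criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextLayerReport:
    """Hypothetical structured result; not the PR's actual class."""
    total_pages: int
    pages_needing_ocr: tuple[int, ...]  # 1-based page numbers

    @property
    def needs_ocr(self) -> bool:
        return len(self.pages_needing_ocr) > 0


def detect_text_layer(page_texts: list[str], min_chars: int = 20) -> TextLayerReport:
    """Flag pages whose extracted text has fewer than `min_chars` visible characters."""
    missing = tuple(
        number
        for number, text in enumerate(page_texts, start=1)
        if sum(1 for ch in text if not ch.isspace()) < min_chars
    )
    return TextLayerReport(total_pages=len(page_texts), pages_needing_ocr=missing)


# Page 1 has a usable text layer; pages 2 and 3 are empty or whitespace only.
report = detect_text_layer(["A page with a real, extractable text layer.", "", "  \n "])
```

Returning a frozen dataclass keeps the result easy to log and to pass between queue jobs without accidental mutation.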
- Updated `PDFTextAnalysisResult` to use `field(default_factory=list)` for better default list handling.
- Enhanced OpenCV loading in `margin.py` to set CPU-only mode and added error handling for loading failures.
- Adjusted imports in `preprocessing.py` to correctly reference `PDFAnalyzer` from the margin module.
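The `field(default_factory=list)` change is worth a short illustration. In the sketch below, `AnalysisResult` and its fields are hypothetical stand-ins for `PDFTextAnalysisResult`, whose real fields are not shown here. It shows that each instance gets its own list, and that `dataclasses` rejects a bare `[]` default outright.

```python
from dataclasses import dataclass, field


@dataclass
class AnalysisResult:
    # Hypothetical fields, for illustration only.
    has_text: bool = False
    warnings: list[str] = field(default_factory=list)  # new list per instance


first = AnalysisResult()
second = AnalysisResult()
first.warnings.append("page 3 has no text layer")
# `second.warnings` is unaffected because the factory ran once per instance.

# A plain mutable default is refused when the class is created.
try:
    @dataclass
    class Broken:
        warnings: list[str] = []
except ValueError as exc:
    rejected = str(exc)
```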
- Replaced the `PDFTextLayerDetector` class with `PDFOCRLayerDetector` to streamline OCR text layer detection using `pdfminer`.
- Introduced methods for checking font resources and analyzing character quality in PDFs.
- Removed unused `figures.py` and `tables.py` files to clean up the codebase.
- Enhanced error handling and logging for better debugging and user feedback.
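One plausible way to score "character quality" is sketched below. It is an assumption about the general idea, not the PR's implementation: the function name `character_quality` and the scoring rule are hypothetical. It relies on the fact that `pdfminer.six` emits `(cid:N)` placeholders for glyphs it cannot map to Unicode, and it also counts replacement, control, private-use, and unassigned characters as illegible.

```python
import re
import unicodedata

# Placeholder pdfminer.six writes for glyphs without a Unicode mapping.
CID_MARKER = re.compile(r"\(cid:\d+\)")
# Control, private-use, and unassigned code points count as illegible.
BAD_CATEGORIES = {"Cc", "Co", "Cn"}


def character_quality(text: str) -> float:
    """Return the share of non-whitespace characters that look legible (0.0 to 1.0)."""
    unmapped = len(CID_MARKER.findall(text))
    cleaned = CID_MARKER.sub("", text)
    chars = [ch for ch in cleaned if not ch.isspace()]
    total = len(chars) + unmapped  # each (cid:N) marker counts as one bad glyph
    if total == 0:
        return 0.0  # no text at all: treat as lowest quality
    bad = sum(
        1
        for ch in chars
        if ch == "\ufffd" or unicodedata.category(ch) in BAD_CATEGORIES
    )
    return (len(chars) - bad) / total
```

A low score on a page that nominally has a text layer would suggest the layer is unusable and the page should be OCRed anyway; where to set that cutoff is a tuning decision not covered here.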
This pull request adds OCRmyPDF support on file upload and enhances PDF preprocessing with configurable settings.
Changelog
Please review the new options and let me know if further adjustments are needed.
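For reviewers unfamiliar with OCRmyPDF, here is a minimal sketch of how configurable settings like the ones in this pull request could map onto the `ocrmypdf` command line. The `OcrSettings` container, its field names, and `build_ocrmypdf_command` are hypothetical and not taken from this PR; only the flags `--fast-web-view`, `--skip-big`, `--tesseract-timeout`, and `--jobs` are real OCRmyPDF options. Unset values are simply omitted so that OCRmyPDF's own defaults apply.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OcrSettings:
    """Hypothetical settings container; None means 'use OCRmyPDF's default'."""
    fast_web_view_mb: Optional[float] = None      # --fast-web-view MEGABYTES
    skip_big_megapixels: Optional[float] = None   # --skip-big MEGAPIXELS
    tesseract_timeout_s: Optional[float] = None   # --tesseract-timeout SECONDS
    jobs: Optional[int] = None                    # --jobs N


def build_ocrmypdf_command(src: str, dst: str, settings: OcrSettings) -> list[str]:
    """Translate settings into an argument list suitable for subprocess.run."""
    cmd = ["ocrmypdf"]
    if settings.fast_web_view_mb is not None:
        cmd += ["--fast-web-view", str(settings.fast_web_view_mb)]
    if settings.skip_big_megapixels is not None:
        cmd += ["--skip-big", str(settings.skip_big_megapixels)]
    if settings.tesseract_timeout_s is not None:
        cmd += ["--tesseract-timeout", str(settings.tesseract_timeout_s)]
    if settings.jobs is not None:
        cmd += ["--jobs", str(settings.jobs)]
    cmd += [src, dst]
    return cmd


# Placeholder values, chosen only for the example.
command = build_ocrmypdf_command(
    "in.pdf", "out.pdf", OcrSettings(skip_big_megapixels=50, jobs=2)
)
```

Inside a queue job, the resulting list could be passed to `subprocess.run(command, check=True)`. Building an argument list rather than a shell string avoids quoting problems with uploaded file names.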