feat: Implement RQ-based PyMuPDF integration for async PDF processing #135
Merged
JonnyTran merged 14 commits on Aug 20, 2025
JonnyTran
reviewed
Aug 17, 2025
Comment on lines +38 to +43

    def get_redis_connection():
        """Get or create Redis connection."""
        global _redis
        if _redis is None:
            _redis = redis.from_url(REDIS_URL)
        return _redis

Member
You can use `from extralit_server.jobs.queues import REDIS_CONNECTION`:

    from extralit_server.settings import settings

    if settings.redis_use_cluster:
        REDIS_CONNECTION = RedisCluster.from_url(settings.redis_url)
    else:
        REDIS_CONNECTION = redis.from_url(settings.redis_url)

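The selection logic in the snippet above can be sketched in a form that runs without a Redis server. This is only an illustration: the `Settings` dataclass is a hypothetical stand-in for `extralit_server.settings.settings`, and the two factory arguments stand in for `redis.from_url` and `RedisCluster.from_url`.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Settings:
    # Hypothetical stand-in for extralit_server.settings.settings.
    redis_url: str = "redis://localhost:6379/0"
    redis_use_cluster: bool = False


def make_connection(
    settings: Settings,
    single_factory: Callable[[str], Any],
    cluster_factory: Callable[[str], Any],
) -> Any:
    """Build one connection, choosing the cluster or single-node factory."""
    factory = cluster_factory if settings.redis_use_cluster else single_factory
    return factory(settings.redis_url)
```

With redis-py installed, `make_connection(settings, redis.from_url, RedisCluster.from_url)` would give the same result as the `if`/`else` above. Building the connection once in a shared module means other modules do not need their own `global` cache.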
JonnyTran
reviewed
Aug 17, 2025
JonnyTran
reviewed
Aug 17, 2025
- Fix trailing whitespace and formatting issues
- Apply ruff formatting to RQ client and text modules
- Ensure code follows project style guidelines

- Fix trailing whitespace and code formatting issues
- Apply ruff formatting to pdf_extraction_jobs.py and document_jobs.py
- Ensure code follows project style guidelines
c4bc4db to 144c517
JonnyTran
reviewed
Aug 20, 2025
JonnyTran
reviewed
Aug 20, 2025
…DF workflow
- Updated `pyproject.toml` to use `rq` version 2.4.1.
- Removed the `rq_client.py` file and its associated functions to streamline job management.
- Adjusted job handling in `jobs.py` and `pdf.py` to reflect the removal of the RQ client, ensuring proper job enqueueing and dependency management.
- Cleaned up unused imports and improved type handling in `text.py`.
JonnyTran
reviewed
Aug 20, 2025
JonnyTran
reviewed
Aug 20, 2025
Member
I don't think we need this file, since it's already being called in start_pdf_workflow
- Deleted the `pdf_extraction_jobs.py` file to simplify job orchestration.
- Removed `PDF_QUEUE` from the job queues, transitioning to `DEFAULT_QUEUE` for PDF extraction tasks.
- Updated `text.py` to eliminate unused functions and improve code clarity.
- Adjusted the PDF workflow to reflect changes in job handling and ensure proper integration with the new structure.
Member
This pull request introduces a new job queue specifically for PDF OCR processing and updates the PDF workflow to use this queue for text extraction jobs. The main goal is to better organize and route jobs related to PDF OCR, separating them from the default queue, and to prepare for future workflow enhancements.
Job queue improvements:
Workflow enhancements:
JonnyTran added a commit that referenced this pull request on Aug 22, 2025
* design
* Update rq dependency to version 2.4.1 and adjust pdm.lock accordingly
* design
* design v2
* design v3
* design v3
* design v4
* design v5
* design v5
* 1.1 Create combined PDF processing job function
* Refactor PDF processing job function for improved efficiency
* 1.5 Update process_bulk_upload function
* refactor
* task 2.4: Add workflow status monitoring
* fix AsyncSessionLocal
* Refactor database migration and PDF workflow to enhance structure and clarity
* Refactor PDF workflow functions to use workspace name instead of ID and improve document handling for uploads without associated files
* fix
* Update metadata fields to be optional and improve error handling in PDF analysis
* fix REDIS_CONNECTION arg
* refactor
* latest
* Update metadata fields in Analysis, Preprocessing, Text Extraction, Table Extraction, and Embedding classes to be optional strings, enhancing flexibility in handling completion timestamps.
* feat: Implement RQ-based PyMuPDF integration for async PDF processing (#135)
* fix: apply code formatting and linting fixes
  - Fix trailing whitespace and formatting issues
  - Apply ruff formatting to RQ client and text modules
  - Ensure code follows project style guidelines
* fix: apply formatting and linting to PDF extraction pipeline
  - Fix trailing whitespace and code formatting issues
  - Apply ruff formatting to pdf_extraction_jobs.py and document_jobs.py
  - Ensure code follows project style guidelines
* fix: apply final ruff formatting fixes
* fix: update default extraction queue name to 'pdf_queue'
* refactor: remove unused PDF extraction logic from document upload job
* addedpdf extraction
* feat: add PDF_QUEUE for PyMuPDF extraction jobs
* feat: integrate PyMuPDF extraction job into PDF workflow
* fix: add missing Optional import in jobs.py and add queue test
* test: organize all test scripts into tests/ folder with comprehensive runners
* refactor: update RQ client integration and clean up job handling in PDF workflow
  - Updated `pyproject.toml` to use `rq` version 2.4.1.
  - Removed the `rq_client.py` file and its associated functions to streamline job management.
  - Adjusted job handling in `jobs.py` and `pdf.py` to reflect the removal of the RQ client, ensuring proper job enqueueing and dependency management.
  - Cleaned up unused imports and improved type handling in `text.py`.
* refactor: remove PDF_QUEUE and streamline PDF extraction workflow
  - Deleted the `pdf_extraction_jobs.py` file to simplify job orchestration.
  - Removed `PDF_QUEUE` from the job queues, transitioning to `DEFAULT_QUEUE` for PDF extraction tasks.
  - Updated `text.py` to eliminate unused functions and improve code clarity.
  - Adjusted the PDF workflow to reflect changes in job handling and ensure proper integration with the new structure.
* renamed PDF_QUEUE to PDF_OCR_QUEUE

---------
Co-authored-by: JonnyTran <nhat.c.tran@gmail.com>

* feat: Add OCR_QUEUE for improved job handling in PDF workflows
  - Introduced OCR_QUEUE to manage OCR-related jobs.
  - Updated worker options to include OCR_QUEUE for enhanced queue listening.
  - Refactored PDF workflow to utilize OCR_QUEUE for text extraction jobs, replacing the previous PDF_OCR_QUEUE reference.
* refactor: Rename ImportHistory table to 'imports' for consistency
  - Updated the ImportHistory model and associated database migration to reflect the new table name 'imports'.
  - Adjusted references in the database model and migration scripts accordingly.
* refactoring
* refactor
* renames
* requirement changes
* use group instead of job ids
* updated DocumentWorkflow class
* rq.Group updated tasks and design
* Refactor workflows to use RQ Groups for job tracking
  - Replace job_ids with group_id and status in DocumentWorkflow
  - Implement workflow status and job queries using RQ Groups
  - Update process_bulk_upload and create_document_workflow for group-based orchestration
  - Add workflow status helpers and resumability methods to model
  - Update API schemas and Alembic migration for new fields
* Add TextExtractionMetadata schema and update bulk upload processing
  - Introduced TextExtractionMetadata class for tracking text extraction results.
  - Updated process_bulk_upload to use type hints for better clarity.
  - Refactored document upload handling to streamline document creation and job enqueuing.
  - Changed job queue from DEFAULT_QUEUE to OCR_QUEUE for text extraction tasks.
* Update job_ids type in DocumentsBulkResponse and refactor create_document_workflow return type
  - Changed job_ids in DocumentsBulkResponse from dict[str, Any] to dict[str, str] for better clarity.
  - Refactored create_document_workflow to return an RQ Group instead of a dictionary, simplifying the workflow tracking process.
* 2.1 Implement RQ Groups-based job querying
* Add CLI and API support for PDF workflow management
  - Implement FastAPI endpoints for workflow start, status, restart, and list
  - Add Pydantic schemas for workflow API requests and responses
  - Integrate workflow router into API routes
  - Add CLI commands for workflow start, status, restart, and list with Rich output
  - Extend workflow context for RQ Groups operations and error handling
  - Add unit tests for jobs API with RQ Groups integration
* fixes
* tests
* fix tests

---------
Co-authored-by: Priyankesh <priyankeshom@gmail.com>
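The commit message mentions deriving a workflow status from the jobs tracked in an RQ Group. A minimal sketch of that kind of aggregation follows; it is not the project's implementation. The input strings are RQ's job status names, while the function name and the workflow-level labels (`pending`, `running`, `completed`, `failed`) are assumptions made for illustration.

```python
from typing import Iterable

# RQ job statuses treated here as terminal failures (an assumption of this sketch).
FAILURE_STATUSES = {"failed", "stopped", "canceled"}


def summarize_workflow(job_statuses: Iterable[str]) -> str:
    """Collapse per-job status strings into one workflow-level label."""
    statuses = list(job_statuses)
    if not statuses:
        return "pending"  # nothing enqueued yet
    if any(s in FAILURE_STATUSES for s in statuses):
        return "failed"  # one broken job fails the whole workflow
    if all(s == "finished" for s in statuses):
        return "completed"
    if any(s in ("started", "finished") for s in statuses):
        return "running"  # some work has begun but not all is done
    return "pending"  # only queued, deferred, or scheduled jobs
```

For example, `summarize_workflow(["finished", "queued"])` yields `"running"`. In a group-based setup the input would presumably be gathered from the jobs belonging to one group; that wiring is left out here.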
feat: Implement RQ-based PyMuPDF integration for async PDF processing
🎯 Overview
This PR implements Phase 2 of the RQ-PyMuPDF integration plan, adding direct Redis Queue (RQ) communication between extralit-server and extralit-hf-space workers for asynchronous PDF processing with AGPL compliance.
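As a rough sketch of what direct RQ communication can look like on the server side, the function below enqueues the worker's job by its dotted string path, so the server never imports the worker's extraction code. The job path is the one quoted later in this description. The `document_ref` argument, the timeout value, and the `QueueLike` protocol are hypothetical, since the real job signature is not shown here.

```python
from typing import Any, Protocol

# Dotted path quoted in this PR description; the worker resolves it at run time.
EXTRACT_JOB_PATH = "src.jobs.extraction_jobs.extract_pdf_markdown_job"


class QueueLike(Protocol):
    """Anything with an RQ-style enqueue method, for example rq.Queue."""

    def enqueue(self, f: Any, *args: Any, **kwargs: Any) -> Any: ...


def enqueue_pdf_extraction(queue: QueueLike, document_ref: str, timeout_s: int = 600) -> Any:
    """Enqueue the worker-side job by string reference.

    `document_ref` and `timeout_s` are hypothetical parameters; only the
    job path comes from the PR description.
    """
    return queue.enqueue(EXTRACT_JOB_PATH, document_ref, job_timeout=timeout_s)
```

With RQ installed, an `rq.Queue("pdf_queue", connection=...)` instance could be passed as `queue`. Taking the queue as a parameter also lets the function be exercised with a fake queue in unit tests.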
🏗️ Architecture Changes
Before (HTTP-based)
After (RQ-based)
✨ Key Features
📁 Files Added
Core RQ Integration
- `src/extralit_server/contexts/ocr/rq_client.py` - Direct RQ client for `pdf_queue`
- `src/extralit_server/jobs/pdf_extraction_jobs.py` - PDF extraction orchestration with RQ polling
Enhanced Pipeline
- `src/extralit_server/jobs/document_jobs.py` - Integrated RQ extraction into document processing
- `pyproject.toml` - Added Redis dependency
🔧 Configuration
Environment Variables
🧪 Testing
Quick Test
Integration Testing
The RQ client includes comprehensive testing utilities for:
🔄 Backward Compatibility
📋 Implementation Details
Direct RQ Communication
- `"src.jobs.extraction_jobs.extract_pdf_markdown_job"`
- `pdf_queue` as extralit-hf-space worker
Smart Extraction Policies
Error Handling
🔗 Related Changes
This PR works in conjunction with:
- `feat/rq-pymupdf-integration` branch - Pure RQ worker implementation
Branch: `feat/rq-pymupdf-integration`
Type: Feature
Breaking Changes: None
Dependencies: Redis server, extralit-hf-space RQ worker