[FEAT] Document workflows with rq jobs #136
Merged
…lit into feat/ocr-rq-jobs-workflow
…nd improve document handling for uploads without associated files
542af45 to 72c6dec
…able Extraction, and Embedding classes to be optional strings, enhancing flexibility in handling completion timestamps.
…#135)
* fix: apply code formatting and linting fixes
  - Fix trailing whitespace and formatting issues
  - Apply ruff formatting to RQ client and text modules
  - Ensure code follows project style guidelines
* fix: apply formatting and linting to PDF extraction pipeline
  - Fix trailing whitespace and code formatting issues
  - Apply ruff formatting to pdf_extraction_jobs.py and document_jobs.py
  - Ensure code follows project style guidelines
* fix: apply final ruff formatting fixes
* fix: update default extraction queue name to 'pdf_queue'
* refactor: remove unused PDF extraction logic from document upload job
* added pdf extraction
* feat: add PDF_QUEUE for PyMuPDF extraction jobs
* feat: integrate PyMuPDF extraction job into PDF workflow
* fix: add missing Optional import in jobs.py and add queue test
* test: organize all test scripts into tests/ folder with comprehensive runners
* refactor: update RQ client integration and clean up job handling in PDF workflow
  - Updated `pyproject.toml` to use `rq` version 2.4.1.
  - Removed the `rq_client.py` file and its associated functions to streamline job management.
  - Adjusted job handling in `jobs.py` and `pdf.py` to reflect the removal of the RQ client, ensuring proper job enqueueing and dependency management.
  - Cleaned up unused imports and improved type handling in `text.py`.
* refactor: remove PDF_QUEUE and streamline PDF extraction workflow
  - Deleted the `pdf_extraction_jobs.py` file to simplify job orchestration.
  - Removed `PDF_QUEUE` from the job queues, transitioning to `DEFAULT_QUEUE` for PDF extraction tasks.
  - Updated `text.py` to eliminate unused functions and improve code clarity.
  - Adjusted the PDF workflow to reflect changes in job handling and ensure proper integration with the new structure.
* renamed PDF_QUEUE to PDF_OCR_QUEUE

---------

Co-authored-by: JonnyTran <nhat.c.tran@gmail.com>
- Introduced OCR_QUEUE to manage OCR-related jobs.
- Updated worker options to include OCR_QUEUE for enhanced queue listening.
- Refactored PDF workflow to utilize OCR_QUEUE for text extraction jobs, replacing the previous PDF_OCR_QUEUE reference.
- Updated the ImportHistory model and associated database migration to reflect the new table name 'imports'.
- Adjusted references in the database model and migration scripts accordingly.
- Replace job_ids with group_id and status in DocumentWorkflow
- Implement workflow status and job queries using RQ Groups
- Update process_bulk_upload and create_document_workflow for group-based orchestration
- Add workflow status helpers and resumability methods to model
- Update API schemas and Alembic migration for new fields
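To illustrate the group-based status tracking this commit describes, here is a minimal sketch of collapsing the per-job statuses of one workflow group into a single workflow status. The function name `summarize_workflow`, the workflow-level labels ("pending", "running", "completed", "failed", "unknown"), and the precedence rule are assumptions made for illustration, not the PR's actual helpers; only the per-job status strings are taken from RQ's job status values.

```python
from collections import Counter
from typing import Iterable

# Per-job status strings as used by RQ (values of rq.job.JobStatus).
ACTIVE = {"queued", "started", "deferred", "scheduled"}
FAILED = {"failed", "stopped", "canceled"}


def summarize_workflow(job_statuses: Iterable[str]) -> str:
    """Collapse per-job statuses into one workflow status.

    Hypothetical rule: any failure wins, then any active job,
    then "completed" only if every job has finished.
    """
    counts = Counter(job_statuses)
    if not counts:
        return "pending"  # no jobs are known for this group yet
    if any(counts[s] for s in FAILED):
        return "failed"
    if any(counts[s] for s in ACTIVE):
        return "running"
    if counts["finished"] == sum(counts.values()):
        return "completed"
    return "unknown"  # a status string this sketch does not recognize
```

In a real integration the status strings would presumably come from the jobs in the workflow's RQ Group; this sketch takes them as plain strings so it can run without Redis.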
- Introduced TextExtractionMetadata class for tracking text extraction results.
- Updated process_bulk_upload to use type hints for better clarity.
- Refactored document upload handling to streamline document creation and job enqueuing.
- Changed job queue from DEFAULT_QUEUE to OCR_QUEUE for text extraction tasks.
…ment_workflow return type
- Changed job_ids in DocumentsBulkResponse from dict[str, Any] to dict[str, str] for better clarity.
- Refactored create_document_workflow to return an RQ Group instead of a dictionary, simplifying the workflow tracking process.
- Implement FastAPI endpoints for workflow start, status, restart, and list
- Add Pydantic schemas for workflow API requests and responses
- Integrate workflow router into API routes
- Add CLI commands for workflow start, status, restart, and list with Rich output
- Extend workflow context for RQ Groups operations and error handling
- Add unit tests for jobs API with RQ Groups integration
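As a rough illustration of what a workflow restart might need to decide, the sketch below picks which jobs to re-enqueue from a mapping of job ID to RQ status string. The function name `jobs_to_restart` and the choice to treat "stopped" and "canceled" as retryable alongside "failed" are assumptions; the restart logic in this PR is not shown here and may differ.

```python
# Statuses treated as retryable in this sketch (an assumption, not the PR's rule).
RETRYABLE = {"failed", "stopped", "canceled"}


def jobs_to_restart(job_statuses: dict[str, str]) -> list[str]:
    """Return the IDs of jobs whose status marks them as retryable.

    Input order is preserved, so callers can re-enqueue jobs
    in the order the workflow originally listed them.
    """
    return [
        job_id
        for job_id, status in job_statuses.items()
        if status in RETRYABLE
    ]
```

Jobs that are still "deferred" are left alone in this sketch, on the assumption that they are waiting on a dependency rather than needing a retry themselves.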
This pull request introduces a comprehensive specification and implementation plan for a new PDF Workflow Orchestrator using Redis Queue (RQ) native features, and begins the database migration work to support document workflow tracking. It also upgrades the RQ dependency to the latest major version and adds documentation for developers and AI code assistants. The most important changes are summarized below:
PDF Workflow Orchestrator Specification and Planning
Database and Model Changes
- Added a migration for the `workflows` table, which tracks document processing workflows, including job IDs, status, and links to documents. (extralit-server/src/extralit_server/alembic/versions/54d65879a68e_create_document_workflows_table.py)
Dependency Updates
- Upgraded the `rq` Python package from version 1.16.2 to 2.4.1 to leverage new features and maintain compatibility with the orchestrator design. (extralit-server/pyproject.toml)
Developer and AI Assistant Documentation
- Added `CLAUDE.md` files at both the repository root and within `extralit-server/` to provide architecture overviews, development workflows, and guidance for contributors and AI code assistants. (CLAUDE.md, extralit-server/CLAUDE.md)