
[FEAT] Document workflows with rq jobs #136

Merged: JonnyTran merged 44 commits into develop from feat/ocr-rq-jobs-workflow on Aug 22, 2025
@JonnyTran
Member

This pull request introduces a comprehensive specification and implementation plan for a new PDF Workflow Orchestrator using Redis Queue (RQ) native features, and begins the database migration work to support document workflow tracking. It also upgrades the RQ dependency to the latest major version and adds documentation for developers and AI code assistants. The most important changes are summarized below:


PDF Workflow Orchestrator Specification and Planning

  • Added a detailed requirements document outlining the use of RQ's native job chaining, metadata, job groups, and queue management to orchestrate multi-step PDF processing workflows, including analysis, preprocessing, OCR, extraction, and embedding. (.kiro/specs/pdf-workflow-orchestrator/requirements.md)
  • Provided a multi-phase implementation plan with specific tasks for refactoring jobs, API enhancements, workflow monitoring, CLI tools, and error handling, mapped to the requirements. (.kiro/specs/pdf-workflow-orchestrator/tasks.md)
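As a rough illustration of the job chaining the requirements describe, the sketch below enqueues each processing step with the `depends_on` and `meta` arguments of RQ's `Queue.enqueue`, so a step only runs after the previous one finishes. The step functions, the `workflow_step` meta key, and `enqueue_pdf_workflow` are hypothetical names chosen for illustration, not the code in this PR; `queue` is assumed to be an `rq.Queue`.

```python
from typing import Any, Callable, List, Optional


# Hypothetical placeholder steps; the real jobs live in extralit-server.
def analyze_pdf(document_id: str) -> None: ...
def preprocess_pdf(document_id: str) -> None: ...
def run_ocr(document_id: str) -> None: ...
def extract_tables(document_id: str) -> None: ...
def embed_document(document_id: str) -> None: ...


# Step order taken from the description: analysis, preprocessing, OCR,
# extraction, embedding.
STEPS: List[Callable[[str], None]] = [
    analyze_pdf,
    preprocess_pdf,
    run_ocr,
    extract_tables,
    embed_document,
]


def enqueue_pdf_workflow(queue: Any, document_id: str) -> List[Any]:
    """Enqueue every step so that each job depends on the one before it.

    `queue` is assumed to behave like `rq.Queue`: `enqueue` accepts the
    `depends_on` and `meta` keyword arguments and returns a job object.
    """
    jobs: List[Any] = []
    previous: Optional[Any] = None
    for step in STEPS:
        job = queue.enqueue(
            step,
            document_id,
            depends_on=previous,  # None for the first step
            meta={"document_id": document_id, "workflow_step": step.__name__},
        )
        jobs.append(job)
        previous = job
    return jobs
```

With a real connection this would be called as `enqueue_pdf_workflow(Queue(connection=redis_conn), "doc-1")`; RQ keeps dependent jobs deferred until their parent finishes.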

Database and Model Changes

  • Introduced a new Alembic migration to create a workflows table for tracking document processing workflows, including job IDs, status, and links to documents. (extralit-server/src/extralit_server/alembic/versions/54d65879a68e_create_document_workflows_table.py)
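To make the shape of such a tracking table concrete, here is a standalone SQLite sketch. It is not the Alembic migration from this PR: the column names (`id`, `document_id`, `status`, `job_ids`), the `'pending'` default, and both helper functions are assumptions for illustration only.

```python
import json
import sqlite3
from typing import Dict

# Hypothetical schema: one row per document workflow, with a status and
# the job IDs stored as a JSON object keyed by step name.
WORKFLOWS_DDL = """
CREATE TABLE workflows (
    id TEXT PRIMARY KEY,
    document_id TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    job_ids TEXT NOT NULL DEFAULT '{}'
)
"""


def create_workflows_table(conn: sqlite3.Connection) -> None:
    """Create the illustrative workflows table."""
    conn.execute(WORKFLOWS_DDL)


def record_workflow(
    conn: sqlite3.Connection,
    workflow_id: str,
    document_id: str,
    job_ids: Dict[str, str],
) -> None:
    """Insert a workflow row linking a document to its job IDs."""
    conn.execute(
        "INSERT INTO workflows (id, document_id, job_ids) VALUES (?, ?, ?)",
        (workflow_id, document_id, json.dumps(job_ids)),
    )
```

A real migration would express the same idea through Alembic's `op.create_table` with a foreign key to the documents table.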

Dependency Updates

  • Upgraded the rq Python package from version 1.16.2 to 2.4.1 to leverage new features and maintain compatibility with the orchestrator design. (extralit-server/pyproject.toml)

Developer and AI Assistant Documentation

  • Added CLAUDE.md files at both the repository root and within extralit-server/ to provide architecture overviews, development workflows, and guidance for contributors and AI code assistants. (CLAUDE.md, extralit-server/CLAUDE.md)

@JonnyTran JonnyTran requested review from a team as code owners August 18, 2025 07:31
@JonnyTran JonnyTran self-assigned this Aug 18, 2025
@JonnyTran JonnyTran marked this pull request as draft August 18, 2025 07:38
@JonnyTran JonnyTran force-pushed the feat/ocr-rq-jobs-workflow branch from 542af45 to 72c6dec Compare August 19, 2025 21:40
…able Extraction, and Embedding classes to be optional strings, enhancing flexibility in handling completion timestamps.
priyankeshh and others added 3 commits August 19, 2025 23:52
…#135)

* fix: apply code formatting and linting fixes

- Fix trailing whitespace and formatting issues
- Apply ruff formatting to RQ client and text modules
- Ensure code follows project style guidelines

* fix: apply formatting and linting to PDF extraction pipeline

- Fix trailing whitespace and code formatting issues
- Apply ruff formatting to pdf_extraction_jobs.py and document_jobs.py
- Ensure code follows project style guidelines

* fix: apply final ruff formatting fixes

* fix: update default extraction queue name to 'pdf_queue'

* refactor: remove unused PDF extraction logic from document upload job

* added pdf extraction

* feat: add PDF_QUEUE for PyMuPDF extraction jobs

* feat: integrate PyMuPDF extraction job into PDF workflow

* fix: add missing Optional import in jobs.py and add queue test

* test: organize all test scripts into tests/ folder with comprehensive runners

* refactor: update RQ client integration and clean up job handling in PDF workflow

- Updated `pyproject.toml` to use `rq` version 2.4.1.
- Removed the `rq_client.py` file and its associated functions to streamline job management.
- Adjusted job handling in `jobs.py` and `pdf.py` to reflect the removal of the RQ client, ensuring proper job enqueueing and dependency management.
- Cleaned up unused imports and improved type handling in `text.py`.

* refactor: remove PDF_QUEUE and streamline PDF extraction workflow

- Deleted the `pdf_extraction_jobs.py` file to simplify job orchestration.
- Removed `PDF_QUEUE` from the job queues, transitioning to `DEFAULT_QUEUE` for PDF extraction tasks.
- Updated `text.py` to eliminate unused functions and improve code clarity.
- Adjusted the PDF workflow to reflect changes in job handling and ensure proper integration with the new structure.

* renamed PDF_QUEUE to PDF_OCR_QUEUE

---------

Co-authored-by: JonnyTran <nhat.c.tran@gmail.com>
- Introduced OCR_QUEUE to manage OCR-related jobs.
- Updated worker options to include OCR_QUEUE for enhanced queue listening.
- Refactored PDF workflow to utilize OCR_QUEUE for text extraction jobs, replacing the previous PDF_OCR_QUEUE reference.
@JonnyTran JonnyTran marked this pull request as ready for review August 21, 2025 07:36
- Updated the ImportHistory model and associated database migration to reflect the new table name 'imports'.
- Adjusted references in the database model and migration scripts accordingly.
JonnyTran and others added 14 commits August 21, 2025 10:10
- Replace job_ids with group_id and status in DocumentWorkflow
- Implement workflow status and job queries using RQ Groups
- Update process_bulk_upload and create_document_workflow for group-based orchestration
- Add workflow status helpers and resumability methods to model
- Update API schemas and Alembic migration for new fields
- Introduced TextExtractionMetadata class for tracking text extraction results.
- Updated process_bulk_upload to use type hints for better clarity.
- Refactored document upload handling to streamline document creation and job enqueuing.
- Changed job queue from DEFAULT_QUEUE to OCR_QUEUE for text extraction tasks.
…ment_workflow return type

- Changed job_ids in DocumentsBulkResponse from dict[str, Any] to dict[str, str] for better clarity.
- Refactored create_document_workflow to return an RQ Group instead of a dictionary, simplifying the workflow tracking process.
- Implement FastAPI endpoints for workflow start, status, restart, and list
- Add Pydantic schemas for workflow API requests and responses
- Integrate workflow router into API routes
- Add CLI commands for workflow start, status, restart, and list with Rich output
- Extend workflow context for RQ Groups operations and error handling
- Add unit tests for jobs API with RQ Groups integration
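The group-based status tracking mentioned in the commits above can be pictured as folding the per-job statuses of a workflow into one overall status. The sketch below is a minimal, assumed version of such a helper: the input strings are RQ's job status values (`queued`, `started`, `deferred`, `finished`, `failed`, `stopped`, `canceled`), while the function name and the workflow-level labels (`pending`, `running`, `completed`, `failed`) are hypothetical and may differ from what the PR implements.

```python
from typing import Iterable

# RQ job statuses that this sketch treats as a failed workflow.
FAILURE_STATUSES = {"failed", "stopped", "canceled"}


def summarize_workflow_status(job_statuses: Iterable[str]) -> str:
    """Collapse the statuses of a workflow's jobs into one label."""
    statuses = list(job_statuses)
    if not statuses:
        # No jobs enqueued yet.
        return "pending"
    if any(status in FAILURE_STATUSES for status in statuses):
        return "failed"
    if all(status == "finished" for status in statuses):
        return "completed"
    if "started" in statuses or "finished" in statuses:
        # Some work is done or in progress, but not all of it.
        return "running"
    # Only queued, deferred, or scheduled jobs remain.
    return "pending"
```

Treating any failed job as a failed workflow is one possible design choice; a resumable workflow might instead report which step failed so that only that step is restarted.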
@JonnyTran JonnyTran marked this pull request as ready for review August 21, 2025 23:38
@JonnyTran JonnyTran requested a review from a team as a code owner August 21, 2025 23:38
@JonnyTran JonnyTran merged commit 0c14850 into develop Aug 22, 2025
11 checks passed