Conversation

@amindadgar amindadgar commented May 7, 2025

  • Introduced IngestionWorkflow for orchestrating document ingestion requests.
  • Added process_document activity to handle document processing logic.
  • Created schema for IngestionRequest to define the structure of ingestion requests.
  • Updated registry and workflows to include new ingestion components.

Summary by CodeRabbit

  • New Features

    • Introduced a workflow for document ingestion requests, enabling asynchronous document processing with retry and timeout controls.
    • Added a new data model for ingestion requests, supporting flexible metadata handling and collection naming.
  • Chores

    • Updated internal registry to include the new ingestion workflow and processing activity.
    • Removed legacy test workflow and related activities.
    • Minor documentation cleanup.


coderabbitai bot commented May 7, 2025

Walkthrough

A new Temporal-based ingestion workflow is introduced, including the IngestionWorkflow class and the process_document activity, to orchestrate document ingestion using a structured request model. Supporting changes add the IngestionRequest schema, update imports, and register the new workflow and activity in the system. Minor whitespace cleanup is also performed.

Changes

File(s) | Change Summary
hivemind_etl/simple_ingestion/pipeline.py | New file defining IngestionWorkflow (Temporal workflow) and the process_document activity for document ingestion.
hivemind_etl/simple_ingestion/schema.py | Added a new Pydantic model, IngestionRequest, representing document ingestion requests with typed fields and defaults.
registry.py | Registered process_document in ACTIVITIES and IngestionWorkflow in WORKFLOWS; updated imports accordingly.
hivemind_etl/activities.py, workflows.py | Updated imports to include process_document and IngestionWorkflow respectively; removed the old SayHello workflow.
hivemind_etl/website/website_etl.py | Removed a trailing-whitespace line in the WebsiteETL.__init__ docstring.
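Based on the fields referenced throughout this review (communityId, platformId, collectionName, docId, text, metadata, and the excluded metadata keys), the request model looks roughly like the following. This is a stdlib-dataclass stand-in for the actual Pydantic IngestionRequest; the field names come from the diffs below, while the defaults are assumptions:

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class IngestionRequest:
    communityId: str
    platformId: str
    text: str
    # None means the pipeline derives "[communityId]_[platformId]".
    collectionName: Optional[str] = None
    # default_factory gives each request its own UUID instead of a
    # single value computed once at import time.
    docId: str = field(default_factory=lambda: str(uuid.uuid4()))
    metadata: dict = field(default_factory=dict)
    excludedEmbedMetadataKeys: list = field(default_factory=list)
    excludedLlmMetadataKeys: list = field(default_factory=list)

req = IngestionRequest(communityId="c1", platformId="p1", text="hello")
print(req.collectionName is None)  # True
```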

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Temporal
    participant IngestionWorkflow
    participant process_document
    participant CustomIngestionPipeline

    Client->>Temporal: Start IngestionWorkflow.run(ingestion_request)
    Temporal->>IngestionWorkflow: Invoke run(ingestion_request)
    IngestionWorkflow->>process_document: Call with ingestion_request (with retry policy)
    process_document->>CustomIngestionPipeline: Initialize with communityId, collectionName
    process_document->>CustomIngestionPipeline: Ingest Document (docId, text, metadata, exclusions)
    CustomIngestionPipeline-->>process_document: Return
    process_document-->>IngestionWorkflow: Return
    IngestionWorkflow-->>Temporal: Workflow complete
    Temporal-->>Client: Completion notification

Poem

A workflow hops into the light,
To process docs both day and night.
With schemas new and pipelines bright,
Ingestion’s path is now just right.
Activities registered, imports in tow—
The ETL garden begins to grow!
🐇✨


@amindadgar amindadgar linked an issue May 7, 2025 that may be closed by this pull request

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (5)
hivemind_etl/simple_ingestion/schema.py (2)

27-29: Improve docstring clarity

The docstring for collectionName could be clearer by rephrasing the default value explanation.

-        Default is `None` means it would follow the default pattern of `[communityId]_[platformId]`
+        Default is `None`. When None, the collection name will follow the pattern `[communityId]_[platformId]`
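The fallback naming described in that docstring can be expressed as a small helper (hypothetical; the actual code inlines this logic in process_document):

```python
from typing import Optional

def resolve_collection_name(
    community_id: str, platform_id: str, collection_name: Optional[str] = None
) -> str:
    # When no explicit name is given, follow the documented
    # "[communityId]_[platformId]" pattern.
    if collection_name is not None:
        return collection_name
    return f"{community_id}_{platform_id}"

print(resolve_collection_name("c1", "p1"))            # c1_p1
print(resolve_collection_name("c1", "p1", "custom"))  # custom
```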

36-36: UUID default is evaluated only once

`docId: str = str(uuid4())` runs a single time when the class body is executed, so every IngestionRequest that omits docId ends up with the same UUID. Use a per-instance default factory instead:

-    docId: str = str(uuid4())
+    docId: str = Field(default_factory=lambda: str(uuid4()))

(requires `from pydantic import Field`)

This still allows passing an existing ID when needed while generating a fresh UUID for each request that does not provide one.
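The pitfall here is that a class-level default is evaluated once, at class-definition time. The same behavior is easy to demonstrate with stdlib dataclasses (illustrative names, not the project's actual classes):

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class BadRequest:
    # Evaluated once when the class body runs: every instance shares this ID.
    doc_id: str = str(uuid.uuid4())

@dataclass
class GoodRequest:
    # Evaluated per instance: each request gets a fresh ID.
    doc_id: str = field(default_factory=lambda: str(uuid.uuid4()))

print(BadRequest().doc_id == BadRequest().doc_id)    # True
print(GoodRequest().doc_id == GoodRequest().doc_id)  # False
```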

hivemind_etl/simple_ingestion/pipeline.py (3)

37-48: Consider adding error handling and result feedback.

The workflow execution is well-configured with appropriate retry policies and timeouts. However, there's no error handling or status feedback mechanism to indicate success or failure.

Consider enhancing the workflow to return a status object:

@workflow.run
-async def run(self, ingestion_request: IngestionRequest) -> None:
+async def run(self, ingestion_request: IngestionRequest) -> dict:
    # ... existing code ...
    
    try:
        await execute_activity(
            process_document,
            ingestion_request,
            retry_policy=retry_policy,
            start_to_close_timeout=timedelta(minutes=5),
        )
+       return {"status": "success", "request_id": ingestion_request.docId}
+   except Exception as e:
+       workflow.logger.error(f"Ingestion failed: {str(e)}")
+       return {"status": "failed", "request_id": ingestion_request.docId, "error": str(e)}

51-67: Update activity docstring to reflect actual implementation.

The docstring mentions "This activity will be implemented by the user" but the implementation is already provided in this file.

@workflow.activity
async def process_document(
    ingestion_request: IngestionRequest,
) -> None:
    """Process the document according to the ingestion request specifications.

    Parameters
    ----------
    ingestion_request : IngestionRequest
        The request containing all necessary information for document processing,
        including community ID, platform ID, text content, and metadata.

    Notes
    -----
-   This activity will be implemented by the user to handle the actual document
-   processing logic, including any necessary embedding or LLM operations.
+   This activity handles document processing by initializing a CustomIngestionPipeline
+   with the appropriate collection name and running it on the document created from
+   the ingestion request data.
    """

1-88: Consider adding activity logging for observability.

The implementation would benefit from adding logging statements to track the progress and completion of activities for better observability in a production environment.

+import logging
+
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy
from temporalio.workflow import execute_activity
from .schema import IngestionRequest
from tc_hivemind_backend.ingest_qdrant import CustomIngestionPipeline
from llama_index.core import Document

+logger = logging.getLogger(__name__)
+

@workflow.defn
class IngestionWorkflow:
    # ...existing code...

@workflow.activity
async def process_document(
    ingestion_request: IngestionRequest,
) -> None:
    # ...existing docstring...
+   logger.info(f"Starting document processing for doc_id: {ingestion_request.docId}")
    
    if ingestion_request.collectionName is None:
        collection_name = (
            f"{ingestion_request.communityId}_{ingestion_request.platformId}"
        )
+       logger.debug(f"Using default collection name: {collection_name}")
    else:
        collection_name = ingestion_request.collectionName
+       logger.debug(f"Using provided collection name: {collection_name}")

    # Initialize the ingestion pipeline
+   logger.debug(f"Initializing pipeline for community: {ingestion_request.communityId}")
    pipeline = CustomIngestionPipeline(
        community_id=ingestion_request.communityId,
        collectionName=collection_name,
    )

    document = Document(
        doc_id=ingestion_request.docId,
        text=ingestion_request.text,
        metadata=ingestion_request.metadata,
    )
+   logger.debug(f"Created document with ID: {document.doc_id}")

    try:
        pipeline.run_pipeline(docs=[document])
+       logger.info(f"Successfully processed document: {ingestion_request.docId}")
    except Exception as e:
+       logger.error(f"Failed to process document {ingestion_request.docId}: {str(e)}")
+       raise
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a2b8526 and b791eab.

📒 Files selected for processing (6)
  • hivemind_etl/activities.py (1 hunks)
  • hivemind_etl/simple_ingestion/pipeline.py (1 hunks)
  • hivemind_etl/simple_ingestion/schema.py (1 hunks)
  • hivemind_etl/website/website_etl.py (1 hunks)
  • registry.py (4 hunks)
  • workflows.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (4)
hivemind_etl/activities.py (1)
hivemind_etl/simple_ingestion/pipeline.py (1)
  • process_document (52-87)
workflows.py (1)
hivemind_etl/simple_ingestion/pipeline.py (1)
  • IngestionWorkflow (12-48)
registry.py (1)
hivemind_etl/simple_ingestion/pipeline.py (2)
  • process_document (52-87)
  • IngestionWorkflow (12-48)
hivemind_etl/simple_ingestion/pipeline.py (2)
hivemind_etl/simple_ingestion/schema.py (1)
  • IngestionRequest (5-39)
workflows.py (1)
  • run (30-34)
🪛 Ruff (0.8.2)
hivemind_etl/activities.py

17-17: hivemind_etl.simple_ingestion.pipeline.process_document imported but unused

Remove unused import: hivemind_etl.simple_ingestion.pipeline.process_document

(F401)

workflows.py

14-14: hivemind_etl.simple_ingestion.pipeline.IngestionWorkflow imported but unused

Remove unused import: hivemind_etl.simple_ingestion.pipeline.IngestionWorkflow

(F401)

⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: ci / lint / Lint
  • GitHub Check: ci / test / Test
🔇 Additional comments (13)
hivemind_etl/website/website_etl.py (1)

22-22: Clean code improvement!

Removal of unnecessary trailing whitespace in the docstring improves code cleanliness.

hivemind_etl/activities.py (1)

16-18:


✅ Verification successful

Import added for new ingestion capabilities

The import of process_document from the new simple_ingestion.pipeline module is necessary for the activity to be available for registration in the registry, even though it appears unused in this file directly.


🏁 Script executed:

#!/bin/bash
# Check where the imported process_document is used
rg -A 2 "process_document" --type py

Length of output: 739


🏁 Script executed:

#!/bin/bash
# Show import and registry list in registry.py to confirm usage of process_document
echo "== registry.py imports and registry definitions =="
sed -n '1,50p' registry.py
echo "..."
sed -n '50,100p' registry.py

# Show decorators or registration logic in pipeline.py around process_document
echo "== pipeline.py around process_document definition =="
sed -n '1,200p' hivemind_etl/simple_ingestion/pipeline.py | sed -n '1,100p'

Length of output: 4380


Import confirmed for activity registration
The process_document function is defined in hivemind_etl/simple_ingestion/pipeline.py and must be re-exported via hivemind_etl/activities.py so that registry.py can import and include it in its ACTIVITIES list. The current import is correct and no changes are needed.

🧰 Tools
🪛 Ruff (0.8.2)

17-17: hivemind_etl.simple_ingestion.pipeline.process_document imported but unused

Remove unused import: hivemind_etl.simple_ingestion.pipeline.process_document

(F401)

workflows.py (1)

13-15:


✅ Verification successful

Import added for new workflow registration

The import of IngestionWorkflow is necessary for the workflow to be available for registration in the registry system, even though it appears unused in this file directly.


🏁 Script executed:

#!/bin/bash
# Check where IngestionWorkflow is used
rg -A 2 "IngestionWorkflow" --type py

Length of output: 525


🏁 Script executed:

#!/bin/bash
# Display the contents of workflows.py around the imports and workflow list
sed -n '1,200p' workflows.py

Length of output: 912


🏁 Script executed:

#!/bin/bash
# Display the first 200 lines of registry.py to inspect workflow registration
sed -n '1,200p' registry.py

Length of output: 1237


✅ Confirmed: IngestionWorkflow import is required for registry registration

The added import in workflows.py ensures that registry.py can successfully import and include IngestionWorkflow in the WORKFLOWS list. No further changes are needed.

• workflows.py: lines 13–15 – import of IngestionWorkflow
• registry.py: imports and lists IngestionWorkflow in WORKFLOWS

🧰 Tools
🪛 Ruff (0.8.2)

14-14: hivemind_etl.simple_ingestion.pipeline.IngestionWorkflow imported but unused

Remove unused import: hivemind_etl.simple_ingestion.pipeline.IngestionWorkflow

(F401)

registry.py (4)

11-11: New activity imported correctly

The process_document activity is properly imported from activities, allowing it to be registered in the system.


24-24: New workflow imported correctly

The IngestionWorkflow is correctly imported from workflows, allowing it to be registered in the system.


33-33: Workflow registration complete

The IngestionWorkflow is properly added to the WORKFLOWS list, making it available to the Temporal orchestration framework.


49-49: Activity registration complete

The process_document activity is properly added to the ACTIVITIES list, making it available to the Temporal orchestration framework.

hivemind_etl/simple_ingestion/schema.py (1)

1-40: Well-structured data model for ingestion requests

The IngestionRequest Pydantic model is well-defined with comprehensive documentation and appropriate default values.

hivemind_etl/simple_ingestion/pipeline.py (5)

1-9: Import organization looks good.

The imports are well-organized, with standard library imports first, followed by third-party imports and local imports. Clear separation between import groups enhances readability.


11-18: Well-structured workflow definition with clear documentation.

The workflow definition follows Temporal best practices with appropriate decorators and documentation. The class docstring clearly explains the purpose of the workflow.


19-36: Comprehensive documentation for the run method.

The docstring provides detailed information about parameters and implementation notes, which is excellent for maintainability and developer onboarding.


68-74: Collection name determination logic is correct.

The code properly handles both cases where a collection name is explicitly provided or needs to be constructed from community and platform IDs.


75-88:


❓ Verification inconclusive

Add error handling and utilize all ingestion request properties.

The activity doesn't handle potential exceptions from the pipeline and doesn't utilize the excluded metadata keys from the ingestion request.

    # Initialize the ingestion pipeline
    pipeline = CustomIngestionPipeline(
        community_id=ingestion_request.communityId,
        collectionName=collection_name,
+       excluded_embed_metadata_keys=ingestion_request.excludedEmbedMetadataKeys,
+       excluded_llm_metadata_keys=ingestion_request.excludedLlmMetadataKeys,
    )

    document = Document(
        doc_id=ingestion_request.docId,
        text=ingestion_request.text,
        metadata=ingestion_request.metadata,
    )

+   try:
        pipeline.run_pipeline(docs=[document])
+       return {"status": "success", "doc_id": ingestion_request.docId}
+   except Exception as e:
+       import logging
+       logging.error(f"Document processing failed: {str(e)}")
+       raise  # Re-raise to trigger Temporal's retry mechanism

Please verify that the CustomIngestionPipeline class accepts excluded_embed_metadata_keys and excluded_llm_metadata_keys parameters:


🏁 Script executed:

#!/bin/bash
# Check if CustomIngestionPipeline accepts excluded metadata keys parameters
rg -A 10 "class CustomIngestionPipeline" tc_hivemind_backend/

Length of output: 124


I couldn’t find CustomIngestionPipeline under the path you tried. Let’s search the entire repository for its definition:


🏁 Script executed:

#!/bin/bash
# Locate CustomIngestionPipeline class definition and signature
rg -n "class CustomIngestionPipeline" -C 5 .

Length of output: 46


Let’s locate any pipeline.py files and inspect their class definitions and constructors:


🏁 Script executed:

#!/bin/bash
# Find pipeline.py files
echo "pipeline.py files:"
fd '^pipeline\.py$'

# For each pipeline.py, list class definitions and __init__ signatures
for f in $(fd '^pipeline\.py$'); do
  echo "---- $f ----"
  rg -n "^class " -C 3 "$f"
  rg -n "def __init__" -C 3 "$f"
done

Length of output: 628


Let’s search the entire repo for CustomIngestionPipeline definitions and usages:


🏁 Script executed:

#!/bin/bash
# Look for class definition
rg -n "class CustomIngestionPipeline" -C 5 .

# Look for any usage or import sites to infer its constructor signature
rg -n "CustomIngestionPipeline" -C 5 .

Length of output: 8199


Verify CustomIngestionPipeline constructor parameters

I wasn’t able to find the CustomIngestionPipeline class definition in this repo—it’s imported from tc_hivemind_backend.ingest_qdrant, which appears to be an external module. Please confirm that its __init__ signature supports the parameters:

  • excluded_embed_metadata_keys
  • excluded_llm_metadata_keys


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b791eab and a6463ff.

📒 Files selected for processing (3)
  • hivemind_etl/simple_ingestion/pipeline.py (1 hunks)
  • registry.py (3 hunks)
  • workflows.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • registry.py
  • hivemind_etl/simple_ingestion/pipeline.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
workflows.py (1)
hivemind_etl/simple_ingestion/pipeline.py (1)
  • IngestionWorkflow (14-50)
🪛 Ruff (0.8.2)
workflows.py

14-14: hivemind_etl.simple_ingestion.pipeline.IngestionWorkflow imported but unused

Remove unused import: hivemind_etl.simple_ingestion.pipeline.IngestionWorkflow

(F401)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: ci / test / Test

@amindadgar amindadgar merged commit 525bae0 into main May 8, 2025
3 checks passed


Development

Successfully merging this pull request may close these issues.

Define ingestion workflow

2 participants