feat: Worker failure detection and recovery for fault tolerance #3

@yasha-dev1

Summary

Implement worker failure detection and automatic recovery for both transient and durable workflows in PyWorkflow with Celery.

Problem

When a Celery worker crashes mid-execution:

  • The workflow run stays in RUNNING status indefinitely (orphaned)
  • Application failures (bugs) cannot be told apart from infrastructure failures (worker crash)
  • The current config (task_reject_on_worker_lost=True) requeues the task, but the workflow state is never updated
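
For context, Celery's task_reject_on_worker_lost setting takes effect together with late acknowledgement. The fragment below is a minimal sketch of such a configuration; the issue only confirms task_reject_on_worker_lost=True, so task_acks_late=True and the name CELERY_RECOVERY_SETTINGS are assumptions about the project's setup.

```python
# Assumed Celery settings for requeue-on-worker-loss behaviour.
# Only task_reject_on_worker_lost=True is stated in this issue;
# task_acks_late=True is an assumption about the rest of the config.
CELERY_RECOVERY_SETTINGS = {
    # Acknowledge the message after the task finishes, not before it starts.
    "task_acks_late": True,
    # Requeue the message if the worker process executing it dies abruptly.
    "task_reject_on_worker_lost": True,
}

# Typical usage with an existing Celery app object (not created here):
# app.conf.update(CELERY_RECOVERY_SETTINGS)
```

Requeueing alone only redelivers the Celery message; nothing in these settings touches the workflow run record, which is the gap this issue addresses.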

Proposed Solution

1. New Event Type: WORKFLOW_INTERRUPTED

Introduce a new event type to distinguish infrastructure failures from application errors:

from enum import Enum

class EventType(str, Enum):
    WORKFLOW_INTERRUPTED = "workflow.interrupted"  # new member; existing event types omitted

Event data:

{
    "reason": "worker_lost" | "timeout" | "signal",
    "worker_id": str,
    "last_event_sequence": int,
    "error": str | None,
    "recoverable": bool
}
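
The payload above could be typed and built through a small helper. This is a minimal sketch, assuming Python 3.8+; InterruptedEventData and make_interrupted_event_data are hypothetical names rather than PyWorkflow's actual API, and defaulting recoverable to True is an assumption.

```python
from typing import Literal, Optional, TypedDict, get_args

# The three reasons listed in the event data above.
InterruptReason = Literal["worker_lost", "timeout", "signal"]


class InterruptedEventData(TypedDict):
    """Payload of a WORKFLOW_INTERRUPTED event (hypothetical type name)."""
    reason: InterruptReason
    worker_id: str
    last_event_sequence: int
    error: Optional[str]
    recoverable: bool


def make_interrupted_event_data(
    reason: str,
    worker_id: str,
    last_event_sequence: int,
    error: Optional[str] = None,
    recoverable: bool = True,  # assumption: recoverable unless the caller says otherwise
) -> InterruptedEventData:
    """Validate the reason and build the payload dictionary."""
    if reason not in get_args(InterruptReason):
        raise ValueError(f"unknown interruption reason: {reason!r}")
    return {
        "reason": reason,
        "worker_id": worker_id,
        "last_event_sequence": last_event_sequence,
        "error": error,
        "recoverable": recoverable,
    }
```

Validating the reason at creation time keeps unknown values out of the event log, where they would be harder to handle during replay.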

2. New RunStatus: INTERRUPTED

from enum import Enum

class RunStatus(str, Enum):
    INTERRUPTED = "interrupted"  # Recoverable infrastructure failure
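
To illustrate where the new status sits relative to the others, here is a sketch of a status transition check. Only the RUNNING, FAILED and INTERRUPTED statuses named in this issue are modelled; the string values for RUNNING and FAILED, the ALLOWED_TRANSITIONS table and can_transition are assumptions for illustration, not the project's code.

```python
from enum import Enum


class RunStatus(str, Enum):
    RUNNING = "running"          # value assumed
    FAILED = "failed"            # value assumed
    INTERRUPTED = "interrupted"  # recoverable infrastructure failure


# Assumed transition table: an interrupted run either resumes or,
# once recovery is exhausted, fails. FAILED is treated as terminal.
ALLOWED_TRANSITIONS = {
    RunStatus.RUNNING: {RunStatus.INTERRUPTED, RunStatus.FAILED},
    RunStatus.INTERRUPTED: {RunStatus.RUNNING, RunStatus.FAILED},
    RunStatus.FAILED: set(),
}


def can_transition(current: RunStatus, new: RunStatus) -> bool:
    """Return True if moving from `current` to `new` is allowed."""
    return new in ALLOWED_TRANSITIONS[current]
```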

3. Recovery Configuration

@workflow(
    durable=True,
    recover_on_worker_loss=True,   # Default: True for durable, False for transient
    max_recovery_attempts=3,
)
async def my_workflow():
    ...
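
The mode-dependent default for recover_on_worker_loss could be resolved inside the decorator. The sketch below is a stand-in decorator, not PyWorkflow's real one: the __workflow_config__ attribute is hypothetical, and treating 3 as the default for max_recovery_attempts is an assumption taken from the example above.

```python
from typing import Callable, Optional


def workflow(
    *,
    durable: bool,
    recover_on_worker_loss: Optional[bool] = None,  # None means "use the mode default"
    max_recovery_attempts: int = 3,                 # assumed default
) -> Callable:
    """Stand-in decorator that records recovery settings on the function."""
    if max_recovery_attempts < 0:
        raise ValueError("max_recovery_attempts must be >= 0")

    def decorator(fn: Callable) -> Callable:
        # Default: True for durable workflows, False for transient ones.
        resolved = durable if recover_on_worker_loss is None else recover_on_worker_loss
        fn.__workflow_config__ = {  # hypothetical attribute name
            "durable": durable,
            "recover_on_worker_loss": resolved,
            "max_recovery_attempts": max_recovery_attempts,
        }
        return fn

    return decorator
```

Using None as the sentinel lets an explicit recover_on_worker_loss=False on a durable workflow be distinguished from "not specified".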

4. Behavior by Mode

Transient workflows (durable=False):

  • Default: recover_on_worker_loss=False
  • On worker loss: Mark as FAILED (no state to recover from)
  • If recover_on_worker_loss=True: Reschedule from scratch

Durable workflows (durable=True):

  • Default: recover_on_worker_loss=True
  • On worker loss: Record WORKFLOW_INTERRUPTED event, status → INTERRUPTED
  • Reschedule task → replay events → continue from last checkpoint
  • Track recovery attempts to prevent infinite loops
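
The two modes above can be summarised as a small decision function. This is a sketch only: action_on_worker_loss and the action strings are hypothetical, and the durable case with recovery disabled is not spelled out in this issue, so marking it as failed is an assumption.

```python
def action_on_worker_loss(durable: bool, recover_on_worker_loss: bool) -> str:
    """Map a workflow's mode and recovery flag to the action taken on worker loss."""
    if not recover_on_worker_loss:
        # Transient default per the issue; for durable workflows this is assumed.
        return "mark_failed"
    if durable:
        # Record WORKFLOW_INTERRUPTED, reschedule, replay to the last checkpoint.
        return "resume_from_checkpoint"
    # Transient workflows have no recorded state to replay.
    return "restart_from_scratch"
```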

5. Recovery Flow

Worker crashes during workflow execution
    ↓
Celery detects worker loss (task_reject_on_worker_lost=True)
    ↓
on_failure() callback triggered
    ↓
Check workflow config (recover_on_worker_loss flag)
    ↓
If durable & recover_on_worker_loss=True:
    → Record WORKFLOW_INTERRUPTED event
    → Increment recovery_attempts in WorkflowRun
    → If attempts < max_recovery_attempts:
        → Reschedule task
        → New worker picks up
        → Replay events to last checkpoint
        → Continue execution
    → Else:
        → Mark FAILED (exceeded recovery attempts)
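
The durable branch of this flow can be sketched without Celery, using an in-memory run record and a callback in place of task rescheduling. WorkflowRun here is a simplified stand-in for the real schema, and handle_worker_loss and the reschedule callback are hypothetical. One assumption about semantics: the limit is checked before the counter is incremented, so that max_recovery_attempts=3 allows three recoveries; the flow above could also be read as incrementing first.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class WorkflowRun:
    """Simplified stand-in for the stored run record."""
    run_id: str
    recover_on_worker_loss: bool = True
    max_recovery_attempts: int = 3
    recovery_attempts: int = 0
    status: str = "running"
    events: List[Dict] = field(default_factory=list)


def handle_worker_loss(
    run: WorkflowRun,
    worker_id: str,
    reschedule: Callable[[str], None],
) -> str:
    """Apply the recovery flow to a durable run and return its new status."""
    if not run.recover_on_worker_loss:
        run.status = "failed"
        return run.status

    can_recover = run.recovery_attempts < run.max_recovery_attempts
    run.events.append({
        "type": "workflow.interrupted",
        "data": {
            "reason": "worker_lost",
            "worker_id": worker_id,
            # Assumption: the sequence equals the number of events recorded so far.
            "last_event_sequence": len(run.events),
            "error": None,
            "recoverable": can_recover,
        },
    })
    run.status = "interrupted"

    if not can_recover:
        run.status = "failed"  # exceeded recovery attempts
        return run.status

    run.recovery_attempts += 1
    reschedule(run.run_id)  # a new worker would pick the run up and replay events
    return run.status
```

Every interruption is recorded as an event, including the final one that exhausts the limit, so the event log shows why the run ended in FAILED.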

Why New Event Type vs Reusing WORKFLOW_FAILED

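| Aspect           | WORKFLOW_FAILED      | WORKFLOW_INTERRUPTED   |
|------------------|----------------------|------------------------|
| Cause            | Application error    | Infrastructure failure |
| Default action   | Stop (need code fix) | Auto-retry/resume      |
| Who investigates | Developers           | Ops team               |
| Recovery         | Manual rerun         | Automatic              |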

Implementation Status

  • Add WORKFLOW_INTERRUPTED event type and helper function
  • Add RunStatus.INTERRUPTED status
  • Add recovery_attempts field to WorkflowRun schema
  • Add recover_on_worker_loss and max_recovery_attempts config options
  • Implement on_failure() callback in Celery WorkflowTask
  • Implement recovery scheduling logic
  • Update replay mechanism to handle WORKFLOW_INTERRUPTED
  • Write unit tests (18 tests added)
  • Write integration tests (12 tests added)
  • Update documentation
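
As an illustration of the replay item above, the sketch below rebuilds completed step results from an event log while treating workflow.interrupted as a marker that does not change workflow state. The "step.completed" event type, the event dictionary shape and the replay function are assumptions for illustration; the actual replay mechanism in pyworkflow/engine/replay.py may look quite different.

```python
from typing import Any, Dict, List, Tuple


def replay(events: List[Dict[str, Any]]) -> Tuple[Dict[str, Any], int]:
    """Return (completed step results, number of interruptions seen)."""
    results: Dict[str, Any] = {}
    interruptions = 0
    # Process events in recorded order.
    for event in sorted(events, key=lambda e: e["sequence"]):
        if event["type"] == "step.completed":  # hypothetical event type
            results[event["data"]["step_id"]] = event["data"]["result"]
        elif event["type"] == "workflow.interrupted":
            # An interruption carries no workflow state; it is only counted.
            interruptions += 1
    return results, interruptions
```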

Files Modified

  1. pyworkflow/engine/events.py - Add WORKFLOW_INTERRUPTED event type
  2. pyworkflow/storage/schemas.py - Add INTERRUPTED status, recovery_attempts field
  3. pyworkflow/core/workflow.py - Add new decorator parameters
  4. pyworkflow/celery/tasks.py - Implement failure callbacks and recovery logic
  5. pyworkflow/engine/replay.py - Handle WORKFLOW_INTERRUPTED in replay
  6. pyworkflow/config.py - Add recovery configuration options
  7. pyworkflow/storage/base.py - Add update_run_recovery_attempts method
  8. pyworkflow/storage/file.py - Implement update_run_recovery_attempts
  9. pyworkflow/storage/memory.py - Implement update_run_recovery_attempts
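
The storage changes listed above follow a base-plus-backends pattern, which could look roughly like the sketch below. Only the method name update_run_recovery_attempts comes from this issue; its signature, the synchronous style, the class names and the helper methods are assumptions.

```python
from abc import ABC, abstractmethod
from typing import Dict


class StorageBackend(ABC):
    """Stand-in for the abstract storage interface."""

    @abstractmethod
    def update_run_recovery_attempts(self, run_id: str, recovery_attempts: int) -> None:
        """Persist the recovery attempt counter for a run."""


class InMemoryStorage(StorageBackend):
    """Stand-in in-memory backend keeping counters in a dictionary."""

    def __init__(self) -> None:
        self._recovery_attempts: Dict[str, int] = {}

    def create_run(self, run_id: str) -> None:
        # Hypothetical helper: register a run with a zeroed counter.
        self._recovery_attempts[run_id] = 0

    def get_recovery_attempts(self, run_id: str) -> int:
        # Hypothetical helper: read the counter back.
        return self._recovery_attempts[run_id]

    def update_run_recovery_attempts(self, run_id: str, recovery_attempts: int) -> None:
        if run_id not in self._recovery_attempts:
            raise KeyError(f"unknown run: {run_id}")
        if recovery_attempts < 0:
            raise ValueError("recovery_attempts must be >= 0")
        self._recovery_attempts[run_id] = recovery_attempts
```

Declaring the method abstract on the base class means any backend that does not implement it fails at instantiation rather than at recovery time.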

Test Coverage

  • Unit tests: 18 new tests covering event creation, status transitions, config parsing
  • Integration tests: 12 new tests covering recovery flow, event replay, status transitions
  • All tests passing: 98 total tests
