feat: Worker failure detection and recovery for fault tolerance

## Summary

Implement worker failure detection and automatic recovery for both transient and durable workflows in PyWorkflow with Celery.

## Problem

When a Celery worker crashes mid-execution:
- The workflow stays in `RUNNING` status indefinitely (orphaned)
- Application failures (bugs) cannot be distinguished from infrastructure failures (worker crashes)
- The current config (`task_reject_on_worker_lost=True`) requeues the task, but the workflow state is never updated
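
For context, Celery documents `task_reject_on_worker_lost` as relevant when late acknowledgement is enabled. A minimal settings fragment in module style is sketched below; whether PyWorkflow already sets `task_acks_late` is an assumption, since the actual config is not shown in this issue.

```python
# Celery settings fragment (module-style config).
# Both names are standard Celery settings; enabling task_acks_late here
# is an assumption about the project's configuration.
task_acks_late = True               # acknowledge the message after the task finishes
task_reject_on_worker_lost = True   # requeue the message if the worker process dies
```

Requeueing only redelivers the message. Nothing in these settings touches the `WorkflowRun` record, which is the gap described above.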

## Proposed Solution

### 1. New Event Type: `WORKFLOW_INTERRUPTED`

Introduce a new event type to distinguish infrastructure failures from application errors:

```python
from enum import Enum

class EventType(str, Enum):
    # Existing members omitted; only the new one is shown.
    WORKFLOW_INTERRUPTED = "workflow.interrupted"
```

Event data:
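```python
from typing import Literal, Optional, TypedDict

# Class name is illustrative; only the fields come from this proposal.
class WorkflowInterruptedData(TypedDict):
    reason: Literal["worker_lost", "timeout", "signal"]
    worker_id: str
    last_event_sequence: int
    error: Optional[str]
    recoverable: bool
```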
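
A helper that builds this payload could look roughly like the sketch below. The function name `build_interrupted_data`, its defaults, and the validation step are assumptions for illustration, not the helper actually added to `pyworkflow/engine/events.py`.

```python
from typing import Optional

# Allowed values for "reason", taken from the event data description above.
VALID_REASONS = {"worker_lost", "timeout", "signal"}


def build_interrupted_data(
    worker_id: str,
    last_event_sequence: int,
    reason: str = "worker_lost",
    error: Optional[str] = None,
    recoverable: bool = True,
) -> dict:
    """Build the data payload for a WORKFLOW_INTERRUPTED event (hypothetical helper)."""
    if reason not in VALID_REASONS:
        raise ValueError(f"unknown interruption reason: {reason!r}")
    return {
        "reason": reason,
        "worker_id": worker_id,
        "last_event_sequence": last_event_sequence,
        "error": error,
        "recoverable": recoverable,
    }
```

Rejecting unknown reasons up front keeps the event log limited to the three values listed in the payload description.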

### 2. New RunStatus: `INTERRUPTED`

```python
from enum import Enum

class RunStatus(str, Enum):
    # Existing members omitted; only the new one is shown.
    INTERRUPTED = "interrupted"  # Recoverable infrastructure failure
```

### 3. Recovery Configuration

```python
@workflow(
    durable=True,
    recover_on_worker_loss=True,   # Default: True for durable, False for transient
    max_recovery_attempts=3,
)
async def my_workflow():
    ...
```

### 4. Behavior by Mode

**Transient workflows (`durable=False`):**
- Default: `recover_on_worker_loss=False`
- On worker loss: Mark as FAILED (no state to recover from)
- If `recover_on_worker_loss=True`: Reschedule from scratch

**Durable workflows (`durable=True`):**
- Default: `recover_on_worker_loss=True`
- On worker loss: Record `WORKFLOW_INTERRUPTED` event, status → INTERRUPTED
- Reschedule task → replay events → continue from last checkpoint
- Track recovery attempts to prevent infinite loops
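
The per-mode defaults above can be summarized as a small decision function. This is a sketch under stated assumptions: the names `WorkerLossAction`, `resolve_recover_flag`, and `action_on_worker_loss` are hypothetical, and treating a durable workflow with `recover_on_worker_loss=False` as FAILED is an assumption, since the proposal does not spell out that case.

```python
from enum import Enum
from typing import Optional


class WorkerLossAction(str, Enum):
    FAIL = "fail"                       # mark the run FAILED
    RESTART = "restart_from_scratch"    # transient: reschedule with no prior state
    RESUME = "resume_from_checkpoint"   # durable: replay events, then continue


def resolve_recover_flag(durable: bool, recover_on_worker_loss: Optional[bool] = None) -> bool:
    """Apply the default: True for durable workflows, False for transient ones."""
    return durable if recover_on_worker_loss is None else recover_on_worker_loss


def action_on_worker_loss(durable: bool, recover_on_worker_loss: Optional[bool] = None) -> WorkerLossAction:
    """Pick the action taken when the worker running the workflow is lost."""
    if not resolve_recover_flag(durable, recover_on_worker_loss):
        # Assumption: opting out of recovery means FAILED in both modes.
        return WorkerLossAction.FAIL
    return WorkerLossAction.RESUME if durable else WorkerLossAction.RESTART
```

For example, `action_on_worker_loss(durable=False)` yields `FAIL`, while `action_on_worker_loss(durable=False, recover_on_worker_loss=True)` yields `RESTART`.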

### 5. Recovery Flow

```
Worker crashes during workflow execution
    ↓
Celery detects worker loss (task_reject_on_worker_lost=True)
    ↓
on_failure() callback triggered
    ↓
Check workflow config (recover_on_worker_loss flag)
    ↓
If durable & recover_on_worker_loss=True:
    → Record WORKFLOW_INTERRUPTED event
    → Increment recovery_attempts in WorkflowRun
    → If attempts < max_recovery_attempts:
        → Reschedule task
        → New worker picks up
        → Replay events to last checkpoint
        → Continue execution
    → Else:
        → Mark FAILED (exceeded recovery attempts)
```
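
The durable branch of this flow can be sketched with in-memory stand-ins. The `Run` dataclass, `handle_worker_loss`, and the `reschedule` callable are hypothetical and replace the real storage and Celery scheduling calls; the event payload is abbreviated.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Run:
    """Stand-in for a WorkflowRun record (field names assumed)."""
    run_id: str
    status: str = "running"
    recovery_attempts: int = 0
    events: List[dict] = field(default_factory=list)


def handle_worker_loss(
    run: Run,
    max_recovery_attempts: int,
    reschedule: Callable[[str], None],
    worker_id: str = "unknown",
) -> str:
    """Follow the flow above for a durable run with recover_on_worker_loss=True."""
    # 1. Record the WORKFLOW_INTERRUPTED event (abbreviated payload).
    run.events.append({
        "type": "workflow.interrupted",
        "data": {"reason": "worker_lost", "worker_id": worker_id},
    })
    # 2. Increment the attempt counter.
    run.recovery_attempts += 1
    # 3. Reschedule while under the limit, otherwise give up.
    if run.recovery_attempts < max_recovery_attempts:
        run.status = "interrupted"
        reschedule(run.run_id)
    else:
        run.status = "failed"
    return run.status
```

Following the flow literally (increment, then `attempts < max_recovery_attempts`), `max_recovery_attempts=3` allows two reschedules before the run is marked FAILED. If three reschedules are intended, the comparison would need to be `<=`.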

## Why New Event Type vs Reusing WORKFLOW_FAILED

| Aspect | WORKFLOW_FAILED | WORKFLOW_INTERRUPTED |
|--------|-----------------|---------------------|
| Cause | Application error | Infrastructure failure |
| Default action | Stop (needs a code fix) | Auto-retry/resume |
| Who investigates | Developers | Ops team |
| Recovery | Manual rerun | Automatic |
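
One way to route between the two event types is by exception class in the failure callback. The sketch below uses a local stand-in for `celery.exceptions.WorkerLostError` so it runs without Celery; the function name `event_type_for` and the `"workflow.failed"` string are assumptions made by analogy with `"workflow.interrupted"`.

```python
class WorkerLostError(Exception):
    """Local stand-in for celery.exceptions.WorkerLostError."""


def event_type_for(exc: BaseException) -> str:
    """Map a task failure to the event type that should be recorded."""
    if isinstance(exc, WorkerLostError):
        # Infrastructure failure: eligible for automatic recovery.
        return "workflow.interrupted"
    # Anything else is treated as an application error.
    return "workflow.failed"
```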

## Implementation Status

- [x] Add `WORKFLOW_INTERRUPTED` event type and helper function
- [x] Add `RunStatus.INTERRUPTED` status
- [x] Add `recovery_attempts` field to WorkflowRun schema
- [x] Add `recover_on_worker_loss` and `max_recovery_attempts` config options
- [x] Implement `on_failure()` callback in Celery WorkflowTask
- [x] Implement recovery scheduling logic
- [x] Update replay mechanism to handle WORKFLOW_INTERRUPTED
- [x] Write unit tests (18 tests added)
- [x] Write integration tests (12 tests added)
- [x] Update documentation

## Files Modified

1. `pyworkflow/engine/events.py` - Add `WORKFLOW_INTERRUPTED` event type
2. `pyworkflow/storage/schemas.py` - Add `INTERRUPTED` status, `recovery_attempts` field
3. `pyworkflow/core/workflow.py` - Add new decorator parameters
4. `pyworkflow/celery/tasks.py` - Implement failure callbacks and recovery logic
5. `pyworkflow/engine/replay.py` - Handle `WORKFLOW_INTERRUPTED` in replay
6. `pyworkflow/config.py` - Add recovery configuration options
7. `pyworkflow/storage/base.py` - Add `update_run_recovery_attempts` method
8. `pyworkflow/storage/file.py` - Implement `update_run_recovery_attempts`
9. `pyworkflow/storage/memory.py` - Implement `update_run_recovery_attempts`
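
For illustration, an in-memory `update_run_recovery_attempts` might look like the sketch below. The class name, the synchronous signature, the argument names, and the record layout are all assumptions; the real storage backends may differ, for example by being async.

```python
from typing import Dict


class InMemoryRunStore:
    """Hypothetical minimal store holding run records in a dict."""

    def __init__(self) -> None:
        self._runs: Dict[str, dict] = {}

    def create_run(self, run_id: str) -> None:
        self._runs[run_id] = {"status": "running", "recovery_attempts": 0}

    def get_run(self, run_id: str) -> dict:
        # Return a copy so callers cannot mutate stored state by accident.
        return dict(self._runs[run_id])

    def update_run_recovery_attempts(self, run_id: str, recovery_attempts: int) -> None:
        """Persist the new attempt count for an existing run."""
        if run_id not in self._runs:
            raise KeyError(run_id)
        if recovery_attempts < 0:
            raise ValueError("recovery_attempts must be non-negative")
        self._runs[run_id]["recovery_attempts"] = recovery_attempts
```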

## Test Coverage

- **Unit tests**: 18 new tests covering event creation, status transitions, config parsing
- **Integration tests**: 12 new tests covering recovery flow, event replay, status transitions
- **All tests passing**: 98 total tests
