feat(storage): add configurable data retention policy #239
Adds a daily Celery Beat task that purges completed/failed/cancelled/continued_as_new/interrupted workflow runs older than a configurable cutoff, preventing unbounded storage growth.
Key changes:
- PyWorkflowConfig.data_retention_days (None = keep forever) loaded from PYWORKFLOW_DATA_RETENTION_DAYS env var or retention_days YAML key
- StorageBackend.delete_old_runs(older_than) abstract method implemented in all 7 backends (postgres, sqlite, mysql, memory, file, dynamodb, cassandra); citus inherits from postgres
- run_data_retention_task: singleton Celery task, self-skips when retention is unconfigured, release_lock_on_failure=True so a crash never blocks the next day's run
- beat_schedule entry: runs every 24 hours on pyworkflow.default queue
- 13 unit tests covering delete semantics, related-data cleanup, and config loading
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
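The configuration and scheduling behaviour described in the commit message can be sketched roughly as follows. This is an illustrative sketch, not the PR's code: the helper names `load_retention_days` and `retention_cutoff`, the schedule key `"data-retention"`, and the rejection of non-positive values are all assumptions. Only the env var name, the task name, the queue, and the 24-hour interval come from the PR text.

```python
import os
from datetime import datetime, timedelta, timezone
from typing import Mapping, Optional


def load_retention_days(env: Mapping[str, str] = os.environ) -> Optional[int]:
    """Return the retention window in days, or None to keep runs forever."""
    raw = env.get("PYWORKFLOW_DATA_RETENTION_DAYS", "").strip()
    if not raw:
        return None  # unset or empty: retention is unconfigured
    days = int(raw)  # raises ValueError on non-numeric input
    if days <= 0:
        # Assumption: the PR text does not say how zero/negative values are handled.
        raise ValueError("PYWORKFLOW_DATA_RETENTION_DAYS must be a positive integer")
    return days


def retention_cutoff(days: Optional[int], now: datetime) -> Optional[datetime]:
    """Return the purge cutoff, or None when the task should self-skip."""
    if days is None:
        return None
    return now - timedelta(days=days)


# A beat schedule entry in the shape Celery's beat_schedule setting accepts.
# The key "data-retention" is hypothetical; task name and queue are from the PR text.
BEAT_SCHEDULE = {
    "data-retention": {
        "task": "pyworkflow.run_data_retention",
        "schedule": timedelta(hours=24),
        "options": {"queue": "pyworkflow.default"},
    }
}

if __name__ == "__main__":
    cutoff = retention_cutoff(load_retention_days(), datetime.now(timezone.utc))
    print("skip (retention unconfigured)" if cutoff is None else f"purge before {cutoff}")
```

Returning `None` for the cutoff gives the task a single check to decide whether to self-skip, which matches the "None = keep forever" semantics above.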
🔄 New commits pushed, requesting re-review. Commit:
🔍 Code Review Agent, Tier 3
Commit:
I have now read all the relevant files. Let me compile the review.
Summary
This PR adds a configurable data retention policy to PyWorkflow. It introduces
Risk Assessment
Tier 3, confirmed. The PR touches
Issues
1. [blocking] The Cassandra
2. [blocking]
3. [blocking] When hook files are deleted in
4. [warning]
Architecture
All changes comply with the
Test Coverage
Test coverage is inadequate for a Tier 3 PR:
The two blocking file-backend bugs (issues 2 and 3) would have been caught by a basic integration test against
🤖 Code Review Agent: automated code review.
Summary
- Adds data_retention_days to PyWorkflowConfig (env: PYWORKFLOW_DATA_RETENTION_DAYS, YAML: retention_days); None = keep forever
- StorageBackend.delete_old_runs(older_than) method implemented across all 7 backends (postgres, sqlite, mysql, memory, file, dynamodb, cassandra); citus inherits from postgres unchanged
- Celery Beat task pyworkflow.run_data_retention: singleton, self-skips when unconfigured, release_lock_on_failure=True to prevent stale locks blocking the next day's run
Retention scope
Deletes runs in terminal states (completed, failed, cancelled, continued_as_new, interrupted) where updated_at < now - data_retention_days. Active states (running, suspended, pending) are never deleted. Related rows (events, steps, hooks, cancellation flags) are deleted in the same transaction/operation.
Risk tier
Tier 2 — new feature touching storage layer and Celery task registration; no changes to existing execution paths or replay logic.
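To make the retention scope above concrete, here is a minimal in-memory sketch of the delete semantics: terminal-state runs older than the cutoff are removed along with their related data, and active runs are left alone. `MemoryStore` is a hypothetical stand-in, not PyWorkflow's actual memory backend. Its dict layout and the integer return value are assumptions, and only events are modelled as related data.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

# Terminal states as listed in the PR description.
TERMINAL_STATES = {"completed", "failed", "cancelled", "continued_as_new", "interrupted"}


@dataclass
class MemoryStore:
    """Hypothetical stand-in for an in-memory backend (not the real class)."""

    # run_id -> {"status": str, "updated_at": datetime}
    runs: dict = field(default_factory=dict)
    # run_id -> list of events; steps, hooks and cancellation flags
    # would follow the same pattern and are omitted for brevity.
    events: dict = field(default_factory=dict)

    def delete_old_runs(self, older_than: datetime) -> int:
        """Delete terminal runs last updated before `older_than`.

        Returns the number of runs removed (an assumed return value).
        """
        expired = [
            run_id
            for run_id, run in self.runs.items()
            if run["status"] in TERMINAL_STATES and run["updated_at"] < older_than
        ]
        for run_id in expired:
            del self.runs[run_id]
            self.events.pop(run_id, None)  # related-data cleanup
        return len(expired)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    store = MemoryStore(
        runs={
            "a": {"status": "completed", "updated_at": now - timedelta(days=40)},
            "b": {"status": "running", "updated_at": now - timedelta(days=40)},
        },
        events={"a": ["started", "finished"], "b": ["started"]},
    )
    removed = store.delete_old_runs(now - timedelta(days=30))
    print(removed, sorted(store.runs))
```

Collecting the expired ids before deleting avoids mutating the dict while iterating over it. A SQL backend would presumably express the same filter as a single WHERE clause inside one transaction.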
Test plan
- pytest tests/unit/test_retention.py: 13/13 passing
- pytest tests/unit/: 769/769 passing (no regressions)
- ruff check: clean
- mypy pyworkflow/: no new errors introduced
🤖 Generated with Claude Code