Phase 2 W3c — Active/Standby Controller (OPS-04 / INVARIANT 33)#14
Merged
…rtial) PG session advisory lock (pg_try_advisory_lock) for app-level leader election; standby → recovering → active state machine; /health/active LB target; INVARIANT-33 503 barrier on the three executor-loop endpoints. Zero schema changes. Single-shared-PG design — PG HA is Phase 4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…an (8 tasks) 3 milestones: lock primitive (LeaderElector + run_leader_loop), leader loop + health endpoint + recovery barrier + lifespan restructure, chaos drill + OpenAPI + runbook + PR. TDD bite-sized steps, complete code, zero alembic migrations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…errors (W3c) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…test (W3c) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nts (W3c) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…CT_RECOVERY (W3c) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Captures: active_lock_id bigint upper bound + env-override test, _db_url docstring correction + cleanup debug log, run_leader_loop split into recovering/active branches + on_active exception guard, conftest helper defaults, recovery barrier wiring, lifespan shutdown await fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- App-level leader election via `pg_try_advisory_lock` on a dedicated NullPool engine. PG auto-releases the lock when the lock-holding session ends → crash-failover with no lease/expiry logic. Closes Phase 2 exit "Active/standby switch RTO ≤ 10min (CH-Q1)".
- `run_leader_loop` drives `standby → recovering → active` with `on_promote` (runs `run_recovery_routine`) / `on_active` (spawns sweep) / `on_step_down` (cancels sweep) callbacks. A failure in `on_promote` or `on_active` fails safe: an `on_promote` failure stays in `recovering` and retries; an `on_active` failure reverts to `recovering`.
- `main.py` lifespan restructured: recovery + sweep are now leader-gated; `DLW_STRICT_RECOVERY` env knob deleted (the leader loop's retry-on-promote-failure replaces it). W3a auth bootstrap stays unconditional so promotion is instant.
- `GET /health/active`: LB target (200 iff this instance holds the lock).
- `require_not_recovering` 503 barrier on heartbeat / poll / report (INVARIANT 33). HF proxy intentionally NOT barriered (streaming passthrough doesn't mutate state).
- `make_app_with_state` conftest helper; 7 existing fixture sites migrated to it (DRY + correctness).
- Chaos drill (`tests/e2e/test_failover_drill.py`): two real `LeaderElector` instances against the test PG; kill the active; assert the standby promotes in ≤ 3 × `poll_interval`.

Spec: `docs/superpowers/specs/2026-05-15-phase-2-w3c-active-standby-design.md`
Plan: `docs/superpowers/plans/2026-05-15-phase-2-w3c-active-standby.md`

Test plan
- `uv run pytest tests/services/test_leader_election.py`: 6 LeaderElector cases (acquire / release / crash-failover / verify)
- `uv run pytest tests/services/test_leader_loop.py`: 4 state-machine cases (polling, recovery retry, step-down, shutdown)
- `uv run pytest tests/api/test_health_active.py`: 3 cases (standby 503 / recovering 200 / active 200)
- `uv run pytest tests/api/test_recovery_barrier.py`: 3 cases (heartbeat/poll/report 503 while recovering, active passthrough)
- `uv run pytest tests/executor/test_client.py::test_controller_recovering_503_is_retried_by_tenacity`: proves tenacity already handles the 503
- `uv run pytest tests/e2e/test_failover_drill.py`: 3 chaos-drill cases (mutual exclusion, RTO promotion, recovery callback fires)
- `uv run pytest -q`: full suite green (263 passed, 1 deselected)
- `uv run python tools/lint_invariants.py`: OK

Known minor follow-ups (non-blocking, from final review)
- `on_promote` has a bounded "harmless duplicate sweeps" window (spec §6 accepts this; SKIP LOCKED at the DB layer + idempotent `run_recovery_routine` make it safe).
- The chaos drill reaches into the private `LeaderElector._conn` to simulate a crash; this is documented as intentional ("simulating crash, not graceful release"), but a public `_simulate_crash()` test seam would be cleaner future hygiene.

🤖 Generated with Claude Code
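
For reviewers who want the `standby → recovering → active` transitions and the two status rules from the Summary in one place, here is a minimal synchronous sketch. It is not the code in this PR: the real `run_leader_loop` is not shown on this page, and `LeaderState`, `step`, `still_held`, `health_active_status` and `barrier_status` are hypothetical names chosen for illustration. The sketch also assumes the barrier blocks only the `recovering` state, since the PR does not say how a standby answers the executor-loop endpoints.

```python
from enum import Enum
from typing import Callable


class LeaderState(str, Enum):
    STANDBY = "standby"
    RECOVERING = "recovering"
    ACTIVE = "active"


def step(
    state: LeaderState,
    *,
    try_acquire: Callable[[], bool],   # hypothetical: wraps pg_try_advisory_lock
    still_held: Callable[[], bool],    # hypothetical: verifies the lock session is alive
    on_promote: Callable[[], None],
    on_active: Callable[[], None],
    on_step_down: Callable[[], None],
) -> LeaderState:
    """Advance the state machine by one poll tick and return the new state."""
    if state is LeaderState.STANDBY:
        # Holding the lock is not enough to serve: recovery has to run first.
        return LeaderState.RECOVERING if try_acquire() else LeaderState.STANDBY

    if not still_held():
        # Lock lost: stop leader-only work and go back to polling.
        on_step_down()
        return LeaderState.STANDBY

    if state is LeaderState.RECOVERING:
        try:
            on_promote()   # recovery routine (assumed idempotent, so retries are safe)
            on_active()    # start leader-only background work
        except Exception:
            # Fail safe: stay in (or revert to) recovering and retry next tick.
            return LeaderState.RECOVERING
        return LeaderState.ACTIVE

    return LeaderState.ACTIVE


def health_active_status(state: LeaderState) -> int:
    """LB target: 200 iff this instance holds the lock (recovering or active)."""
    return 503 if state is LeaderState.STANDBY else 200


def barrier_status(state: LeaderState) -> int:
    """Executor-loop endpoints: 503 while recovering (assumption: otherwise pass)."""
    return 503 if state is LeaderState.RECOVERING else 200
```

Keeping each tick a pure function of the current state and the callbacks makes the retry and step-down paths easy to exercise with plain lambdas, without a database.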