
Phase 2 W3c — Active/Standby Controller (OPS-04 / INVARIANT 33)#14

Merged
l17728 merged 15 commits into main from feat/phase-2-w3c-active-standby
May 15, 2026


l17728 commented May 15, 2026

Summary

  • App-level leader election via pg_try_advisory_lock on a dedicated NullPool engine. PG auto-releases the lock when the lock-holding session ends → crash-failover with no lease/expiry logic. Closes the Phase 2 exit criterion "Active/standby switch RTO ≤ 10min (CH-Q1)".
  • run_leader_loop drives standby → recovering → active with on_promote (runs run_recovery_routine) / on_active (spawns sweep) / on_step_down (cancels sweep) callbacks. Failures in on_promote or on_active fail safe: an on_promote failure stays in recovering and retries; an on_active failure reverts to recovering.
  • main.py lifespan restructured — recovery + sweep are now leader-gated; DLW_STRICT_RECOVERY env knob deleted (the leader loop's retry-on-promote-failure replaces it). W3a auth bootstrap stays unconditional so promotion is instant.
  • GET /health/active — LB target (200 iff this instance holds the lock).
  • require_not_recovering 503 barrier on heartbeat / poll / report — INVARIANT 33. HF proxy intentionally NOT barriered (streaming passthrough doesn't mutate state).
  • make_app_with_state conftest helper: 7 existing fixture sites migrated to it (DRY + correctness).
  • Chaos drill (tests/e2e/test_failover_drill.py): two real LeaderElector instances against the test PG; kill the active; assert the standby promotes within 3 × poll_interval.
  • Zero schema changes / no alembic migration, no new runtime deps.
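The standby → recovering → active flow described above can be sketched without a database. This is a minimal illustration, not the PR's run_leader_loop: FakeElector, run_leader_loop_sketch, the max_ticks parameter, and the try_acquire/verify method names are assumptions made for the example. It also simplifies by re-running on_promote after an on_active failure, which relies on recovery being idempotent.

```python
import asyncio
from enum import Enum
from typing import Awaitable, Callable


class Role(str, Enum):
    STANDBY = "standby"
    RECOVERING = "recovering"
    ACTIVE = "active"


class FakeElector:
    """Hypothetical stand-in for LeaderElector: scripted answers, no database."""

    def __init__(self, acquire_results: list[bool]) -> None:
        self._acquire_results = list(acquire_results)
        self.held = False

    async def try_acquire(self) -> bool:
        # Consume one scripted answer per attempt until the lock is held.
        if not self.held and self._acquire_results:
            self.held = self._acquire_results.pop(0)
        return self.held

    async def verify(self) -> bool:
        return self.held


Callback = Callable[[], Awaitable[None]]


async def run_leader_loop_sketch(
    elector: FakeElector,
    on_promote: Callback,
    on_active: Callback,
    on_step_down: Callback,
    max_ticks: int,
    poll_interval: float = 0.0,
) -> list[Role]:
    """Run a bounded number of ticks; return the role observed after each."""
    role = Role.STANDBY
    history: list[Role] = []
    for _ in range(max_ticks):
        if role is Role.STANDBY:
            if await elector.try_acquire():
                role = Role.RECOVERING
        elif role is Role.RECOVERING:
            try:
                await on_promote()  # e.g. the recovery routine
                await on_active()   # e.g. spawn the sweep task
                role = Role.ACTIVE
            except Exception:
                # Fail safe: stay in recovering and retry on the next tick.
                role = Role.RECOVERING
        else:  # Role.ACTIVE
            if not await elector.verify():
                await on_step_down()  # e.g. cancel the sweep task
                role = Role.STANDBY
        history.append(role)
        await asyncio.sleep(poll_interval)
    return history
```

Bounding the loop with max_ticks keeps the sketch easy to test; a long-running service would presumably loop until shutdown instead.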

Spec: docs/superpowers/specs/2026-05-15-phase-2-w3c-active-standby-design.md
Plan: docs/superpowers/plans/2026-05-15-phase-2-w3c-active-standby.md
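The /health/active and require_not_recovering items in the summary amount to a mapping from the controller's role to an HTTP status. The sketch below shows that mapping as plain functions, with no web framework. ControllerRecovering, health_active_status, handle, and the endpoint-name strings are hypothetical; only the status codes and the three barriered endpoints come from the description. How the barrier treats a standby instance is not stated in the PR, so the sketch blocks only the recovering role.

```python
from enum import Enum


class Role(str, Enum):
    STANDBY = "standby"
    RECOVERING = "recovering"
    ACTIVE = "active"


class ControllerRecovering(Exception):
    """Hypothetical error that a web layer would translate to HTTP 503."""

    status_code = 503


# Endpoints the PR places behind the barrier; the HF proxy is not among them.
BARRIERED_ENDPOINTS = frozenset({"heartbeat", "poll", "report"})


def health_active_status(role: Role) -> int:
    """200 iff this instance holds the lock (recovering or active)."""
    return 200 if role in (Role.RECOVERING, Role.ACTIVE) else 503


def require_not_recovering(role: Role) -> None:
    """Reject state-mutating calls while recovery is still running."""
    if role is Role.RECOVERING:
        raise ControllerRecovering("controller is recovering; retry later")


def handle(endpoint: str, role: Role) -> int:
    """Return the status a request to `endpoint` would receive in this sketch."""
    if endpoint in BARRIERED_ENDPOINTS:
        try:
            require_not_recovering(role)
        except ControllerRecovering as exc:
            return exc.status_code
    return 200
```

Keeping the health check and the barrier separate means the load balancer can route to a freshly promoted instance while executor calls are still held off until recovery completes.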

Test plan

  • uv run pytest tests/services/test_leader_election.py — 6 LeaderElector cases (acquire / release / crash-failover / verify)
  • uv run pytest tests/services/test_leader_loop.py — 4 state-machine cases (polling, recovery retry, step-down, shutdown)
  • uv run pytest tests/api/test_health_active.py — 3 cases (standby 503 / recovering 200 / active 200)
  • uv run pytest tests/api/test_recovery_barrier.py — 3 cases (heartbeat/poll/report 503 while recovering, active passthrough)
  • uv run pytest tests/executor/test_client.py::test_controller_recovering_503_is_retried_by_tenacity — proves tenacity already handles the 503
  • uv run pytest tests/e2e/test_failover_drill.py — 3 chaos-drill cases (mutual exclusion, RTO promotion, recovery callback fires)
  • uv run pytest -q — full suite green (263 passed, 1 deselected)
  • uv run python tools/lint_invariants.py — OK
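The executor-client test in the plan relies on the client retrying a 503 from a recovering controller. The PR uses tenacity for this; the sketch below is a standard-library stand-in that only illustrates the behaviour, not the project's retry configuration. call_with_retry, its parameters, and the exponential backoff schedule are assumptions.

```python
import time
from typing import Callable


def call_with_retry(
    send: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a non-503 status or attempts run out.

    `send` is a hypothetical callable that performs one request and
    returns its HTTP status code.
    """
    status = 503
    for attempt in range(max_attempts):
        status = send()
        if status != 503:
            return status
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    return status
```

Injecting `sleep` lets a test record the delays instead of waiting for them.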

Known minor follow-ups (non-blocking, from final review)

  • Lock-loss during in-flight on_promote has a bounded "harmless duplicate sweeps" window (spec §6 accepts this; SKIP LOCKED at the DB layer + idempotent run_recovery_routine make it safe).
  • Failover-drill tests reach into LeaderElector._conn to simulate a crash; this is documented as intentional ("simulating crash, not graceful release"), but a public _simulate_crash() test seam would be cleaner future hygiene.
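One way to picture the suggested test seam is a toy model in which an in-memory table stands in for PostgreSQL's session-level advisory locks. Everything here is hypothetical (InMemoryAdvisoryLocks, SketchElector, and their methods); the real LeaderElector talks to PG, and this sketch only mirrors two properties the PR depends on: one holder at a time, and release when the holding session ends.

```python
from typing import Optional


class InMemoryAdvisoryLocks:
    """Toy model of session-scoped advisory locks: lock id -> owning session."""

    def __init__(self) -> None:
        self._owners: dict[int, int] = {}
        self._next_session = 0

    def open_session(self) -> int:
        self._next_session += 1
        return self._next_session

    def try_lock(self, session: int, lock_id: int) -> bool:
        # Non-blocking: succeeds if the lock is free or already ours.
        owner = self._owners.setdefault(lock_id, session)
        return owner == session

    def close_session(self, session: int) -> None:
        # Models the server dropping a session's locks when the session ends.
        for lock_id in [k for k, v in self._owners.items() if v == session]:
            del self._owners[lock_id]


class SketchElector:
    """Hypothetical elector that holds one session against the toy lock table."""

    def __init__(self, server: InMemoryAdvisoryLocks, lock_id: int) -> None:
        self._server = server
        self._lock_id = lock_id
        self._session: Optional[int] = None

    def try_acquire(self) -> bool:
        if self._session is None:
            self._session = self._server.open_session()
        return self._server.try_lock(self._session, self._lock_id)

    def _simulate_crash(self) -> None:
        """Test seam: end the session abruptly, with no graceful release call."""
        if self._session is not None:
            self._server.close_session(self._session)
            self._session = None
```

With a seam like this, a drill can kill the active instance through a named method rather than by reaching into a private connection attribute.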

🤖 Generated with Claude Code

l17728 and others added 15 commits May 15, 2026 16:29
…rtial)

PG session advisory lock (pg_try_advisory_lock) for app-level leader election;
standby → recovering → active state machine; /health/active LB target;
INVARIANT-33 503 barrier on the three executor-loop endpoints. Zero schema
changes. Single-shared-PG design — PG HA is Phase 4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…an (8 tasks)

3 milestones: lock primitive (LeaderElector + run_leader_loop), leader loop +
health endpoint + recovery barrier + lifespan restructure, chaos drill + OpenAPI
+ runbook + PR. TDD bite-sized steps, complete code, zero alembic migrations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…errors (W3c)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…test (W3c)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nts (W3c)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…CT_RECOVERY (W3c)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Captures: active_lock_id bigint upper bound + env-override test, _db_url
docstring correction + cleanup debug log, run_leader_loop split into
recovering/active branches + on_active exception guard, conftest helper
defaults, recovery barrier wiring, lifespan shutdown await fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
l17728 merged commit db86791 into main May 15, 2026
12 checks passed
l17728 deleted the feat/phase-2-w3c-active-standby branch May 15, 2026 10:28