feat(audit_dispatch): Phase 6 — crash-safe persistent lock + dual-PID…#68

Merged
Velascat merged 1 commit into main from phase6-dispatch-control
May 4, 2026
Conversation

Owner

@Velascat Velascat commented May 4, 2026

… tracking

Replace in-memory lock with on-disk PersistentLockStore so audit dispatch survives OperationsCenter restarts and is exclusive across OC processes.

Slice A — lock_store.py:

  • PersistentLockPayload (frozen dataclass) carries oc_pid + audit_pid + audit_pgid + run_id + audit_type + command + expected_output_dir
  • Atomic write-tempfile + os.replace
  • fcntl.flock sentinel via existing audit_governance/file_locks helper (lazy-imported to avoid audit_governance package-init cycle)
  • _iter_lock_files() filters out *.lock.lock recursive sentinels
  • LOCK_SCHEMA_VERSION = 1
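
The write-tempfile-then-`os.replace` pattern from Slice A can be sketched roughly as follows. This is a minimal illustration, not the code in lock_store.py: `LockPayloadSketch`, `write_lock_atomically`, the field types, the JSON encoding, and the `schema_version` key are all assumptions made for the example.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional, Tuple

LOCK_SCHEMA_VERSION = 1


@dataclass(frozen=True)
class LockPayloadSketch:
    # Hypothetical stand-in for PersistentLockPayload; field types are assumed.
    oc_pid: int
    audit_pid: Optional[int]
    audit_pgid: Optional[int]
    run_id: str
    audit_type: str
    command: Tuple[str, ...]
    expected_output_dir: str


def write_lock_atomically(path: Path, payload: LockPayloadSketch) -> None:
    """Write the payload to a temp file beside the target, then swap it in."""
    body = {"schema_version": LOCK_SCHEMA_VERSION, **asdict(payload)}
    # The temp file lives in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(body, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # readers see the old file or the new one, never a partial write
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Because the payload is frozen, a later update such as recording the audit PID would presumably be done with `dataclasses.replace` followed by another atomic write.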

Slice B — identity wiring:

  • locks.py refactored as a façade over PersistentLockStore; preserves ManagedRepoAuditLock context-manager surface
  • acquire_audit_lock(repo_id, *, run_id, audit_type, oc_pid, command, expected_run_status_path) carries identity into the payload
  • executor.execute(..., on_spawn=callback) patches audit_pid + audit_pgid immediately after Popen
  • api.py resolves absolute expected output dir, threads through
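
To illustrate the `on_spawn` hook described above, here is a small POSIX-only sketch of a runner that reports the child's PID and process group ID right after `Popen`. `execute_sketch` and the `(pid, pgid)` callback signature are assumptions for illustration; the real `executor.execute` signature is not shown in this PR description.

```python
import os
import subprocess
import sys
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical callback shape: (audit_pid, audit_pgid).
SpawnCallback = Callable[[int, int], None]


def execute_sketch(command: Sequence[str], on_spawn: Optional[SpawnCallback] = None) -> int:
    """Start the command in its own session and report its PID/PGID immediately."""
    proc = subprocess.Popen(list(command), start_new_session=True)
    if on_spawn is not None:
        # Called before waiting, so the lock payload can be patched while the audit runs.
        on_spawn(proc.pid, os.getpgid(proc.pid))
    return proc.wait()


if __name__ == "__main__":
    recorded: List[Tuple[int, int]] = []
    code = execute_sketch(
        [sys.executable, "-c", "pass"],
        on_spawn=lambda pid, pgid: recorded.append((pid, pgid)),
    )
    print(code, recorded)
```

Starting the child in a new session makes it a process-group leader, which is one way to obtain a PGID that identifies only the audit and its descendants.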

Slice C — stale reclaim:

  • Lock alive iff any recorded PID is alive (os.kill(pid, 0))
  • Lazy first-use sweep_stale() recovers from OC crash
  • Corrupt lock files treated as stale; new LockStoreCorruptError + StaleLockReclaimedWarning
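
The liveness rule in Slice C can be sketched as below. The helper names are hypothetical, and treating `PermissionError` as "alive" is an assumption about how the edge case might be handled, not something this PR states.

```python
import os
from typing import Iterable, Optional


def pid_is_alive(pid: Optional[int]) -> bool:
    """Probe a PID with signal 0, which checks existence without delivering a signal."""
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # assumption: the process exists but belongs to another user
    return True


def lock_is_alive(recorded_pids: Iterable[Optional[int]]) -> bool:
    """A lock counts as alive if any recorded PID (OC or audit) is still running."""
    return any(pid_is_alive(pid) for pid in recorded_pids)
```

Under this rule a lock left behind by a crashed OC process is still honored while its audit child keeps running, and becomes reclaimable only once every recorded PID is gone.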

Slice D — CLI commands:

  • list-active (Rich table or --json) shows all held locks + per-PID liveness
  • unlock --repo X [--force] refuses to release a lock with any alive PID unless --force
  • dispatch <repo> <type> positional alias for run

Slice E — cross-process concurrency proof:

  • test_lock_store_concurrency.py spawns two real subprocesses competing for the same repo lock; asserts exactly one acquires
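
The shape of such a cross-process test can be sketched like this. It is not the contents of test_lock_store_concurrency.py: it races two children on a raw non-blocking `fcntl.flock` instead of going through PersistentLockStore, and `race_for_lock` and the child script are hypothetical.

```python
import subprocess
import sys
from pathlib import Path
from typing import List

# Each child tries a non-blocking exclusive flock, reports the outcome,
# then keeps its file handle open until the parent closes its stdin.
CHILD = r"""
import fcntl, sys
with open(sys.argv[1], "a") as handle:
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("acquired", flush=True)
    except BlockingIOError:
        print("busy", flush=True)
    sys.stdin.read()
"""


def race_for_lock(lock_path: Path) -> List[str]:
    """Spawn two real subprocesses competing for one lock file; return what each reported."""
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", CHILD, str(lock_path)],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in range(2)
    ]
    # Read both outcomes before releasing anyone, so the winner still holds the lock
    # when the loser makes its attempt.
    results = [proc.stdout.readline().strip() for proc in procs]
    for proc in procs:
        proc.stdin.close()
        proc.wait()
        proc.stdout.close()
    return results
```

A test built on this would assert `sorted(race_for_lock(path)) == ["acquired", "busy"]`, that is, exactly one acquirer.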

Slice F — in-flight watcher:

  • watcher.py exposes poll_run_status(expected_output_dir, run_id) iterator yielding RunStatusSnapshot on each on-disk content change; terminates on completed/failed/interrupted; locates VF buckets by run_id substring; no watchdog runtime dep
  • watch CLI command streams transitions until terminal status
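
A polling watcher of this kind can be sketched without any file-watching dependency. The version below is simplified and hypothetical: it takes the status file path directly rather than `(expected_output_dir, run_id)`, skips bucket discovery, and assumes the status file is JSON with a `status` key.

```python
import json
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator, Optional

TERMINAL_STATUSES = frozenset({"completed", "failed", "interrupted"})


@dataclass(frozen=True)
class SnapshotSketch:
    # Hypothetical stand-in for RunStatusSnapshot.
    status: str
    raw: str


def poll_status_file(
    status_path: Path, interval: float = 0.5, max_polls: Optional[int] = None
) -> Iterator[SnapshotSketch]:
    """Yield a snapshot each time the file content changes; stop on a terminal status."""
    last_raw: Optional[str] = None
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        try:
            raw: Optional[str] = status_path.read_text(encoding="utf-8")
        except FileNotFoundError:
            raw = None  # the run has not written its status file yet
        if raw is not None and raw != last_raw:
            last_raw = raw
            try:
                status = str(json.loads(raw).get("status", "unknown"))
            except (json.JSONDecodeError, AttributeError):
                status = "unknown"  # e.g. a partially written or non-object file
            yield SnapshotSketch(status=status, raw=raw)
            if status in TERMINAL_STATUSES:
                return
        time.sleep(interval)
```

Comparing content rather than modification time keeps the loop independent of filesystem timestamp granularity, at the cost of re-reading the file on every poll.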

Tests: 64 new (22 lock_store + 6 locks (Slice B+C) + 9 cli + 2 cross-process + 4 watcher + 21 dispatch alias / unlock / list-active flavors). Full unit suite: 2041 pass (architecture_invariants pre-broken).

Summary

Changes

Invariant Checklist

  • Contracts remain single-sourced in contracts/
  • Policy gate is still enforced before adapter dispatch
  • No alternate execution paths introduced
  • Failures are explicit — no silent fallback routing

Testing

  • Tests pass: .venv/bin/python -m pytest tests/ -v
  • Linter passes: ruff check src/
  • New behavior is covered by tests
  • Policy-blocking case tested (if policy-related change)

Related Issues

Notes for Reviewer


Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Velascat Velascat merged commit c5729f9 into main May 4, 2026
4 of 10 checks passed
@Velascat Velascat deleted the phase6-dispatch-control branch May 4, 2026 15:11