Parent: #204 | Phase 4: Hardening
Revised: Edge cases updated for container-per-town model (container OOM, ephemeral disk, process-level isolation).
Goal
Handle edge cases and failure modes gracefully.
Edge Cases
- Split-brain: Two processes for the same agent (race on restart) → Rig DO enforces single-writer per agent, container checks DO state before starting
- Concurrent writes to same bead: SQLite serialization in DO handles this, but add optimistic locking for cross-DO operations
- DO eviction during alarm: Alarms are durable and will re-fire
- Container OOM: Kills all agents. DO alarms detect dead agents, new container starts, agents re-dispatched from DO state
- Container sleep during active work: Agents must have pushed to remote. DO re-dispatches on wake. Checkpoint data in DO enables resumption
- Gateway outage: Agent retries built into Kilo CLI; escalation if persistent
- Partial agentDone: What if the polecat pushed the branch but the gt_done call failed? Checkpoint-based recovery
- Duplicate mail delivery: Idempotency on mail delivery marking
- Convoy with failed beads: Policy for partial convoy completion
- Git worktree conflicts: Two agents accidentally assigned same branch → Rig DO enforces unique branch per agent
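
Two of the mitigations above (optimistic locking for cross-DO bead writes and idempotent mail delivery marking) can be sketched roughly as follows. This is a minimal in-memory illustration, not the actual Rig DO code: `BeadStore`, `Mailbox`, `updateIfVersion`, `markDelivered`, and the `version` field are all hypothetical names, and a real implementation would presumably persist this state in the DO's SQLite storage.

```typescript
// Hypothetical in-memory stand-ins for DO-backed storage; all names are illustrative.
interface Bead {
  id: string;
  version: number; // assumed counter, incremented on every successful write
  status: string;
}

class BeadStore {
  private beads = new Map<string, Bead>();

  put(bead: Bead): void {
    this.beads.set(bead.id, { ...bead });
  }

  get(id: string): Bead | undefined {
    const bead = this.beads.get(id);
    return bead ? { ...bead } : undefined;
  }

  // Optimistic lock: apply the write only if the caller read the latest version.
  updateIfVersion(id: string, expectedVersion: number, status: string): boolean {
    const current = this.beads.get(id);
    if (!current || current.version !== expectedVersion) {
      return false; // stale read or missing bead; caller should re-read and retry
    }
    this.beads.set(id, { id, version: expectedVersion + 1, status });
    return true;
  }
}

class Mailbox {
  private delivered = new Set<string>();

  // Idempotent marking: returns true only the first time a message id is marked.
  markDelivered(messageId: string): boolean {
    if (this.delivered.has(messageId)) {
      return false; // duplicate delivery; safe to ignore
    }
    this.delivered.add(messageId);
    return true;
  }
}
```

A caller whose `updateIfVersion` returns `false` would re-read the bead and decide whether its write still applies. A caller whose `markDelivered` returns `false` can drop the duplicate without side effects.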
Dependencies
- PR 5 (Rig DO Alarm — witness patrol)
- PR 10 (Multiple Polecats)
Acceptance Criteria