Parent: #204 | Phase 3: Multi-Rig + Scaling
Goal
Handle the ephemeral disk problem. When a container sleeps or dies, in-flight state must be recoverable from DO state and remote git branches.
Background
Cloudflare Containers have ephemeral disk — when a container sleeps or restarts, all filesystem state (git repos, worktrees, node_modules) is lost. Since all coordination state lives in DOs, the main recovery concern is git state.
Strategy
1. Git State Recovery
On container start, the control server reads Rig DO state to determine which rigs need repos cloned and which agents need worktrees:
Container starts → control server boots
  → Reads rig registry from Town DO
  → For each rig with active agents:
      → Clone repo (or pull if warm)
      → Create worktrees for active agent branches (branches exist on remote)
  → Report ready to DO
  → DO alarm dispatches pending agents
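The flow above could be sketched as a pure planning step that turns the rig registry into an ordered list of git operations. This is only an illustration: the `RigRecord` and `AgentRecord` shapes, the `/workspace` layout, and the `planRecovery` and `toGitArgs` helpers are all assumptions, since the actual Town DO and Rig DO schemas are not shown in this issue.

```typescript
// Hypothetical registry shapes; the real DO schemas may differ.
interface AgentRecord { id: string; branch: string; active: boolean }
interface RigRecord { id: string; repoUrl: string; agents: AgentRecord[] }

type Step =
  | { kind: "clone"; repoUrl: string; dir: string }
  | { kind: "fetch"; dir: string }
  | { kind: "worktree"; dir: string; branch: string; path: string };

// Build the recovery plan: one clone (or fetch, if the repo directory
// survived) per rig with active agents, then one worktree per active agent.
function planRecovery(
  rigs: RigRecord[],
  warmDirs: Set<string>,
  root: string = "/workspace", // assumed on-disk layout
): Step[] {
  const steps: Step[] = [];
  for (const rig of rigs) {
    const active = rig.agents.filter((a) => a.active);
    if (active.length === 0) continue; // nothing to restore for this rig
    const dir = `${root}/${rig.id}/repo`;
    if (warmDirs.has(dir)) {
      steps.push({ kind: "fetch", dir });
    } else {
      steps.push({ kind: "clone", repoUrl: rig.repoUrl, dir });
    }
    for (const agent of active) {
      steps.push({
        kind: "worktree",
        dir,
        branch: agent.branch,
        path: `${root}/${rig.id}/worktrees/${agent.id}`,
      });
    }
  }
  return steps;
}

// Translate a step into arguments for the git CLI.
function toGitArgs(step: Step): string[] {
  switch (step.kind) {
    case "clone":
      return ["clone", step.repoUrl, step.dir];
    case "fetch":
      return ["-C", step.dir, "fetch", "origin"];
    case "worktree":
      // With no local branch of that name, git creates one tracking the
      // matching remote branch, which relies on the branch having been pushed.
      return ["-C", step.dir, "worktree", "add", step.path, step.branch];
  }
}
```

Keeping the plan separate from execution makes the recovery order easy to test without a container; the control server would run each `toGitArgs` result and only then report ready to the DO.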
2. Uncommitted Work Protection
Agents should commit and push frequently. The polecat system prompt instructs:
- Commit after meaningful progress (not just at gt_done)
- Push branch to remote after each commit
- Use gt_checkpoint to write recovery metadata to the DO
3. Checkpoint/Restore via DO
The gt_checkpoint tool writes JSON to the DO's agent record. On restart, gt_prime includes the checkpoint in the agent's context so it can resume from where it left off.
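A minimal sketch of this round trip, under stated assumptions: the `Checkpoint` fields, the `InMemoryAgentStore` class (standing in for the DO's agent record), and the `gtCheckpoint` and `gtPrime` function signatures are hypothetical. The issue only says that the checkpoint is JSON and that gt_prime includes it in the agent's context.

```typescript
// Hypothetical checkpoint payload; the issue does not define its fields.
interface Checkpoint {
  branch: string;
  lastCommit: string;
  summary: string;
  savedAt: string; // ISO 8601 timestamp
}

interface AgentState { id: string; checkpoint?: Checkpoint }

// In-memory stand-in for the DO's agent records.
class InMemoryAgentStore {
  private records = new Map<string, AgentState>();

  get(id: string): AgentState | undefined {
    return this.records.get(id);
  }

  put(state: AgentState): void {
    this.records.set(state.id, state);
  }
}

// Sketch of gt_checkpoint: overwrite the agent's checkpoint with the latest one.
function gtCheckpoint(
  store: InMemoryAgentStore,
  agentId: string,
  data: Omit<Checkpoint, "savedAt">,
  now: Date = new Date(),
): Checkpoint {
  const checkpoint: Checkpoint = { ...data, savedAt: now.toISOString() };
  const existing = store.get(agentId) ?? { id: agentId };
  store.put({ ...existing, checkpoint });
  return checkpoint;
}

// Sketch of gt_prime: append the stored checkpoint, if any, to the base context.
function gtPrime(
  store: InMemoryAgentStore,
  agentId: string,
  baseContext: string,
): string {
  const checkpoint = store.get(agentId)?.checkpoint;
  if (!checkpoint) return baseContext; // fresh agent, nothing to resume
  return [
    baseContext,
    "Recovered checkpoint from a previous container session:",
    JSON.stringify(checkpoint, null, 2),
  ].join("\n\n");
}
```

Storing only the most recent checkpoint keeps the agent record small; whether a history of checkpoints is worth keeping is a separate design choice.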
4. Proactive Git Push
The polecat system prompt instructs agents to push their branch after meaningful progress, not just at gt_done. This ensures the remote has the latest state for recovery.
Dependencies
- PR 4 (Town Container)
- PR 5 (Rig DO Alarm)
- PR 9 (Town DO — rig registry)
Acceptance Criteria
- gt_checkpoint data included in gt_prime context for recovery