[Gastown] PR 18: Container Resilience — Checkpoint/Restore #269

Parent: #204 | Phase 3: Multi-Rig + Scaling

## Goal

Handle the ephemeral disk problem. When a container sleeps or dies, in-flight state must be recoverable from Durable Object (DO) state and remote git branches.

## Background

Cloudflare Containers have an **ephemeral disk**: when a container sleeps or restarts, all filesystem state (git repos, worktrees, node_modules) is lost. All coordination state lives in DOs, whose storage persists independently of the container, so the main recovery concern is git state.

## Strategy

### 1. Git State Recovery

On container start, the control server reads the rig registry from the Town DO and agent state from each Rig DO to determine which rigs need repos cloned and which agents need worktrees:

```
Container starts → control server boots
→ Reads rig registry from Town DO
→ For each rig with active agents:
  → Clone repo (or pull if warm)
  → Create worktrees for active agent branches (branches exist on remote)
→ Report ready to DO
→ DO alarm dispatches pending agents
```
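The sequence above could look roughly like the TypeScript below. This is a minimal sketch under stated assumptions: `TownClient`, `RigClient`, `GitOps`, `recoverOnStartup`, and the agent `status` values are hypothetical, since this issue does not define the control server's DO calls. The flow also does not say which DO receives the ready report; the sketch assumes the Town DO. Injecting the clients keeps the ordering testable without a live container.

```typescript
// Hypothetical record shapes: the issue does not define the DO RPC surface.
interface RigInfo {
  rigId: string;
  repoUrl: string;
}
interface AgentInfo {
  agentId: string;
  branch: string; // assumed to already exist on the remote
  status: "active" | "idle" | "done"; // hypothetical status values
}

// Hypothetical clients standing in for Town DO / Rig DO calls and local git commands.
interface TownClient {
  listRigs(): Promise<RigInfo[]>; // rig registry
  reportReady(): Promise<void>; // assumption: the ready report goes to the Town DO
}
interface RigClient {
  listAgents(rigId: string): Promise<AgentInfo[]>;
}
interface GitOps {
  hasRepo(rigId: string): boolean; // true if the clone survived (warm container)
  clone(rigId: string, repoUrl: string): Promise<void>;
  pull(rigId: string): Promise<void>;
  // e.g. `git worktree add --track -b <branch> <path> origin/<branch>`
  addWorktree(rigId: string, agentId: string, branch: string): Promise<void>;
}

// Rebuilds on-disk git state from DO state; returns the "rig/agent" worktrees restored.
async function recoverOnStartup(town: TownClient, rigs: RigClient, git: GitOps): Promise<string[]> {
  const restored: string[] = [];
  for (const rig of await town.listRigs()) {
    const agents = await rigs.listAgents(rig.rigId);
    const active = agents.filter((a) => a.status === "active");
    if (active.length === 0) continue; // no active agents: skip the clone entirely

    if (git.hasRepo(rig.rigId)) {
      await git.pull(rig.rigId);
    } else {
      await git.clone(rig.rigId, rig.repoUrl);
    }

    for (const agent of active) {
      await git.addWorktree(rig.rigId, agent.agentId, agent.branch);
      restored.push(`${rig.rigId}/${agent.agentId}`);
    }
  }
  await town.reportReady(); // only after every worktree exists
  return restored;
}
```

Reporting ready only after the loop matches the flow: the DO alarm should not dispatch an agent whose worktree does not exist yet.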

### 2. Uncommitted Work Protection

Agents should commit and push frequently. The polecat system prompt instructs them to:
- Commit after meaningful progress (not just at `gt_done`)
- Push their branch to the remote after each commit
- Use `gt_checkpoint` to write recovery metadata to the DO
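Expressed as code, one unit of "commit, then push" from the list above might look like this sketch. `RunGit`, `CommitResult`, and `commitAndPush` are hypothetical; the git subcommands themselves (`status --porcelain`, `add -A`, `commit -m`, `push origin HEAD:refs/heads/<branch>`, `rev-parse HEAD`) are standard git CLI. The runner is injected so the sequence can be tested without a repository; in Node it could wrap `child_process.execFileSync("git", args, { cwd, encoding: "utf8" })`.

```typescript
// Hypothetical runner: executes `git <args>` in the agent's worktree, returns stdout,
// and throws on a non-zero exit.
type RunGit = (args: string[]) => string;

interface CommitResult {
  committed: boolean; // false when the tree was already clean
  head: string; // commit now on the remote branch
}

function commitAndPush(runGit: RunGit, branch: string, message: string): CommitResult {
  // `status --porcelain` prints one line per changed or untracked path, nothing when clean.
  const dirty = runGit(["status", "--porcelain"]).trim().length > 0;
  if (dirty) {
    runGit(["add", "-A"]); // stage everything, including new files
    runGit(["commit", "-m", message]);
  }
  // Push even when nothing new was committed, so earlier unpushed commits reach the remote.
  runGit(["push", "origin", `HEAD:refs/heads/${branch}`]);
  const head = runGit(["rev-parse", "HEAD"]).trim();
  return { committed: dirty, head };
}
```

Pushing on a clean tree is deliberate: a commit made before a failed push would otherwise stay local and be lost with the disk. The returned `head` is one candidate for the recovery metadata passed to `gt_checkpoint`.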

### 3. Checkpoint/Restore via DO

The `gt_checkpoint` tool writes JSON to the DO's agent record. On restart, `gt_prime` includes the checkpoint in the agent's context so it can resume from where it left off.
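A minimal sketch of that round trip, assuming the agent record carries an optional `checkpoint` field holding the serialized JSON. `AgentRecord`, `AgentStore`, `gtCheckpoint`, and `buildPrimeContext` are hypothetical names, the `Map` stands in for the DO's persistent storage, and the prompt wording is illustrative only.

```typescript
// Hypothetical agent record; the real schema is owned by the DO.
interface AgentRecord {
  agentId: string;
  branch: string;
  checkpoint?: { json: string; savedAt: string };
}

// In-memory stand-in for DO storage, keyed by agent ID.
type AgentStore = Map<string, AgentRecord>;

// gt_checkpoint: replace the agent's checkpoint with caller-supplied JSON-serializable data.
function gtCheckpoint(
  store: AgentStore,
  agentId: string,
  data: Record<string, unknown>,
  now: Date = new Date(),
): void {
  const agent = store.get(agentId);
  if (!agent) throw new Error(`unknown agent: ${agentId}`);
  store.set(agentId, {
    ...agent,
    checkpoint: { json: JSON.stringify(data), savedAt: now.toISOString() },
  });
}

// gt_prime: base prompt plus, if present, the last checkpoint so the agent can resume.
function buildPrimeContext(store: AgentStore, agentId: string, basePrompt: string): string {
  const agent = store.get(agentId);
  if (!agent) throw new Error(`unknown agent: ${agentId}`);
  if (!agent.checkpoint) return basePrompt; // fresh start: nothing to resume
  return [
    basePrompt,
    "",
    "## Recovery checkpoint",
    `Your container restarted. Branch ${agent.branch} was restored from the remote.`,
    `Last checkpoint (saved ${agent.checkpoint.savedAt}):`,
    agent.checkpoint.json,
  ].join("\n");
}
```

Because the checkpoint lives in the DO rather than on the container's disk, it survives exactly the sleep or restart that wipes the worktree.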

### 4. Proactive Git Push

The polecat system prompt instructs agents to push their branch after meaningful progress, not just at `gt_done`. This ensures the remote has the latest state for recovery.
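Whether the remote really holds the latest state can be checked from `git status --porcelain=v2 --branch`: its `# branch.upstream` and `# branch.ab +<ahead> -<behind>` header lines report tracking status, and the other lines describe changed, unmerged, untracked, or (with `--ignored`) ignored paths. The sketch below is an assumption about how the container might use that output; `recoverySafety` and `RecoverySafety` are hypothetical, and ignored (`!`) lines are skipped.

```typescript
// Hypothetical summary of whether a worktree can be lost without losing work.
interface RecoverySafety {
  branch: string | null; // null when HEAD is detached
  hasUpstream: boolean;
  ahead: number; // local commits not yet on the remote
  dirtyPaths: number; // modified, staged, unmerged, or untracked entries
  safeToLose: boolean; // true only when the remote already has everything
}

// Parses `git status --porcelain=v2 --branch` output.
function recoverySafety(porcelainV2: string): RecoverySafety {
  let branch: string | null = null;
  let hasUpstream = false;
  let ahead = 0;
  let dirtyPaths = 0;
  for (const line of porcelainV2.split("\n")) {
    if (line === "") continue;
    if (line.startsWith("# branch.head ")) {
      const head = line.slice("# branch.head ".length);
      branch = head === "(detached)" ? null : head;
    } else if (line.startsWith("# branch.upstream ")) {
      hasUpstream = true;
    } else if (line.startsWith("# branch.ab ")) {
      const match = /^# branch\.ab \+(\d+) -(\d+)$/.exec(line);
      if (match) ahead = Number(match[1]);
    } else if (line.startsWith("#") || line.startsWith("! ")) {
      continue; // other headers and ignored files do not affect safety
    } else {
      dirtyPaths += 1; // "1 ", "2 ", "u " and "? " entries
    }
  }
  const safeToLose = hasUpstream && ahead === 0 && dirtyPaths === 0;
  return { branch, hasUpstream, ahead, dirtyPaths, safeToLose };
}
```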

## Dependencies

- PR 4 (Town Container)
- PR 5 (Rig DO Alarm)
- PR 9 (Town DO — rig registry)

## Acceptance Criteria

- [ ] Container startup sequence reads DO state and restores git environment
- [ ] Active agent worktrees re-created from remote branches on restart
- [ ] `gt_checkpoint` data included in `gt_prime` context for recovery
- [ ] System prompt updates instructing frequent commit/push
- [ ] Container health endpoint reports recovery progress
- [ ] Integration test: container sleep → wake → agents resume work
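For the health-endpoint criterion, a minimal tracker like the one below could be updated by the startup routine and serialized as the endpoint's JSON body. The class, the phase names, and the body shape are hypothetical; the issue only requires that recovery progress be reported.

```typescript
// Hypothetical recovery phases; the issue only requires that progress be reported.
type RecoveryPhase = "booting" | "restoring" | "ready" | "failed";

// Tracks startup recovery so a health handler can report it.
class RecoveryProgress {
  private phase: RecoveryPhase = "booting";
  private total = 0;
  private restored = 0;
  private readonly errors: string[] = [];

  start(totalWorktrees: number): void {
    this.phase = "restoring";
    this.total = totalWorktrees;
  }

  worktreeRestored(): void {
    this.restored += 1;
  }

  fail(message: string): void {
    this.phase = "failed";
    this.errors.push(message);
  }

  finish(): void {
    if (this.phase !== "failed") this.phase = "ready"; // a failure is sticky
  }

  // JSON-serializable body for the health endpoint.
  snapshot(): { phase: RecoveryPhase; worktrees: { restored: number; total: number }; errors: string[] } {
    return {
      phase: this.phase,
      worktrees: { restored: this.restored, total: this.total },
      errors: [...this.errors],
    };
  }
}
```

Making `failed` sticky means a caller polling the endpoint never sees `ready` after a partial recovery.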
