Summary
Cloudflare Containers send SIGTERM 15 minutes before SIGKILL on host server restarts. The current shutdown handler (control-server.ts:649-654) immediately aborts all sessions and exits. We should use the 15-minute window to let agents save their work gracefully and notify the TownDO so the reconciler pauses dispatch.
From Cloudflare docs:
When a container instance is going to be shut down, it is sent a SIGTERM signal, and then a SIGKILL signal after 15 minutes. You should perform any necessary cleanup to ensure a graceful shutdown in this time. The container instance will be rebooted elsewhere shortly after this.
Current Behavior
// control-server.ts:649-654
const shutdown = async () => {
  stopHeartbeat();
  await stopAll(); // Immediately aborts all sessions
  process.exit(0);
};
process.on("SIGTERM", () => void shutdown());
stopAll() aborts every session via session.abort(), sets all agents to exited, and kills SDK servers. No git push, no checkpoint save, no TownDO notification. Agents lose all uncommitted work.
Proposed Behavior
Phase 1: Notify TownDO (immediate, on SIGTERM)
Container sends POST /api/towns/:townId/container-eviction to the worker. The TownDO:
- Inserts a container_eviction event
- The reconciler processes this event and:
  - Sets a draining flag on the town (or container metadata)
  - Stops emitting dispatch_agent actions for new work
  - Does NOT interrupt currently running agents
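A minimal sketch of the Phase 1 notification from the container side. Only the route comes from this proposal; the `workerBaseUrl` parameter, the injected `fetchFn`, and the `{ reason: "sigterm" }` payload are assumptions for illustration.

```typescript
// Hypothetical sketch: tell the worker that this container is being evicted.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number }>;

// Builds the eviction route for a town, tolerating a trailing slash on the base URL.
function evictionUrl(workerBaseUrl: string, townId: string): string {
  const base = workerBaseUrl.replace(/\/+$/, "");
  return `${base}/api/towns/${encodeURIComponent(townId)}/container-eviction`;
}

async function notifyEviction(
  fetchFn: FetchLike,
  workerBaseUrl: string,
  townId: string,
): Promise<boolean> {
  try {
    const res = await fetchFn(evictionUrl(workerBaseUrl, townId), {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ reason: "sigterm" }), // assumed payload shape
    });
    return res.ok;
  } catch {
    // Best-effort: a failed notification must not block the rest of the drain.
    return false;
  }
}
```

Treating the call as best-effort keeps the drain moving even if the worker is unreachable; the reconciler would then only learn of the eviction once the agents disappear.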
Phase 2: Nudge running agents to save and park (first 2 minutes)
For each running agent, inject a nudge message:
- Polecats: "The container is shutting down. Please commit and push your current changes immediately, then call gt_done. You have 2 minutes."
- Refinery: "The container is shutting down. If your review is complete, call gt_done now. Otherwise, your work will be saved and the review will resume after restart."
- Mayor: No nudge needed — the mayor has no uncommitted work. Its conversation history is already in AgentDO events.
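The per-role nudges above could be selected with a small lookup. This is a sketch only: the `AgentRole` string values and the `nudgeFor` name are hypothetical, and the message text is copied from the list above.

```typescript
// Hypothetical role identifiers; the real agent type names may differ.
type AgentRole = "polecat" | "refinery" | "mayor";

// Returns the nudge to inject for a role, or null when no nudge is needed.
function nudgeFor(role: AgentRole): string | null {
  switch (role) {
    case "polecat":
      return "The container is shutting down. Please commit and push your current changes immediately, then call gt_done. You have 2 minutes.";
    case "refinery":
      return "The container is shutting down. If your review is complete, call gt_done now. Otherwise, your work will be saved and the review will resume after restart.";
    case "mayor":
      // The mayor has no uncommitted work; its history is already in AgentDO events.
      return null;
  }
}
```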
Phase 3: Wait for agents to finish (up to 10 minutes)
Monitor running agents. As each one calls gt_done or finishes its current turn, it exits cleanly. The container waits until:
- All agents have exited, OR
- 10 minutes have elapsed since SIGTERM (leaving a 5-minute buffer before SIGKILL)
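The wait condition above can be sketched as a polling loop. The `DrainDeps` interface, the 1-second poll interval, and the function names are assumptions; the clock and sleep are injected so the loop can be exercised without real timers.

```typescript
// Hypothetical dependencies, injected so the loop is testable.
interface DrainDeps {
  runningCount: () => number; // number of agents still running
  now: () => number; // current time in milliseconds
  sleep: (ms: number) => Promise<void>;
}

const WAIT_BUDGET_MS = 10 * 60 * 1000; // 10 minutes, measured from SIGTERM

// Keep waiting only while agents are running AND the budget is not exhausted.
function shouldKeepWaiting(
  running: number,
  elapsedMs: number,
  budgetMs: number = WAIT_BUDGET_MS,
): boolean {
  return running > 0 && elapsedMs < budgetMs;
}

async function waitForAgents(
  deps: DrainDeps,
  sigtermAtMs: number,
  pollMs: number = 1000,
): Promise<"all_exited" | "timed_out"> {
  while (shouldKeepWaiting(deps.runningCount(), deps.now() - sigtermAtMs)) {
    await deps.sleep(pollMs);
  }
  return deps.runningCount() === 0 ? "all_exited" : "timed_out";
}
```

A `"timed_out"` result is the signal to move on to the force-save phase for whichever agents remain.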
Phase 4: Force save and exit (last 5 minutes)
For any agents still running after 10 minutes:
- Force git add -A && git commit -m "WIP: container eviction save" && git push via the SDK's bash tool or direct git execution in the worktree
- Abort the session
- Report agentCompleted with a reason indicating eviction save
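A sketch of the force-save step using direct git execution. The `GitRunner` type and `forceSave` name are hypothetical; the runner is injected so the sequence can be checked without a real repository. One deviation from the one-liner above is labeled in the comments: the commit is skipped when nothing is staged, because `git commit` exits non-zero on an empty commit and would otherwise prevent the push.

```typescript
// Runs `git <args>` in the agent's worktree, returns stdout, throws on a non-zero exit.
type GitRunner = (args: string[]) => string;

// Returns true if a WIP commit was created, false if the worktree was already clean.
function forceSave(
  git: GitRunner,
  message: string = "WIP: container eviction save",
): boolean {
  git(["add", "-A"]);
  // `git status --porcelain` prints nothing when there is nothing to commit.
  const dirty = git(["status", "--porcelain"]).trim() !== "";
  if (dirty) {
    git(["commit", "-m", message]);
  }
  // Push even when clean: earlier local commits may not have been pushed yet.
  git(["push"]);
  return dirty;
}
```

In the container, the runner could wrap `execFileSync("git", args, { cwd: worktree, encoding: "utf8" })` from `node:child_process`. Note that `git push` still fails if the branch has no upstream configured, so the caller should catch and report that case rather than let it stall the shutdown.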
Phase 5: Clean exit
- stopAll() as today (clean up SDK instances)
- process.exit(0)
Reconciler Integration
The container_eviction event is processed by applyEvent:
case "container_eviction": {
  // Set draining flag: reconciler checks this before emitting dispatch_agent
  setTownDraining(sql, true);
  return;
}
reconcileBeads Rule 1 and reconcileReviewQueue Rule 5 check the draining flag:
if (isTownDraining(sql)) return []; // No new dispatches
After the container restarts and sends its first heartbeat, the TownDO clears the draining flag and the reconciler resumes normal dispatch. The container_status observation pre-phase in the alarm would then detect that the agents running before the eviction are gone, set them to idle, and trigger normal recovery.
Combined with #1236 (session persistence via AgentDO event reconstruction), agents that were mid-work will be re-dispatched to the new container with their conversation history intact.
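The draining lifecycle described in this section can be sketched end to end with an in-memory stand-in for the town state. Everything here is illustrative: the real flag would live in the TownDO's SQL storage behind setTownDraining / isTownDraining, and the `container_heartbeat` event name, `TownState` shape, and `reconcileDispatch` function are hypothetical.

```typescript
// In-memory stand-in for the town state (assumption, not the real schema).
interface TownState {
  draining: boolean;
  pendingWork: string[];
}

type TownEvent =
  | { type: "container_eviction" }
  | { type: "container_heartbeat" }; // hypothetical name for the first heartbeat after restart

function applyEvent(state: TownState, event: TownEvent): TownState {
  switch (event.type) {
    case "container_eviction":
      return { ...state, draining: true };
    case "container_heartbeat":
      return { ...state, draining: false };
  }
}

type DispatchAction = { type: "dispatch_agent"; workId: string };

// Emits no new dispatches while the town is draining.
function reconcileDispatch(state: TownState): DispatchAction[] {
  if (state.draining) return [];
  return state.pendingWork.map(
    (workId): DispatchAction => ({ type: "dispatch_agent", workId }),
  );
}
```

This sketch clears the flag on any heartbeat. A real implementation would probably need to tell the replacement container's heartbeat apart from a late heartbeat sent by the container that is still draining.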
Timing Budget
T+0:00 SIGTERM received
T+0:01 Notify TownDO (draining), nudge agents to save
T+0:01 Agents begin saving (commit, push, gt_done)
T+2:00 Most agents have finished saving
T+10:00 Force-save remaining agents (git commit + push)
T+10:30 stopAll(), report completions
T+11:00 process.exit(0)
T+15:00 SIGKILL (we should be long gone)
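The budget above can be encoded as a small lookup from elapsed time to phase, which makes the thresholds easy to review in one place. The phase names and the `phaseAt` function are hypothetical; the thresholds are taken from the timeline.

```typescript
type DrainPhase =
  | "notify_and_nudge"
  | "wait_for_agents"
  | "force_save"
  | "cleanup"
  | "exit";

const SIGKILL_AT_SEC = 15 * 60; // SIGKILL arrives 15 minutes after SIGTERM
const EXIT_AT_SEC = 11 * 60; // planned process.exit(0)

// Maps seconds elapsed since SIGTERM to the phase the drain should be in.
function phaseAt(elapsedSec: number): DrainPhase {
  if (elapsedSec < 1) return "notify_and_nudge"; // T+0:00 to T+0:01
  if (elapsedSec < 10 * 60) return "wait_for_agents"; // until T+10:00
  if (elapsedSec < 10 * 60 + 30) return "force_save"; // until T+10:30
  if (elapsedSec < EXIT_AT_SEC) return "cleanup"; // stopAll(), report completions
  return "exit";
}

// Safety margin between the planned exit and SIGKILL, in seconds.
const marginSec = SIGKILL_AT_SEC - EXIT_AT_SEC;
```

With these numbers the planned exit leaves a 4-minute margin before SIGKILL.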
Files
container/src/control-server.ts — SIGTERM handler (lines 649-657)
container/src/process-manager.ts — stopAll() (line 759), new drainAll() function
src/dos/town/reconciler.ts — draining flag check in dispatch rules
src/dos/town/events.ts — new container_eviction event type
src/dos/Town.do.ts — new /container-eviction endpoint