feat(gastown): Graceful container eviction — SIGTERM draining with agent save-and-park #1441

@jrf0110

Summary

Cloudflare Containers send SIGTERM 15 minutes before SIGKILL on host server restarts. The current shutdown handler (control-server.ts:649-654) immediately aborts all sessions and exits. We should use the 15-minute window to let agents save their work gracefully and notify the TownDO so the reconciler pauses dispatch.

From Cloudflare docs:

When a container instance is going to be shut down, it is sent a SIGTERM signal, and then a SIGKILL signal after 15 minutes. You should perform any necessary cleanup to ensure a graceful shutdown in this time. The container instance will be rebooted elsewhere shortly after this.

Current Behavior

// control-server.ts:649-654
const shutdown = async () => {
  stopHeartbeat();
  await stopAll();      // Immediately aborts all sessions
  process.exit(0);
};
process.on("SIGTERM", () => void shutdown());

stopAll() aborts every session via session.abort(), sets all agents to exited, and kills SDK servers. No git push, no checkpoint save, no TownDO notification. Agents lose all uncommitted work.

Proposed Behavior

Phase 1: Notify TownDO (immediate, on SIGTERM)

The container sends POST /api/towns/:townId/container-eviction to the worker. The TownDO inserts a container_eviction event, and the reconciler processes that event as follows:

  • Sets a draining flag on the town (or container metadata)
  • Stops emitting dispatch_agent actions for new work
  • Does NOT interrupt currently running agents
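
A minimal sketch of the Phase 1 notification, assuming the worker is reachable at some base URL and accepts a JSON body. Only the route comes from the proposal; the names notifyEviction, evictionUrl, and PostFn, and the payload fields, are hypothetical. The call never throws, on the assumption that a failed notification should not block the rest of the drain.

```typescript
// Hypothetical minimal shape of an HTTP client; the global fetch should satisfy it.
type PostFn = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number }>;

// Build the eviction route from the proposal: /api/towns/:townId/container-eviction
function evictionUrl(baseUrl: string, townId: string): string {
  const trimmed = baseUrl.replace(/\/+$/, "");
  return `${trimmed}/api/towns/${encodeURIComponent(townId)}/container-eviction`;
}

// Returns true if the worker acknowledged the notification, false otherwise.
// Never throws: draining must proceed even when the worker is unreachable.
async function notifyEviction(
  post: PostFn,
  baseUrl: string,
  townId: string,
): Promise<boolean> {
  try {
    const res = await post(evictionUrl(baseUrl, townId), {
      method: "POST",
      headers: { "content-type": "application/json" },
      // Payload fields are an assumption; the proposal does not specify a body.
      body: JSON.stringify({ reason: "sigterm", receivedAt: Date.now() }),
    });
    return res.ok;
  } catch {
    return false;
  }
}
```

Injecting the HTTP client keeps the function testable without a live worker; in the container it could be called as notifyEviction(fetch, workerBaseUrl, townId), where workerBaseUrl is whatever configuration the container already uses to reach the worker.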

Phase 2: Nudge running agents to save and park (first 2 minutes)

For each running agent, inject a nudge message:

  • Polecats: "The container is shutting down. Please commit and push your current changes immediately, then call gt_done. You have 2 minutes."
  • Refinery: "The container is shutting down. If your review is complete, call gt_done now. Otherwise, your work will be saved and the review will resume after restart."
  • Mayor: No nudge needed — the mayor has no uncommitted work. Its conversation history is already in AgentDO events.
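
The per-role nudges above could be expressed as a small lookup. This is a sketch only: the AgentRole union, the RunningAgent interface, and its sendMessage method are hypothetical stand-ins for whatever the process manager and SDK actually expose. The message texts are copied from the list above.

```typescript
// Agent roles named in the proposal.
type AgentRole = "polecat" | "refinery" | "mayor";

// Returns the nudge text for a role, or null when no nudge is needed.
function nudgeMessageFor(role: AgentRole): string | null {
  switch (role) {
    case "polecat":
      return "The container is shutting down. Please commit and push your current changes immediately, then call gt_done. You have 2 minutes.";
    case "refinery":
      return "The container is shutting down. If your review is complete, call gt_done now. Otherwise, your work will be saved and the review will resume after restart.";
    case "mayor":
      // No uncommitted work; conversation history already lives in AgentDO events.
      return null;
  }
}

// Hypothetical agent handle; sendMessage stands in for the real injection call.
interface RunningAgent {
  id: string;
  role: AgentRole;
  sendMessage(text: string): void;
}

// Nudges every agent that needs it and returns the ids that were nudged.
function nudgeAll(agents: RunningAgent[]): string[] {
  const nudged: string[] = [];
  for (const agent of agents) {
    const message = nudgeMessageFor(agent.role);
    if (message === null) continue;
    agent.sendMessage(message);
    nudged.push(agent.id);
  }
  return nudged;
}
```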

Phase 3: Wait for agents to finish (up to 10 minutes)

Monitor running agents. As each one calls gt_done or finishes its current turn, it exits cleanly. The container waits until:

  • All agents have exited, OR
  • 10 minutes have elapsed since SIGTERM (leaving a 5-minute buffer before SIGKILL)
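
The wait in Phase 3 amounts to polling with a deadline. A minimal sketch, assuming the process manager can report how many agents are still running; waitForDrain, countRunning, and the injectable clock and sleep hooks are all hypothetical names introduced for illustration.

```typescript
// Resolves "drained" once no agents are running, or "timeout" once the budget
// is spent. The clock and sleep functions are injectable so the loop can be
// exercised without real delays.
async function waitForDrain(
  countRunning: () => number,
  budgetMs: number,
  pollMs: number,
  now: () => number = Date.now,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<"drained" | "timeout"> {
  const deadline = now() + budgetMs;
  while (countRunning() > 0) {
    const remaining = deadline - now();
    if (remaining <= 0) return "timeout";
    // Never sleep past the deadline, so force-save starts on time.
    await sleep(Math.min(pollMs, remaining));
  }
  return "drained";
}
```

A "timeout" result would hand the remaining agents to Phase 4. The budget passed in should be the time left until the 10-minute mark, not a fresh 10 minutes, if earlier phases have already consumed part of the window.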

Phase 4: Force save and exit (last 5 minutes)

For any agents still running after 10 minutes:

  • Force git add -A && git commit -m "WIP: container eviction save" && git push via the SDK's bash tool or direct git execution in the worktree
  • Abort the session
  • Report agentCompleted with a reason indicating eviction save
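
One caveat with chaining the three git commands using &&: git commit exits non-zero when there is nothing to commit, which would skip the push even if the agent has earlier commits that were never pushed. The sketch below shows one possible way around that, committing only when the tree is dirty and pushing regardless. The Run hook and forceSave are hypothetical; in Node the hook could wrap child_process.spawnSync.

```typescript
// Runs a command in a directory and returns its exit code and stdout.
// Hypothetical hook so the logic can be tested without touching a real repo.
type Run = (argv: string[], cwd: string) => { code: number; stdout: string };

// Force-saves a worktree: stage everything, commit only if there is something
// to commit, then push regardless so earlier unpushed commits are not lost.
function forceSave(run: Run, worktree: string): "pushed" | "failed" {
  if (run(["git", "add", "-A"], worktree).code !== 0) return "failed";

  // `git status --porcelain` prints nothing when the tree is clean.
  const status = run(["git", "status", "--porcelain"], worktree);
  if (status.code !== 0) return "failed";

  if (status.stdout.trim() !== "") {
    const commit = run(
      ["git", "commit", "-m", "WIP: container eviction save"],
      worktree,
    );
    if (commit.code !== 0) return "failed";
  }

  return run(["git", "push"], worktree).code === 0 ? "pushed" : "failed";
}
```

The return value could feed the reason reported with agentCompleted, so the TownDO can tell a successful eviction save from a failed one.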

Phase 5: Clean exit

  • stopAll() as today (cleanup SDK instances)
  • process.exit(0)
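
Taken together, the five phases could be sequenced by a single drain function. The sketch below is one possible shape, not the actual implementation: the DrainHooks interface and its method names are hypothetical placeholders for the real container code, with one hook per phase. It assumes a failure in an early phase should be logged and skipped so that stopAll() is always attempted.

```typescript
// Hypothetical hooks, one per phase; each would be backed by real container code.
interface DrainHooks {
  notifyTown(): Promise<void>;
  nudgeAgents(): Promise<void>;
  waitForAgents(): Promise<"drained" | "timeout">;
  forceSaveRemaining(): Promise<void>;
  stopAll(): Promise<void>;
}

// Runs the phases in order. A failing phase is logged and skipped, so the
// final cleanup still runs before the caller exits the process.
async function drainAll(
  hooks: DrainHooks,
  log: (msg: string) => void = console.error,
): Promise<void> {
  const attempt = async (name: string, step: () => Promise<void>): Promise<void> => {
    try {
      await step();
    } catch (err) {
      log(`drain phase ${name} failed: ${String(err)}`);
    }
  };

  await attempt("notify", () => hooks.notifyTown());
  await attempt("nudge", () => hooks.nudgeAgents());
  await attempt("wait-and-save", async () => {
    // Force-save only the agents that did not finish within the wait budget.
    if ((await hooks.waitForAgents()) === "timeout") {
      await hooks.forceSaveRemaining();
    }
  });
  await attempt("stop", () => hooks.stopAll());
}
```

The SIGTERM handler would then await drainAll(hooks) and call process.exit(0) afterwards, in place of the current direct call to stopAll().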

Reconciler Integration

The container_eviction event is processed by applyEvent:

case "container_eviction": {
  // Set draining flag — reconciler checks this before emitting dispatch_agent
  setTownDraining(sql, true);
  return;
}

reconcileBeads Rule 1 and reconcileReviewQueue Rule 5 check the draining flag:

if (isTownDraining(sql)) return []; // No new dispatches

After the container restarts and sends its first heartbeat, the TownDO clears the draining flag and the reconciler resumes normal dispatch. The container_status observation pre-phase in the alarm would then detect that agents running before the eviction are gone, set them to idle, and trigger normal recovery.
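
The draining lifecycle can be modelled in isolation to check the intended behaviour. This sketch replaces the SQL-backed state with a plain object; TownState, the container_heartbeat event name, and reconcileDispatch are hypothetical and only illustrate the flag being set on eviction, consulted before dispatch, and cleared on the first heartbeat.

```typescript
// Hypothetical in-memory stand-in for the town's SQL-backed state.
interface TownState {
  draining: boolean;
}

// container_eviction is from the proposal; container_heartbeat is an assumed name.
type TownEvent = { type: "container_eviction" } | { type: "container_heartbeat" };
type Action = { type: "dispatch_agent"; beadId: string };

function applyEvent(state: TownState, event: TownEvent): void {
  switch (event.type) {
    case "container_eviction":
      state.draining = true; // reconciler stops dispatching
      return;
    case "container_heartbeat":
      state.draining = false; // new container is up; resume dispatch
      return;
  }
}

// Emits one dispatch per ready bead, or nothing while the town is draining.
function reconcileDispatch(state: TownState, readyBeadIds: string[]): Action[] {
  if (state.draining) return [];
  return readyBeadIds.map((beadId): Action => ({ type: "dispatch_agent", beadId }));
}
```

This model only holds if the evicted container stops sending heartbeats once it receives SIGTERM, as the existing shutdown handler does by calling stopHeartbeat() first. Otherwise a heartbeat from the draining container would clear the flag prematurely.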

Combined with #1236 (session persistence via AgentDO event reconstruction), agents that were mid-work will be re-dispatched to the new container with their conversation history intact.

Timing Budget

T+0:00  SIGTERM received
T+0:01  Notify TownDO (draining), nudge agents to save
T+0:01  Agents begin saving (commit, push, gt_done)
T+2:00  Most agents have finished saving
T+10:00 Force-save remaining agents (git commit + push)
T+10:30 stopAll(), report completions
T+11:00 process.exit(0)
T+15:00 SIGKILL (we should be long gone)
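
The budget above could be kept as named constants so that every phase derives its deadline from the SIGTERM timestamp rather than from its own start time. A sketch under that assumption; BUDGET, deadlines, and safetyMarginMs are hypothetical names, and the values are taken from the table above.

```typescript
const MINUTE_MS = 60_000;

// Offsets from SIGTERM, taken from the timing budget.
const BUDGET = {
  nudgeWindowMs: 2 * MINUTE_MS, // most agents expected to have saved
  forceSaveAtMs: 10 * MINUTE_MS, // start force-saving stragglers
  exitByMs: 11 * MINUTE_MS, // process.exit(0)
  sigkillAtMs: 15 * MINUTE_MS, // Cloudflare sends SIGKILL
} as const;

// Absolute deadlines (epoch ms) for a drain that started at sigtermAt.
function deadlines(sigtermAt: number): {
  forceSaveAt: number;
  exitBy: number;
  sigkillAt: number;
} {
  return {
    forceSaveAt: sigtermAt + BUDGET.forceSaveAtMs,
    exitBy: sigtermAt + BUDGET.exitByMs,
    sigkillAt: sigtermAt + BUDGET.sigkillAtMs,
  };
}

// Margin between the planned exit and SIGKILL.
function safetyMarginMs(): number {
  return BUDGET.sigkillAtMs - BUDGET.exitByMs;
}
```

With these values the planned exit leaves a 4-minute margin before SIGKILL, which is the slack available if a force-save push is slow.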

Files

  • container/src/control-server.ts — SIGTERM handler (lines 649-657)
  • container/src/process-manager.ts — stopAll() (line 759), new drainAll() function
  • src/dos/town/reconciler.ts — draining flag check in dispatch rules
  • src/dos/town/events.ts — new container_eviction event type
  • src/dos/Town.do.ts — new /container-eviction endpoint

Labels: P1 (Should fix before soft launch), enhancement (New feature or request), gt:container (Container management, agent processes, SDK, heartbeat), gt:core (Reconciler, state machine, bead lifecycle, convoy flow), kilo-auto-fix, kilo-triaged