Parent
Part of #204 (Phase 3: Multi-Rig + Scaling)
Problem
When the container restarts (deploy, eviction, crash, sleep/wake), all agent sessions lose their entire conversation history. The Mayor is the worst-hit — a user who has been chatting with the Mayor for an hour suddenly gets a fresh session with zero memory of what was discussed. Polecats are more forgiving (focused tasks), but even they lose context about partial work and prior reasoning.
Currently, sendMayorMessage re-dispatches the Mayor with checkpoint: null (Town.do.ts:1810) — it does not even read the Mayor's checkpoint, let alone restore conversation history. The new session gets only the user's current message as its initial prompt.
What's Preserved vs. Lost Today
Preserved in DOs (survives restarts):
- Agent metadata, bead data, checkpoints — TownDO SQLite
- SDK streaming events (message.created, message.completed, message_part.updated, assistant.completed) — AgentDO rig_agent_events table
- Bead events, mail, review queue, convoys — TownDO SQLite
Lost in container memory:
- Full conversation history (all user/assistant/tool messages)
- SDK session state
- Active tool call state
- In-progress reasoning
Key Insight: AgentDO Already Has the Data
The AgentDO stores SDK streaming events that contain message content. These events include message.created, message.completed, message_part.updated, and assistant.completed — the raw material to reconstruct conversation turns. The 10,000 event cap is generous (a typical Mayor session produces ~5-20 events per turn, so ~500-2000 events for a 100-turn conversation).
The events are streaming deltas, not clean {role, content} turns, but they can be reassembled.
Solution
Three tiers: a quick fix (context injection from existing data), graceful eviction handling (save work during the SIGTERM window), and the long-term strategy (KV-backed persistence so nothing is ever lost).
Tier 1: Context injection from AgentDO events (quick fix)
The AgentDO already stores SDK streaming events with message content. Reconstruct the conversation from these events and inject it into the new session on re-dispatch.
1a. Conversation reconstruction function
Add a reconstructConversation(agentId) function that:
- Queries AgentDO.getEvents() for the agent's last session
- Filters for message-type events (message.created, message.completed, message_part.updated)
- Groups events by message boundaries (using message.created → message.completed as delimiters)
- Reassembles streaming deltas into complete {role: 'user'|'assistant', content: string} turns
- Returns a conversation transcript (array of turns or formatted text)
This does not need to be perfect — the goal is semantic continuity, not byte-level replay.
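A minimal sketch of that reassembly loop, assuming each stored event exposes a type, an optional role, and delta text on message_part.updated — the real rig_agent_events payload shape may differ:

```ts
// Sketch only — the event payload fields (type, role, text) are assumptions,
// not the actual rig_agent_events schema.
type AgentEvent = {
  type: string;
  role?: "user" | "assistant";
  text?: string; // delta text carried by message_part.updated (assumed)
};

type Turn = { role: "user" | "assistant"; content: string };

function reconstructConversation(events: AgentEvent[]): Turn[] {
  const turns: Turn[] = [];
  let current: Turn | null = null;

  for (const ev of events) {
    switch (ev.type) {
      case "message.created":
        // A message boundary opens a fresh turn.
        current = { role: ev.role ?? "assistant", content: "" };
        break;
      case "message_part.updated":
        // Streaming deltas are appended to the open turn.
        if (current && ev.text) current.content += ev.text;
        break;
      case "message.completed":
        // Boundary closes: flush the assembled turn.
        if (current) {
          turns.push(current);
          current = null;
        }
        break;
      default:
        break; // assistant.completed and tool events are ignored here
    }
  }
  // Tolerate a trailing unterminated message (e.g. interrupted mid-stream).
  if (current?.content) turns.push(current);
  return turns;
}
```

Treating message.created → message.completed as the turn boundary keeps the logic tolerant of duplicated or missing deltas, which is all that semantic continuity requires.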
1b. Context injection on re-dispatch
When sendMayorMessage detects the Mayor needs re-dispatch (container restarted, isAlive = false):
- Call reconstructConversation(mayorAgentId) to get the prior transcript
- Include the transcript in the initial prompt as prior conversation context
- The Mayor continues with awareness of what was discussed
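For illustration, the injection could look roughly like this; formatPriorContext and the prompt layout are assumptions, not existing code, and Turn is the shape produced by the reconstruction sketch above:

```ts
// Hypothetical glue for the re-dispatch path — the helper and prompt layout are
// illustrative only.
function formatPriorContext(turns: Turn[], maxTurns = 50): string {
  const recent = turns.slice(-maxTurns);
  const lines = recent.map((t) => `${t.role === "user" ? "User" : "Mayor"}: ${t.content}`);
  return [
    "Prior conversation, restored after a container restart:",
    ...lines,
    "--- end of prior conversation ---",
  ].join("\n");
}

// On re-dispatch, prepend the restored transcript to the user's new message:
//   const turns = reconstructConversation(await agentDO.getEvents(mayorAgentId));
//   const initialPrompt = `${formatPriorContext(turns)}\n\n${userMessage}`;
```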
1c. Fix Mayor checkpoint propagation
sendMayorMessage at Town.do.ts:1810 passes checkpoint: null. Change to read the Mayor's checkpoint:
```ts
const checkpoint = agents.readCheckpoint(this.sql, mayorAgent.id);
```
This is a one-line fix that should be done immediately, independent of the larger work.
Context window management
- Truncate to last N turns — keep the most recent conversation (e.g., last 50 turns)
- Summarize older turns — use an LLM call to summarize the first half into a paragraph, keep the recent half verbatim
- Token budget — set a max token budget for the restored transcript (e.g., 20% of the model's context window) and truncate/summarize to fit
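A sketch of the token-budget option; the 4-characters-per-token heuristic and the 20% default fraction are assumptions to be tuned, not measured values:

```ts
// Keep the most recent turns that fit within a fraction of the model's context window.
function fitToTokenBudget(turns: Turn[], contextWindowTokens: number, fraction = 0.2): Turn[] {
  const budgetChars = Math.floor(contextWindowTokens * fraction) * 4; // rough chars-per-token estimate
  const kept: Turn[] = [];
  let used = 0;
  // Walk backwards so the most recent turns are kept verbatim.
  for (let i = turns.length - 1; i >= 0; i--) {
    const cost = turns[i].content.length;
    if (used + cost > budgetChars) break;
    kept.unshift(turns[i]);
    used += cost;
  }
  return kept;
}
```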
Tier 1.5: Graceful container eviction — SIGTERM draining with agent save-and-park
Cloudflare Containers send SIGTERM 15 minutes before SIGKILL on host server restarts. The current shutdown handler (control-server.ts:649-654) immediately aborts all sessions and exits. We should use the 15-minute window to let agents save their work gracefully and notify the TownDO so the reconciler pauses dispatch.
From Cloudflare docs:
When a container instance is going to be shut down, it is sent a SIGTERM signal, and then a SIGKILL signal after 15 minutes. You should perform any necessary cleanup to ensure a graceful shutdown in this time. The container instance will be rebooted elsewhere shortly after this.
Current Behavior
```ts
// control-server.ts:649-654
const shutdown = async () => {
  stopHeartbeat();
  await stopAll(); // Immediately aborts all sessions
  process.exit(0);
};
process.on("SIGTERM", () => void shutdown());
```
stopAll() aborts every session via session.abort(), sets all agents to exited, and kills SDK servers. No git push, no checkpoint save, no TownDO notification.
Proposed SIGTERM Handling
Phase 1: Notify TownDO (immediate, on SIGTERM)
Container sends POST /api/towns/:townId/container-eviction to the worker. The TownDO:
- Inserts a container_eviction event
- The reconciler processes this event and sets a draining flag
- Reconciler stops emitting dispatch_agent actions for new work
- Does NOT interrupt currently running agents
Phase 2: Nudge running agents to save and park (first 2 minutes)
For each running agent, inject a nudge message:
- Polecats: "The container is shutting down. Please commit and push your current changes immediately, then call gt_done. You have 2 minutes."
- Refinery: "The container is shutting down. If your review is complete, call gt_done now. Otherwise, your work will be saved and the review will resume after restart."
- Mayor: No nudge needed — conversation history is already in AgentDO events.
Phase 3: Wait for agents to finish (up to 10 minutes)
Monitor running agents. As each one calls gt_done or finishes, it exits cleanly. Wait until all agents have exited OR 10 minutes have elapsed (leaving 5 min buffer before SIGKILL).
Phase 4: Force save and exit (last 5 minutes)
For any agents still running after 10 minutes:
- Force git add -A && git commit -m "WIP: container eviction save" && git push
- Abort the session
- Report agentCompleted with a reason indicating eviction save
Phase 5: Clean exit — stopAll() as today, then process.exit(0).
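Putting the phases together, the replacement handler might be sequenced like this; notifyEviction, nudgeRunningAgents, waitForAgentsToExit, and forceSaveRemaining are hypothetical helpers — only stopHeartbeat and stopAll exist today:

```ts
// Sequencing sketch for the drained shutdown — helper names and signatures are illustrative.
const MINUTE = 60_000;

const shutdown = async () => {
  stopHeartbeat();
  await notifyEviction();                                  // Phase 1: TownDO sets the draining flag
  await nudgeRunningAgents();                              // Phase 2: ask agents to save and call gt_done
  const allDone = await waitForAgentsToExit(10 * MINUTE);  // Phase 3: wait until T+10:00 at most
  if (!allDone) await forceSaveRemaining();                // Phase 4: WIP commit + push, abort, report
  await stopAll();                                         // Phase 5: tear down SDK servers as today
  process.exit(0);
};
process.on("SIGTERM", () => void shutdown());
```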
Timing Budget
T+0:00 SIGTERM received
T+0:01 Notify TownDO (draining), nudge agents to save
T+0:01 Agents begin saving (commit, push, gt_done)
T+2:00 Most agents have finished saving
T+10:00 Force-save remaining agents (git commit + push)
T+10:30 stopAll(), report completions
T+11:00 process.exit(0)
T+15:00 SIGKILL (we should be long gone)
Reconciler Integration
applyEvent('container_eviction') sets a draining flag. reconcileBeads Rule 1 and reconcileReviewQueue Rule 5 check it before emitting dispatch_agent. After the container restarts and sends its first heartbeat, the TownDO clears the draining flag and dispatch resumes.
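A sketch of that gate, assuming the reconciler keeps a simple state object; the actual rule and event signatures may differ, and the heartbeat event name is assumed:

```ts
// Illustrative only — not the reconciler's real state shape.
interface ReconcilerState {
  draining: boolean;
}

function applyEvent(state: ReconcilerState, event: { type: string }): ReconcilerState {
  if (event.type === "container_eviction") return { ...state, draining: true };
  if (event.type === "container_heartbeat") return { ...state, draining: false }; // first heartbeat after restart (event name assumed)
  return state;
}

// reconcileBeads Rule 1 / reconcileReviewQueue Rule 5 would consult this gate
// before emitting a dispatch_agent action.
function shouldDispatch(state: ReconcilerState): boolean {
  return !state.draining;
}
```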
Tier 2: KV-backed persistence — don't rely on ephemeral disk (long-term strategy)
The Tier 1 approach reconstructs state from streaming deltas, which is fragile and lossy. The proper fix is to never lose the state in the first place: persist the full session state instead of leaving it only in ephemeral container disk and memory.
Design principle: Flush the SQLite Blob to Cloudflare KV
Cloudflare Container disks are ephemeral — anything written is lost on eviction. While Durable Objects are durable and globally consistent, their SQLite rows have a 2MB size limit. The agent's SDK SQLite database (kilo.db) can grow larger than 2MB. Therefore, we will use Cloudflare KV, which has a 25MB value limit, to store the database blob.
Instead of intercepting individual Drizzle ORM queries, we can use a Write-behind cache (Option C) approach by treating the agent's SDK SQLite database (kilo.db) as a binary blob and syncing the entire file to Cloudflare KV.
1. Isolate the DB per Agent
Currently, agents running in the same container share the same kilo.db file in ~/.local/share/kilo/kilo.db. The kilocode CLI supports overriding the home directory via the KILO_TEST_HOME environment variable (thanks to a custom patch).
We inject KILO_TEST_HOME=/tmp/agent-home-${agentId} into the agent's environment payload during buildAgentEnv() / startAgent().
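A sketch of the injection; buildAgentEnv() exists in the container code, but this signature is illustrative and KILO_TEST_HOME is the only variable the kilocode patch actually reads:

```ts
// Sketch — the surrounding env handling is illustrative.
function buildAgentEnv(agentId: string, baseEnv: Record<string, string>): Record<string, string> {
  return {
    ...baseEnv,
    // Give each agent its own home directory so it gets a private kilo.db.
    KILO_TEST_HOME: `/tmp/agent-home-${agentId}`,
  };
}
```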
2. Hydration (Boot)
Before spawning kilo serve in startAgent, the container issues GET /api/towns/${townId}/rigs/${rigId}/agents/${agentId}/db-snapshot, which fetches the blob from Cloudflare KV.
If a blob is returned, it writes the blob to /tmp/agent-home-${agentId}/.local/share/kilo/kilo.db (ensuring parent directories exist). The kilo serve process will boot and resume the exact session.
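A sketch of the boot-time hydration step; hydrateAgentDb is a hypothetical helper, and a non-2xx response is assumed to mean "no snapshot yet":

```ts
import { mkdir, writeFile } from "node:fs/promises";
import path from "node:path";

// Fetch the kilo.db snapshot (if any) and write it into the agent's isolated home.
async function hydrateAgentDb(workerBaseUrl: string, townId: string, rigId: string, agentId: string) {
  const res = await fetch(
    `${workerBaseUrl}/api/towns/${townId}/rigs/${rigId}/agents/${agentId}/db-snapshot`,
  );
  if (!res.ok) return; // fresh agent — nothing to restore

  const dbPath = path.join(`/tmp/agent-home-${agentId}`, ".local/share/kilo/kilo.db");
  await mkdir(path.dirname(dbPath), { recursive: true });
  await writeFile(dbPath, Buffer.from(await res.arrayBuffer()));
}
```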
3. Flush (Eviction & Idle)
- On Eviction: During drainAll() in process-manager.ts, the SDK server is explicitly aborted and stopped. This ensures the SQLite WAL is flushed and the .db file is cleanly unlocked. We then read the kilo.db file from the isolated /tmp/ path and POST it back to the worker to be stored in KV.
- Periodic/Idle Flush: Inside handleIdleEvent(), we can trigger a non-blocking background flush of the kilo.db file to KV so that progress isn't solely reliant on the container gracefully draining.
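A sketch of the flush side; flushAgentDb is a hypothetical helper, and the POST request shape mirrors the GET endpoint above rather than any existing route:

```ts
import { readFile } from "node:fs/promises";

// Read the (unlocked) kilo.db file and ship it to the worker for storage in KV.
// Assumes the SDK server has already been stopped, so the WAL is checkpointed.
async function flushAgentDb(workerBaseUrl: string, townId: string, rigId: string, agentId: string) {
  const blob = await readFile(`/tmp/agent-home-${agentId}/.local/share/kilo/kilo.db`);
  await fetch(
    `${workerBaseUrl}/api/towns/${townId}/rigs/${rigId}/agents/${agentId}/db-snapshot`,
    {
      method: "POST",
      headers: { "content-type": "application/octet-stream" },
      body: blob,
    },
  );
}
```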
4. KV Storage Integration
Add a new KV namespace binding for agent DB snapshots (e.g., AGENT_DB_SNAPSHOTS_KV).
Create endpoints in the worker that accept the binary blob and write it using AGENT_DB_SNAPSHOTS_KV.put(agentId, blob) and read it via AGENT_DB_SNAPSHOTS_KV.get(agentId, "arrayBuffer"). This avoids storing potentially large DB blobs in AgentDO's SQLite.
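On the worker side, the handlers could be as thin as this sketch; the routing and Env typing are illustrative — only the put/get calls correspond to the binding described above:

```ts
// Worker-side snapshot handlers (sketch).
interface Env {
  AGENT_DB_SNAPSHOTS_KV: KVNamespace;
}

async function putSnapshot(env: Env, agentId: string, request: Request): Promise<Response> {
  const blob = await request.arrayBuffer();
  await env.AGENT_DB_SNAPSHOTS_KV.put(agentId, blob);
  return new Response(null, { status: 204 });
}

async function getSnapshot(env: Env, agentId: string): Promise<Response> {
  const blob = await env.AGENT_DB_SNAPSHOTS_KV.get(agentId, "arrayBuffer");
  if (!blob) return new Response(null, { status: 404 });
  return new Response(blob, { headers: { "content-type": "application/octet-stream" } });
}
```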
5. Process Registry in TownContainerDO
Persist the ProcessManager agents Map to TownContainerDO so the new container instance knows which agents were running, their session IDs, and ports. Boot becomes: read registry → hydrate sessions from KV → resume.
Recovery flow with KV-backed persistence
Container evicted
→ TownDO alarm detects dead container
→ New container starts
→ Control server reads process registry from TownContainerDO
→ For each previously-running agent:
→ Fetch `kilo.db` snapshot from Cloudflare KV
→ Write to local `/tmp/agent-home-${agentId}/.../kilo.db`
→ Start SDK server with isolated `KILO_TEST_HOME`
→ Agent resumes mid-conversation — no context loss
→ Report ready to TownDO
Files
- container/src/control-server.ts — SIGTERM handler (line 649-657)
- container/src/process-manager.ts — stopAll() (line 759), new drainAll() function
- src/dos/town/reconciler.ts — draining flag check in dispatch rules
- src/dos/town/events.ts — new container_eviction event type
- src/dos/Town.do.ts — sendMayorMessage checkpoint fix, new /container-eviction endpoint
Acceptance Criteria
Tier 1 (quick fix)
- sendMayorMessage reads and passes the Mayor's checkpoint
Tier 1.5 (graceful eviction)
Tier 2 (KV-backed persistence)
References