Parent
Part of #204 (Phase 3: Multi-Rig + Scaling)
Problem
When the container restarts (deploy, eviction, crash, sleep/wake), all agent sessions lose their entire conversation history. The Mayor is the worst-hit — a user who has been chatting with the Mayor for an hour suddenly gets a fresh session with zero memory of what was discussed. Polecats are more forgiving (focused tasks), but even they lose context about partial work and prior reasoning.
Currently, sendMayorMessage re-dispatches the Mayor with checkpoint: null (Town.do.ts:1810) — it does not even read the Mayor's checkpoint, let alone restore conversation history. The new session gets only the user's current message as its initial prompt.
What's Preserved vs. Lost Today
Preserved in DOs (survives restarts):
- Agent metadata, bead data, checkpoints — TownDO SQLite
- SDK streaming events (message.created, message.completed, message_part.updated, assistant.completed) — AgentDO rig_agent_events table
- Bead events, mail, review queue, convoys — TownDO SQLite
Lost in container memory:
- Full conversation history (all user/assistant/tool messages)
- SDK session state
- Active tool call state
- In-progress reasoning
Key Insight: AgentDO Already Has the Data
The AgentDO stores SDK streaming events that contain message content. These events include message.created, message.completed, message_part.updated, and assistant.completed — the raw material to reconstruct conversation turns. The 10,000 event cap is generous (a typical Mayor session produces ~5-20 events per turn, so ~500-2000 events for a 100-turn conversation).
The events are streaming deltas, not clean {role, content} turns, but they can be reassembled.
Solution
Three tiers: a quick fix (context injection from existing data), graceful eviction handling (save work during the SIGTERM window), and the long-term strategy (KV-backed persistence so nothing is ever lost).
Tier 1: Context injection from AgentDO events (quick fix)
The AgentDO already stores SDK streaming events with message content. Reconstruct the conversation from these events and inject it into the new session on re-dispatch.
1a. Conversation reconstruction function
Add a reconstructConversation(agentId) function that:
- Queries AgentDO.getEvents() for the agent's last session
- Filters for message-type events (message.created, message.completed, message_part.updated)
- Groups events by message boundaries (using message.created → message.completed as delimiters)
- Reassembles streaming deltas into complete {role: 'user'|'assistant', content: string} turns
- Returns a conversation transcript (array of turns or formatted text)
This does not need to be perfect — the goal is semantic continuity, not byte-level replay.
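A minimal sketch of that reassembly loop, assuming each stored event exposes a type, an optional role, and delta text on message_part.updated — the real rig_agent_events payload shape may differ:

```ts
// Sketch only — the event payload fields (type, role, text) are assumptions,
// not the actual rig_agent_events schema.
type AgentEvent = {
  type: string;
  role?: "user" | "assistant";
  text?: string; // delta text carried by message_part.updated (assumed)
};

type Turn = { role: "user" | "assistant"; content: string };

function reconstructConversation(events: AgentEvent[]): Turn[] {
  const turns: Turn[] = [];
  let current: Turn | null = null;

  for (const ev of events) {
    switch (ev.type) {
      case "message.created":
        // A message boundary opens a fresh turn.
        current = { role: ev.role ?? "assistant", content: "" };
        break;
      case "message_part.updated":
        // Streaming deltas are appended to the open turn.
        if (current && ev.text) current.content += ev.text;
        break;
      case "message.completed":
        // Boundary closes: flush the assembled turn.
        if (current) {
          turns.push(current);
          current = null;
        }
        break;
      default:
        break; // assistant.completed and tool events are ignored here
    }
  }
  // Tolerate a trailing unterminated message (e.g. interrupted mid-stream).
  if (current?.content) turns.push(current);
  return turns;
}
```

Treating message.created → message.completed as the turn boundary keeps the logic tolerant of duplicated or missing deltas, which is all that semantic continuity requires.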
1b. Context injection on re-dispatch
When sendMayorMessage detects the Mayor needs re-dispatch (container restarted, isAlive = false):
- Call reconstructConversation(mayorAgentId) to get the prior transcript
- Include the transcript in the initial prompt as prior conversation context
- The Mayor continues with awareness of what was discussed
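For illustration, the injection could look roughly like this; formatPriorContext and the prompt layout are assumptions, not existing code, and Turn is the shape produced by the reconstruction sketch above:

```ts
// Hypothetical glue for the re-dispatch path — the helper and prompt layout are
// illustrative only.
function formatPriorContext(turns: Turn[], maxTurns = 50): string {
  const recent = turns.slice(-maxTurns);
  const lines = recent.map((t) => `${t.role === "user" ? "User" : "Mayor"}: ${t.content}`);
  return [
    "Prior conversation, restored after a container restart:",
    ...lines,
    "--- end of prior conversation ---",
  ].join("\n");
}

// On re-dispatch, prepend the restored transcript to the user's new message:
//   const turns = reconstructConversation(await agentDO.getEvents(mayorAgentId));
//   const initialPrompt = `${formatPriorContext(turns)}\n\n${userMessage}`;
```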
1c. Fix Mayor checkpoint propagation
sendMayorMessage at Town.do.ts:1810 passes checkpoint: null. Change to read the Mayor's checkpoint:
```ts
const checkpoint = agents.readCheckpoint(this.sql, mayorAgent.id);
```
This is a one-line fix that should be done immediately, independent of the larger work.
Context window management
- Truncate to last N turns — keep the most recent conversation (e.g., last 50 turns)
- Summarize older turns — use an LLM call to summarize the first half into a paragraph, keep the recent half verbatim
- Token budget — set a max token budget for the restored transcript (e.g., 20% of the model's context window) and truncate/summarize to fit
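A sketch of the token-budget option; the 4-characters-per-token heuristic and the 20% default fraction are assumptions to be tuned, not measured values:

```ts
// Keep the most recent turns that fit within a fraction of the model's context window.
function fitToTokenBudget(turns: Turn[], contextWindowTokens: number, fraction = 0.2): Turn[] {
  const budgetChars = Math.floor(contextWindowTokens * fraction) * 4; // rough chars-per-token estimate
  const kept: Turn[] = [];
  let used = 0;
  // Walk backwards so the most recent turns are kept verbatim.
  for (let i = turns.length - 1; i >= 0; i--) {
    const cost = turns[i].content.length;
    if (used + cost > budgetChars) break;
    kept.unshift(turns[i]);
    used += cost;
  }
  return kept;
}
```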
Tier 1.5: Graceful container eviction — SIGTERM draining with agent save-and-park
Cloudflare Containers send SIGTERM 15 minutes before SIGKILL on host server restarts. The current shutdown handler (control-server.ts:649-654) immediately aborts all sessions and exits. We should use the 15-minute window to let agents save their work gracefully and notify the TownDO so the reconciler pauses dispatch.
From Cloudflare docs:
When a container instance is going to be shut down, it is sent a SIGTERM signal, and then a SIGKILL signal after 15 minutes. You should perform any necessary cleanup to ensure a graceful shutdown in this time. The container instance will be rebooted elsewhere shortly after this.
Current Behavior
```ts
// control-server.ts:649-654
const shutdown = async () => {
  stopHeartbeat();
  await stopAll(); // Immediately aborts all sessions
  process.exit(0);
};
process.on("SIGTERM", () => void shutdown());
```
stopAll() aborts every session via session.abort(), sets all agents to exited, and kills SDK servers. No git push, no checkpoint save, no TownDO notification.
Proposed SIGTERM Handling
Phase 1: Notify TownDO (immediate, on SIGTERM)
Container sends POST /api/towns/:townId/container-eviction to the worker. The TownDO:
- Inserts a container_eviction event
- The reconciler processes this event and sets a draining flag
- Reconciler stops emitting dispatch_agent actions for new work
- Does NOT interrupt currently running agents
Phase 2: Nudge running agents to save and park (first 2 minutes)
For each running agent, inject a nudge message:
- Polecats: "The container is shutting down. Please commit and push your current changes immediately, then call gt_done. You have 2 minutes."
- Refinery: "The container is shutting down. If your review is complete, call gt_done now. Otherwise, your work will be saved and the review will resume after restart."
- Mayor: No nudge needed — conversation history is already in AgentDO events.
Phase 3: Wait for agents to finish (up to 10 minutes)
Monitor running agents. As each one calls gt_done or finishes, it exits cleanly. Wait until all agents have exited OR 10 minutes have elapsed (leaving 5 min buffer before SIGKILL).
Phase 4: Force save and exit (last 5 minutes)
For any agents still running after 10 minutes:
- Force git add -A && git commit -m "WIP: container eviction save" && git push
- Abort the session
- Report agentCompleted with a reason indicating eviction save
Phase 5: Clean exit — stopAll() as today, then process.exit(0).
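Putting the phases together, the replacement handler might be sequenced like this; notifyEviction, nudgeRunningAgents, waitForAgentsToExit, and forceSaveRemaining are hypothetical helpers — only stopHeartbeat and stopAll exist today:

```ts
// Sequencing sketch for the drained shutdown — helper names and signatures are illustrative.
const MINUTE = 60_000;

const shutdown = async () => {
  stopHeartbeat();
  await notifyEviction();                                  // Phase 1: TownDO sets the draining flag
  await nudgeRunningAgents();                              // Phase 2: ask agents to save and call gt_done
  const allDone = await waitForAgentsToExit(10 * MINUTE);  // Phase 3: wait until T+10:00 at most
  if (!allDone) await forceSaveRemaining();                // Phase 4: WIP commit + push, abort, report
  await stopAll();                                         // Phase 5: tear down SDK servers as today
  process.exit(0);
};
process.on("SIGTERM", () => void shutdown());
```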
Timing Budget
T+0:00 SIGTERM received
T+0:01 Notify TownDO (draining), nudge agents to save
T+0:01 Agents begin saving (commit, push, gt_done)
T+2:00 Most agents have finished saving
T+10:00 Force-save remaining agents (git commit + push)
T+10:30 stopAll(), report completions
T+11:00 process.exit(0)
T+15:00 SIGKILL (we should be long gone)
Reconciler Integration
applyEvent('container_eviction') sets a draining flag. reconcileBeads Rule 1 and reconcileReviewQueue Rule 5 check it before emitting dispatch_agent. After the container restarts and sends its first heartbeat, the TownDO clears the draining flag and dispatch resumes.
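A sketch of that gate, assuming the reconciler keeps a simple state object; the actual rule and event signatures may differ, and the heartbeat event name is assumed:

```ts
// Illustrative only — not the reconciler's real state shape.
interface ReconcilerState {
  draining: boolean;
}

function applyEvent(state: ReconcilerState, event: { type: string }): ReconcilerState {
  if (event.type === "container_eviction") return { ...state, draining: true };
  if (event.type === "container_heartbeat") return { ...state, draining: false }; // first heartbeat after restart (event name assumed)
  return state;
}

// reconcileBeads Rule 1 / reconcileReviewQueue Rule 5 would consult this gate
// before emitting a dispatch_agent action.
function shouldDispatch(state: ReconcilerState): boolean {
  return !state.draining;
}
```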
Tier 2: KV-backed persistence — don't rely on ephemeral disk (long-term strategy)
The Tier 1 approach reconstructs state from streaming deltas, which is fragile and lossy. The proper fix is to never lose the state in the first place: persist the full session state instead of leaving it only in ephemeral container disk and memory.
Design principle: Flush the SQLite Blob to Cloudflare KV
Cloudflare Container disks are ephemeral — anything written is lost on eviction. While Durable Objects are durable and globally consistent, their SQLite rows have a 2MB size limit. The agent's SDK SQLite database (kilo.db) can grow larger than 2MB. Therefore, we will use Cloudflare KV, which has a 25MB value limit, to store the database blob.
Instead of intercepting individual Drizzle ORM queries, we can use a Write-behind cache (Option C) approach by treating the agent's SDK SQLite database (kilo.db) as a binary blob and syncing the entire file to Cloudflare KV.
1. Isolate the DB per Agent
Currently, agents running in the same container share the same kilo.db file in ~/.local/share/kilo/kilo.db. The kilocode CLI supports overriding the home directory via the KILO_TEST_HOME environment variable (thanks to a custom patch).
We inject KILO_TEST_HOME=/tmp/agent-home-${agentId} into the agent's environment payload during buildAgentEnv() / startAgent().
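A sketch of the injection; buildAgentEnv() exists in the container code, but this signature is illustrative and KILO_TEST_HOME is the only variable the kilocode patch actually reads:

```ts
// Sketch — the surrounding env handling is illustrative.
function buildAgentEnv(agentId: string, baseEnv: Record<string, string>): Record<string, string> {
  return {
    ...baseEnv,
    // Give each agent its own home directory so it gets a private kilo.db.
    KILO_TEST_HOME: `/tmp/agent-home-${agentId}`,
  };
}
```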
2. Hydration (Boot)
Before spawning kilo serve in startAgent, the container issues GET /api/towns/${townId}/rigs/${rigId}/agents/${agentId}/db-snapshot, which fetches the blob from Cloudflare KV.
If a blob is returned, it writes the blob to /tmp/agent-home-${agentId}/.local/share/kilo/kilo.db (ensuring parent directories exist). The kilo serve process will boot and resume the exact session.
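A sketch of the boot-time hydration step; hydrateAgentDb is a hypothetical helper, and a non-2xx response is assumed to mean "no snapshot yet":

```ts
import { mkdir, writeFile } from "node:fs/promises";
import path from "node:path";

// Fetch the kilo.db snapshot (if any) and write it into the agent's isolated home.
async function hydrateAgentDb(workerBaseUrl: string, townId: string, rigId: string, agentId: string) {
  const res = await fetch(
    `${workerBaseUrl}/api/towns/${townId}/rigs/${rigId}/agents/${agentId}/db-snapshot`,
  );
  if (!res.ok) return; // fresh agent — nothing to restore

  const dbPath = path.join(`/tmp/agent-home-${agentId}`, ".local/share/kilo/kilo.db");
  await mkdir(path.dirname(dbPath), { recursive: true });
  await writeFile(dbPath, Buffer.from(await res.arrayBuffer()));
}
```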
3. Flush (Eviction & Idle)
- On Eviction: During drainAll() in process-manager.ts, the SDK server is explicitly aborted and stopped. This ensures the SQLite WAL is flushed and the .db file is cleanly unlocked. We then read the kilo.db file from the isolated /tmp/ path and POST it back to the worker to be stored in KV.
- Periodic/Idle Flush: Inside handleIdleEvent(), we can trigger a non-blocking background flush of the kilo.db file to KV so that progress isn't solely reliant on the container gracefully draining.
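A sketch of the flush side; flushAgentDb is a hypothetical helper, and the POST request shape mirrors the GET endpoint above rather than any existing route:

```ts
import { readFile } from "node:fs/promises";

// Read the (unlocked) kilo.db file and ship it to the worker for storage in KV.
// Assumes the SDK server has already been stopped, so the WAL is checkpointed.
async function flushAgentDb(workerBaseUrl: string, townId: string, rigId: string, agentId: string) {
  const blob = await readFile(`/tmp/agent-home-${agentId}/.local/share/kilo/kilo.db`);
  await fetch(
    `${workerBaseUrl}/api/towns/${townId}/rigs/${rigId}/agents/${agentId}/db-snapshot`,
    {
      method: "POST",
      headers: { "content-type": "application/octet-stream" },
      body: blob,
    },
  );
}
```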
4. KV Storage Integration
Add a new KV namespace binding for agent DB snapshots (e.g., AGENT_DB_SNAPSHOTS_KV).
Create endpoints in the worker that accept the binary blob and write it using AGENT_DB_SNAPSHOTS_KV.put(agentId, blob) and read it via AGENT_DB_SNAPSHOTS_KV.get(agentId, "arrayBuffer"). This avoids storing potentially large DB blobs in AgentDO's SQLite.
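On the worker side, the handlers could be as thin as this sketch; the routing and Env typing are illustrative — only the put/get calls correspond to the binding described above:

```ts
// Worker-side snapshot handlers (sketch).
interface Env {
  AGENT_DB_SNAPSHOTS_KV: KVNamespace;
}

async function putSnapshot(env: Env, agentId: string, request: Request): Promise<Response> {
  const blob = await request.arrayBuffer();
  await env.AGENT_DB_SNAPSHOTS_KV.put(agentId, blob);
  return new Response(null, { status: 204 });
}

async function getSnapshot(env: Env, agentId: string): Promise<Response> {
  const blob = await env.AGENT_DB_SNAPSHOTS_KV.get(agentId, "arrayBuffer");
  if (!blob) return new Response(null, { status: 404 });
  return new Response(blob, { headers: { "content-type": "application/octet-stream" } });
}
```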
5. Process Registry in TownContainerDO
Persist the ProcessManager agents Map to TownContainerDO so the new container instance knows which agents were running, their session IDs, and ports. Boot becomes: read registry → hydrate sessions from KV → resume.
Recovery flow with KV-backed persistence
Container evicted
→ TownDO alarm detects dead container
→ New container starts
→ Control server reads process registry from TownContainerDO
→ For each previously-running agent:
→ Fetch `kilo.db` snapshot from Cloudflare KV
→ Write to local `/tmp/agent-home-${agentId}/.../kilo.db`
→ Start SDK server with isolated `KILO_TEST_HOME`
→ Agent resumes mid-conversation — no context loss
→ Report ready to TownDO
Files
- container/src/control-server.ts — SIGTERM handler (line 649-657)
- container/src/process-manager.ts — stopAll() (line 759), new drainAll() function
- src/dos/town/reconciler.ts — draining flag check in dispatch rules
- src/dos/town/events.ts — new container_eviction event type
- src/dos/Town.do.ts — sendMayorMessage checkpoint fix, new /container-eviction endpoint
Acceptance Criteria
Tier 1 (quick fix)
- sendMayorMessage reads and passes the Mayor's checkpoint
Tier 1.5 (graceful eviction)
Tier 2 (KV-backed persistence)
References