feat(polecat): persist agent conversation across container restarts#1300

Closed
jrf0110 wants to merge 8 commits into main from
convoy/persist-agent-conversation-across-contai/017955a4/head

@jrf0110 jrf0110 commented Mar 19, 2026

Summary

Adds conversation transcript reconstruction and injection so that agents resume with full context after a container restart, rather than starting fresh.

  • Introduces reconstructConversation utility (cloudflare-gastown/src/util/reconstruct-conversation.util.ts) that rebuilds { role, content } turns from the raw stream of AgentDO events (message.updated, message.completed, message_part.updated). Handles streaming edge cases: parts arriving before message info, synthetic/ignored parts, tool-only turns, and mid-stream crashes.
  • On re-dispatch of a polecat agent, prior events are fetched from the AgentDO, reconstructed into a transcript, and injected as beadBody so the new container sees the prior session.
  • On Mayor re-dispatch (send message while already dispatched), the same transcript injection is applied, along with forwarding the Mayor's checkpoint.
  • Exposes getAgentEvents via a new tRPC endpoint (with rig ownership verification) so clients can query agent events directly.
  • Fixes z.number().nonneg() → z.number().min(0), since nonneg() is not available in the installed Zod version.
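
The reconstruction step described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code from reconstruct-conversation.util.ts: the event shapes (`AgentEvent`, `info`, `part`, `kind`, `synthetic`) and the function name `reconstructTurns` are assumptions made for the example, and the summary.body fallback and malformed-event handling are left out.

```typescript
// Hypothetical event shapes; the real AgentDO payloads may differ.
type Role = 'user' | 'assistant';

type AgentEvent =
  | { type: 'message.updated' | 'message.completed'; info: { id: string; role: Role } }
  | {
      type: 'message_part.updated';
      part: { id: string; messageId: string; kind: 'text' | 'tool'; text?: string; synthetic?: boolean };
    };

interface Turn {
  role: Role;
  content: string;
}

function reconstructTurns(events: AgentEvent[], maxTurns = 50): Turn[] {
  const roles = new Map<string, Role>();
  // messageId -> (partId -> latest text); a Map keeps first-insertion order.
  const parts = new Map<string, Map<string, string>>();
  const order: string[] = []; // message ids in first-seen order

  // Register a message the first time it is seen, whether via info or via a part.
  const touch = (messageId: string): Map<string, string> => {
    let bucket = parts.get(messageId);
    if (!bucket) {
      bucket = new Map<string, string>();
      parts.set(messageId, bucket);
      order.push(messageId);
    }
    return bucket;
  };

  for (const ev of events) {
    if (ev.type === 'message_part.updated') {
      const { part } = ev;
      // Tool calls and synthetic parts are not conversational text.
      if (part.kind !== 'text' || part.synthetic) continue;
      // A later streaming update for the same part replaces the earlier text.
      touch(part.messageId).set(part.id, part.text ?? '');
    } else {
      touch(ev.info.id);
      roles.set(ev.info.id, ev.info.role);
    }
  }

  const turns: Turn[] = [];
  for (const id of order) {
    const role = roles.get(id);
    const bucket = parts.get(id);
    const content = bucket ? Array.from(bucket.values()).join('').trim() : '';
    // Drop messages whose role never arrived and tool-only (empty) turns.
    if (!role || content === '') continue;
    turns.push({ role, content });
  }
  // Keep only the most recent maxTurns turns.
  return maxTurns > 0 ? turns.slice(-maxTurns) : [];
}
```

Keying parts by id means a streaming update overwrites the earlier partial text instead of duplicating it, and registering a message on first sight (from either event kind) is what lets parts arrive before the message info.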

Verification

  • 342-line unit test suite in cloudflare-gastown/test/unit/reconstruct-conversation.test.ts covers: basic user↔assistant exchanges, multi-part concatenation, streaming updates, part-before-message ordering, tool-only skipping, synthetic/ignored parts, summary.body fallback, malformed/unknown events, and truncation to maxTurns.
  • Code reviewed for correctness, style, security, and test coverage.
  • No build artifacts or secrets included.

Visual Changes

N/A

Reviewer Notes

The getAgentEvents method on TownDO already existed and typed its return as Promise<unknown[]> for cross-DO type safety. The new dispatch code uses RigAgentEventRecord.array().safeParse(rawEvents) defensively and falls back to an empty transcript if parsing fails, so the re-dispatch path is never broken by unexpected event shapes.
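
As an illustration of the defensive parse described above, the sketch below validates an `unknown[]` and falls back to an empty list. To stay dependency-free it swaps the Zod call for a hand-written type guard; the `EventRecord` fields and the helper names are hypothetical, since the actual RigAgentEventRecord schema is not shown here.

```typescript
// Hypothetical record shape; the real RigAgentEventRecord schema may differ.
interface EventRecord {
  id: number;
  event_type: string;
  data: unknown;
}

function isEventRecord(value: unknown): value is EventRecord {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.id === 'number' && typeof v.event_type === 'string' && 'data' in v;
}

// Mirrors the all-or-nothing result of safeParse on an array schema:
// one malformed element fails the whole parse, and the caller gets [].
function parseEventsOrEmpty(raw: unknown[]): EventRecord[] {
  return raw.every(isEventRecord) ? (raw as EventRecord[]) : [];
}
```

With an empty result the dispatch code would simply build no transcript, which matches the stated goal that unexpected event shapes never break re-dispatch.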

The maxTurns default of 50 is conservative — it keeps the most recent context while bounding prompt size.

jrf0110 added 8 commits March 19, 2026 16:34

Add AgentEventOutput schema and RpcAgentEventOutput wrapper to trpc/schemas.ts,
then wire up a getAgentEvents gastownProcedure in trpc/router.ts that delegates
to TownDO.getAgentEvents() with cursor-based pagination (afterId, limit).

…n-across-contai/017955a4/gt/toast/c278af77' into convoy/persist-agent-conversation-across-contai/017955a4/head

Adds reconstructConversation() that takes a sequence of AgentDO streaming
events and reassembles them into clean { role, content } turns. Handles
message.updated / message.completed (info payload), message_part.updated
(both underscore and dot variants), tool-only turns, synthetic/ignored
parts, parts arriving before message info, and malformed events. Supports
configurable maxTurns truncation (default 50, most-recent kept).

21 unit tests covering happy path, edge cases, and truncation.

…atch

When the Mayor container is dead and needs re-dispatch, reconstruct the
prior conversation using reconstructConversation() and inject it into
beadBody so the Mayor resumes with full context.

Also fix the checkpoint: null bug in sendMayorMessage — now reads
mayor.checkpoint from the agent record instead of always passing null.

When a polecat is re-dispatched after a container restart, reconstruct
the agent's prior session from AgentDO events and inject it as
'Prior conversation:...' in beadBody, matching the same pattern used
for Mayor re-dispatch in sendMayorMessage(). This prevents duplicate
work and gives the new container full context of what was done before.

nonneg() does not exist in the installed version of zod; min(0) is the
correct equivalent.

.query(async ({ ctx, input }) => {
  const rig = await verifyRigOwnership(ctx.env, ctx.userId, input.rigId, ctx.orgMemberships);
  const townStub = getTownDOStub(ctx.env, rig.town_id);
  return townStub.getAgentEvents(input.agentId, input.afterId, input.limit);

CRITICAL: Missing rig-to-agent authorization check

This endpoint authorizes input.rigId but then returns events for an arbitrary input.agentId. Because TownDO.getAgentEvents() simply proxies to the AgentDO by id, a caller who knows another agent UUID can read its full transcript without proving that agent belongs to the requested rig/town.

// Reconstruct the agent's prior session transcript and inject it on
// re-dispatch (after a container restart) so work isn't duplicated.
// The presence of prior events is the signal: a fresh container has none.
const rawEvents = await this.getAgentEvents(agent.id);

WARNING: Prior transcript is not scoped to the current bead/session

getAgentEvents(agent.id) returns the agent's entire event log. Polecats are reused once they go idle, and refineries are singleton per rig, so after a later container restart this will replay transcript from previous beads into the next assignment. This needs a bead/session boundary or an event-log reset before reconstructing context.

const priorTranscript = priorTurns
  .map(t => `[${t.role === 'user' ? 'User' : 'Assistant'}]: ${t.content}`)
  .join('\n\n');
beadBody = `Prior conversation:\n\n${priorTranscript}`;

WARNING: This drops the bead body on resume

When prior turns exist, beadBody becomes only the reconstructed transcript, so the original bead description and acceptance criteria in bead.body disappear from the restart prompt. A restarted agent can lose the task details even though the transcript is restored.

kilo-code-bot Bot commented Mar 19, 2026

Code Review Summary

Status: 3 Issues Found | Recommendation: Address before merge

| Severity | Count |
| --- | --- |
| CRITICAL | 1 |
| WARNING | 2 |
| SUGGESTION | 0 |

Issue Details

CRITICAL

| File | Line | Issue |
| --- | --- | --- |
| cloudflare-gastown/src/trpc/router.ts | 922 | getAgentEvents authorizes the rig but never verifies that the requested agent belongs to that rig/town, which exposes arbitrary agent transcripts to callers who know another agent UUID. |

WARNING

| File | Line | Issue |
| --- | --- | --- |
| cloudflare-gastown/src/dos/Town.do.ts | 3116 | Re-dispatch reconstructs history from the agent's entire lifetime event log, so reused polecat/refinery agents can inject transcript from an unrelated prior bead after a container restart. |
| cloudflare-gastown/src/dos/Town.do.ts | 3124 | Replacing beadBody with only the reconstructed transcript drops the original bead description and acceptance criteria from the restart prompt. |

Other Observations (not in diff)

None.

Files Reviewed (5 files)
  • cloudflare-gastown/src/dos/Town.do.ts - 2 issues
  • cloudflare-gastown/src/trpc/router.ts - 1 issue
  • cloudflare-gastown/src/trpc/schemas.ts - 0 issues
  • cloudflare-gastown/src/util/reconstruct-conversation.util.ts - 0 issues
  • cloudflare-gastown/test/unit/reconstruct-conversation.test.ts - 0 issues

Reviewed by gpt-5.4-20260305 · 1,085,843 tokens

jrf0110 commented Mar 19, 2026

Refinery Review — Request Changes

All CI checks pass and the core reconstructConversation utility is well-implemented with good test coverage. The checkpoint fix for Mayor re-dispatch (checkpoint: null → mayor.checkpoint) is a solid bug fix. However, there are three correctness/security issues that need to be addressed before landing.


1. CRITICAL — Authorization gap in getAgentEvents tRPC endpoint

File: cloudflare-gastown/src/trpc/router.ts

verifyRigOwnership validates that the caller owns input.rigId, but then input.agentId is passed directly to townStub.getAgentEvents() without verifying that the agent belongs to that rig/town. TownDO.getAgentEvents does not call ensureInitialized() and simply proxies to the AgentDO by UUID — no membership check whatsoever.

A caller who knows another user's agent UUID can call this endpoint with their own rigId and any agentId to read the full conversation transcript of an agent they don't own.

Fix: Verify the agent belongs to the rig before returning events. After calling verifyRigOwnership, pass rig.id down to the TownDO and check agent.rig_id === rig.id there, or add a getAgent lookup in the endpoint before calling getAgentEvents. Compare to the deleteAgent handler (router.ts:548) which takes the same rigId + agentId shape — but note that deleteAgent also doesn't verify the agent-rig relationship today, so it may have the same gap.
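
One possible shape for that check is sketched below. It assumes the agent record can be looked up before events are fetched and that it carries a `rig_id` field, as the comparison `agent.rig_id === rig.id` above suggests; `AgentRecord`, `AgentAccessError`, and `assertAgentInRig` are hypothetical names introduced for the example.

```typescript
// Hypothetical minimal agent shape; only the fields needed for the check.
interface AgentRecord {
  id: string;
  rig_id: string | null;
}

class AgentAccessError extends Error {
  constructor() {
    super('Agent not found for this rig');
    this.name = 'AgentAccessError';
  }
}

// Throws unless the agent exists and is attached to the rig the caller
// has already been authorized for. "Missing" and "wrong rig" produce the
// same error so the response does not confirm that a guessed UUID exists.
function assertAgentInRig(agent: AgentRecord | null, rigId: string): AgentRecord {
  if (agent === null || agent.rig_id !== rigId) {
    throw new AgentAccessError();
  }
  return agent;
}
```

In the endpoint this would run after verifyRigOwnership and before getAgentEvents, with the verified rig's id passed as `rigId`.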


2. BUG — Event log is not scoped to the current bead/session (dispatchAgent)

File: cloudflare-gastown/src/dos/Town.do.ts ~line 3116

getAgentEvents(agent.id) returns the agent's entire event history, not just events from the current bead/session. Polecat agents are reused across beads (they go idle, get a new bead, get re-dispatched). After a container restart on bead N, the transcript injected will include turns from beads N-1, N-2, etc.

The comment says "The presence of prior events is the signal: a fresh container has none" — but this does not distinguish "restarted mid-current-task" from "agent was previously used for a different task". The result is that a restarted agent will receive transcript from entirely different prior work as its "prior conversation".

Fix: Scope the event query to events created after the agent was hooked to the current bead. The bead's created_at or the agent's last_activity_at at hook time could serve as a cursor boundary, or events could be cleared/reset when the agent transitions to a new bead.
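
A rough sketch of the cursor-boundary option follows. It assumes each stored event carries an ISO-8601 `created_at` timestamp and that the time the agent was hooked to the current bead is available as a string; both are assumptions, and `StoredEvent` and `eventsSinceHook` are illustrative names only.

```typescript
// Hypothetical stored-event shape; ISO-8601 timestamps are assumed.
interface StoredEvent {
  id: number;
  created_at: string;
}

// Keep only events recorded at or after the moment the agent was hooked
// to the current bead, so turns from earlier beads are not replayed.
function eventsSinceHook<T extends StoredEvent>(events: T[], hookedAt: string): T[] {
  const boundary = Date.parse(hookedAt);
  // With no usable boundary, prefer an empty transcript over a wrong one.
  if (Number.isNaN(boundary)) return [];
  // Events with unparseable timestamps compare as NaN and are dropped.
  return events.filter(e => Date.parse(e.created_at) >= boundary);
}
```

Filtering in memory keeps the sketch short; if the event store supports it, applying the same boundary in the query would avoid loading the agent's full history first.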


3. BUG — Original bead.body is discarded on restart (dispatchAgent)

File: cloudflare-gastown/src/dos/Town.do.ts ~lines 3119-3124

let beadBody = bead.body ?? '';
if (priorTurns.length > 0) {
  beadBody = `Prior conversation:\n\n${priorTranscript}`;  // bead.body is dropped here
}

When prior turns exist, beadBody is replaced entirely by the transcript. The original task description, acceptance criteria, and any instructions in bead.body are lost. A restarted agent will have conversation history but no task context.

Fix: Prepend/append rather than replace:

const base = bead.body ?? '';
beadBody = base
  ? `${base}\n\nPrior conversation:\n\n${priorTranscript}`
  : `Prior conversation:\n\n${priorTranscript}`;

Note: The Mayor re-dispatch path does not have this issue since Mayor's beadBody was already '' before this PR — the transcript is purely additive there. The session-scoping concern (issue 2) still applies to Mayor.


Summary: Issue 1 is a security/privacy issue that should be fixed before landing. Issues 2 and 3 are correctness bugs that will cause confusing agent behavior on restart. Recommend fixing all three before merge.

jrf0110 commented Apr 5, 2026

Refinery code review passed. Previously requested review issues appear resolved in the latest diff.

jrf0110 commented Apr 6, 2026

Refinery re-review complete. Previously raised issues are still present: the new tRPC endpoint authorizes only the rig id but does not verify that the requested agent belongs to that rig/town, and the restart transcript recovery still replays the agent's full historical event log while replacing the original bead body. PR is not ready to merge.
