Bug: Deleting a town doesn't clean up TownDO, TownContainerDO, or AgentDOs — resources leak indefinitely #1182

@jrf0110

Description

Parent

Part of #204 (Phase 3: Multi-Rig + Scaling)

Problem

Deleting a town from the UI does not clean up the underlying Durable Objects or containers. With only 3 active users and 4 active towns, there are dozens of TownDOs still running alarm loops and ~10 extra TownContainerDOs consuming container instances. Deleted towns keep running indefinitely, burning Cloudflare resources.

Root Causes (6 compounding issues)

1. tRPC deleteTown doesn't call TownDO.destroy() — THE PRIMARY BUG

File: src/trpc/router.ts:156-162

The UI calls deleteTown via tRPC, which calls GastownUserDO.deleteTown(). This only deletes rows from the user_towns and user_rigs tables in GastownUserDO's storage. It never touches the TownDO, TownContainerDO, or AgentDOs. The town is removed from the user's list but continues running everything.

The HTTP handler at src/handlers/towns.handler.ts:128-146 does call townDOStub.destroy(), but the tRPC path (which the UI uses) doesn't.

2. Neither deletion path destroys the TownContainerDO

TownDO.destroy() at Town.do.ts:4044-4058 deletes agents and clears TownDO storage, but never calls getTownContainerStub().destroy(). The HTTP handler also doesn't destroy the container. No code path stops the container when a town is deleted.

3. TownDO alarm unconditionally re-arms

File: Town.do.ts:2638-2640

const active = this.hasActiveWork();
const interval = active ? ACTIVE_ALARM_INTERVAL_MS : IDLE_ALARM_INTERVAL_MS;
await this.ctx.storage.setAlarm(Date.now() + interval);

There is no exit condition. Even idle towns with no work fire every 60 seconds forever. The alarm never checks if the town has been deleted.

4. TownDO alarm keeps the container alive

ensureContainerReady() at Town.do.ts:3775 calls container.fetch('http://container/health') on every alarm tick when the town has rigs. Each health check resets the TownContainerDO's 30-minute sleepAfter timer. Since the alarm fires every 5-60 seconds, the container never sleeps, even for deleted towns that still have rigs in their tables.
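
For context, this is roughly how the container class would be configured, assuming TownContainerDO extends the Container class from @cloudflare/containers (the port value is a placeholder):

import { Container } from '@cloudflare/containers';

export class TownContainerDO extends Container {
  defaultPort = 8080;  // placeholder; the real port is not shown in this issue
  sleepAfter = '30m';  // each proxied fetch, including health checks, resets this idle timeout
}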

5. Constructor re-arms alarm on any interaction

initializeDatabase() at Town.do.ts:437-439 calls armAlarmIfNeeded(). Even after a successful destroy(), if anything touches the DO stub (a stale request, dashboard polling, a late API call), the constructor fires and re-creates the alarm loop from scratch — on a DO with empty tables. This is a resurrection problem.

6. Compatibility date too old for deleteAll() to clear alarms

wrangler.jsonc:5 pins the compatibility date to 2026-01-27, but deleteAll() only started clearing alarms at compatibility date 2026-02-24. The code compensates by calling deleteAlarm() explicitly in destroy(), which is fragile when destroy() is never called (root cause #1).

Resource Leak Summary

| Resource | tRPC deleteTown (UI path) | HTTP DELETE handler |
|---|---|---|
| GastownUserDO rows | Deleted | Deleted |
| TownDO storage | Never cleared | Cleared via destroy() |
| TownDO alarm | Never deleted; fires every 5-60s forever | Deleted, but can resurrect |
| TownContainerDO | Never stopped | Never stopped |
| AgentDOs | Never cleaned | Cleaned via TownDO.destroy() |
| Container process | Runs indefinitely | Runs indefinitely |

Fix

Fix 1: tRPC deleteTown must call TownDO.destroy()

In src/trpc/router.ts:156-162, after verifyTownOwnership, call townDOStub.destroy() before (or alongside) userStub.deleteTown(). Match the HTTP handler's behavior.
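
A minimal sketch of the patched mutation, assuming a standard tRPC router shape; getTownDOStub and getGastownUserStub are hypothetical helper names, and the real mutation at src/trpc/router.ts:156-162 may be shaped differently:

deleteTown: protectedProcedure
  .input(z.object({ townId: z.string() }))
  .mutation(async ({ ctx, input }) => {
    await verifyTownOwnership(ctx, input.townId);

    // NEW: stop the town's alarms, agents, and storage first, so a
    // partial failure still halts the resource leak.
    const townDOStub = getTownDOStub(ctx.env, input.townId);
    await townDOStub.destroy();

    // Existing behavior: remove rows from user_towns and user_rigs.
    const userStub = getGastownUserStub(ctx.env, ctx.userId);
    await userStub.deleteTown(input.townId);
  }),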

Fix 2: TownDO.destroy() must destroy ALL sub-DOs and their storage

TownDO.destroy() at Town.do.ts:4044-4058 currently iterates AgentDOs and calls their destroy() (which clears their agent_events tables and alarms). But it's missing:

Sub-DOs with independent storage that must be cleaned up:

| Sub-DO | Keyed by | Has destroy()? | Has storage? | Currently cleaned? |
|---|---|---|---|---|
| AgentDO | agentId | Yes (Agent.do.ts:113) | Yes: rig_agent_events table (high-volume event stream) | Yes: destroy() called in loop |
| TownContainerDO | townId | Yes (inherited from Container) | Yes: Container class internal state | No: never called |
| AgentIdentityDO | agentIdentity | No destroy() method | No (stub with ping() only) | N/A: no data to clean |

Add to TownDO.destroy():

// Kill the container process and clear TownContainerDO state
try {
  const containerStub = getTownContainerStub(this.env, this.townId);
  await containerStub.destroy(); // Sends SIGKILL to container process
} catch {
  // Best-effort — container may already be stopped
}

When AgentIdentityDO gains real storage (Phase 3, #224), add cleanup for those instances as well.

Fix 3: TownDO alarm must have an exit condition

Add a deleted flag to TownDO storage (or check if the town has zero config/zero rigs). At the top of alarm(), check the flag and skip re-arming:

if (this.isDeleted()) {
  await this.ctx.storage.deleteAlarm();
  return; // Do NOT re-arm
}
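
One way to make the flag durable yet cheap to check synchronously, sketched as a TownDO excerpt; the 'town:deleted' storage key is an assumption:

private deleted = false; // cached in memory; loaded during initialization

async destroy(): Promise<void> {
  // ... existing agent and container cleanup ...
  await this.ctx.storage.deleteAll();
  await this.ctx.storage.put('town:deleted', true); // written after deleteAll() so it survives the wipe
  await this.ctx.storage.deleteAlarm();
  this.deleted = true;
}

isDeleted(): boolean {
  return this.deleted;
}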

Fix 4: Prevent alarm resurrection after destroy

In initializeDatabase(), check for a deleted sentinel before calling armAlarmIfNeeded(). If destroy() was called, it should have written a sentinel value that survives deleteAll() (i.e., written after the wipe), or used a different mechanism such as a flag on the DO class instance.
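
A sketch of that guard, reusing the sentinel key assumed in Fix 3:

// At the top of initializeDatabase(): load the tombstone before any setup.
this.deleted = (await this.ctx.storage.get('town:deleted')) === true;
if (this.deleted) {
  return; // tombstoned: rebuild nothing, arm no alarm
}
await this.armAlarmIfNeeded();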

Alternative: update the compatibility date to 2026-02-24 so deleteAll() also clears alarms, making resurrection less likely.
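
If the alternative is taken, the change is a single field (file and line per root cause #6):

// wrangler.jsonc
{
  "compatibility_date": "2026-02-24"
}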

Fix 5: ensureContainerReady() should not ping containers for idle/deleted towns

Only call ensureContainerReady() if hasActiveWork() returns true. Idle towns don't need a running container. This lets the sleepAfter timer expire naturally for towns with no active work.
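
A sketch of the guard in the alarm handler, assuming ensureContainerReady() is currently called unconditionally on each tick:

// Only keep the container warm while there is real work; idle and deleted
// towns let the container's sleepAfter timer expire naturally.
if (this.hasActiveWork()) {
  await this.ensureContainerReady();
}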

Fix 6: Orphan sweep (safety net)

Add a mechanism to detect and clean up orphaned DOs (a sketch follows the list):

  • A Cron Trigger or admin endpoint that lists all active towns (from all GastownUserDOs) and compares against TownDOs that still have active alarms
  • Any TownDO not referenced by a GastownUserDO and with no active beads gets destroy() called
  • This is the catch-all for any leaks the primary fixes miss
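
A sketch of the sweep, written as a scheduled() entry point. Cloudflare provides no way to enumerate Durable Object instances, so listReferencedTownIds() and listAllTownIds() are assumed app-level bookkeeping (e.g., the GastownUserDOs plus a registry), and the TownDO HTTP endpoints shown are hypothetical:

interface Env {
  TOWN_DO: DurableObjectNamespace;
}

// Both helpers are assumptions; DO namespaces cannot be listed natively.
declare function listReferencedTownIds(env: Env): Promise<string[]>;
declare function listAllTownIds(env: Env): Promise<string[]>;

export async function sweepOrphanedTowns(env: Env): Promise<void> {
  const referenced = new Set(await listReferencedTownIds(env));
  for (const townId of await listAllTownIds(env)) {
    if (referenced.has(townId)) continue; // still owned by some user
    const stub = env.TOWN_DO.get(env.TOWN_DO.idFromName(townId));
    try {
      // Hypothetical endpoints: skip towns with active beads, destroy the rest.
      const res = await stub.fetch('http://town/active-beads');
      const { active } = (await res.json()) as { active: boolean };
      if (active) continue;
      await stub.fetch('http://town/destroy', { method: 'POST' });
    } catch {
      // Best-effort; missed orphans are retried on the next sweep.
    }
  }
}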

Acceptance Criteria

  • tRPC deleteTown calls TownDO.destroy() and TownContainerDO.destroy()
  • TownDO.destroy() destroys the TownContainerDO (kills container process + clears state) and all AgentDOs (clears rig_agent_events tables)
  • All sub-DO storage is cleared: AgentDO event tables, TownContainerDO internal state, TownDO SQLite (beads, agents, rigs, events, config, convoy_metadata, review_metadata, bead_dependencies — everything)
  • TownDO alarm does not re-arm after deletion (exit condition + resurrection prevention)
  • ensureContainerReady() only pings containers for towns with active work
  • Deleting a town from the UI results in: alarm stopped, container killed, storage cleared
  • Orphan sweep mechanism for catching leaked DOs
  • Compatibility date updated to 2026-02-24 or later

Notes

  • This is actively burning Cloudflare resources in production right now
  • The fix for the primary bug (root cause #1) is a one-line change: add townDOStub.destroy() to the tRPC mutation. The other fixes are defense-in-depth.
  • Cloudflare DOs cannot be truly "deleted" — they exist forever once created. But we can ensure they stop consuming resources by clearing storage, deleting alarms, and preventing resurrection.
  • After fixing, we should manually clean up the existing orphaned DOs. An admin script that calls destroy() on each known orphaned TownDO/TownContainerDO would suffice.
