Problem
Deleting a town from the UI does not clean up the underlying Durable Objects or containers. With only 3 active users and 4 active towns, there are tens of TownDOs still running alarm loops and ~10 extra TownContainerDOs consuming container instances. Deleted towns continue running indefinitely, burning Cloudflare resources.
Root Causes (6 compounding issues)
1. tRPC deleteTown doesn't call TownDO.destroy() — THE PRIMARY BUG
File: src/trpc/router.ts:156-162
The UI calls deleteTown via tRPC, which calls GastownUserDO.deleteTown(). This only deletes rows from the user_towns and user_rigs tables in GastownUserDO's storage. It never touches the TownDO, TownContainerDO, or AgentDOs. The town is removed from the user's list but continues running everything.
The HTTP handler at src/handlers/towns.handler.ts:128-146 does call townDOStub.destroy(), but the tRPC path (which the UI uses) doesn't.
2. Neither deletion path destroys the TownContainerDO
TownDO.destroy() at Town.do.ts:4044-4058 deletes agents and clears TownDO storage, but never calls getTownContainerStub().destroy(). The HTTP handler also doesn't destroy the container. No code path stops the container when a town is deleted.
3. TownDO alarm unconditionally re-arms
File: Town.do.ts:2638-2640
There is no exit condition. Even idle towns with no work fire every 60 seconds forever. The alarm never checks if the town has been deleted.
4. TownDO alarm keeps the container alive
ensureContainerReady() at Town.do.ts:3775 calls container.fetch('http://container/health') on every alarm tick when the town has rigs. Each health check resets the TownContainerDO's sleepAfter 30-minute timer. Since the alarm fires every 5-60 seconds, the container never sleeps — even for deleted towns that still have rigs in their tables.
5. Constructor re-arms alarm on any interaction
initializeDatabase() at Town.do.ts:437-439 calls armAlarmIfNeeded(). Even after a successful destroy(), if anything touches the DO stub (a stale request, dashboard polling, a late API call), the constructor fires and re-creates the alarm loop from scratch — on a DO with empty tables. This is a resurrection problem.
6. Compatibility date too old for deleteAll() to clear alarms
wrangler.jsonc:5 has compatibility date 2026-01-27. Cloudflare made deleteAll() also clear alarms as of compatibility date 2026-02-24. The code compensates by calling deleteAlarm() explicitly in destroy(), but this is fragile when destroy() is never called (root cause #1).
Resource Leak Summary
| Resource | tRPC deleteTown (UI path) | HTTP DELETE handler |
| --- | --- | --- |
| GastownUserDO rows | Deleted | Deleted |
| TownDO storage | Never cleared | Cleared via destroy() |
| TownDO alarm | Never deleted; fires every 5-60s forever | Deleted, but can resurrect |
| TownContainerDO | Never stopped | Never stopped |
| AgentDOs | Never cleaned | Cleaned via TownDO.destroy() |
| Container process | Runs indefinitely | Runs indefinitely |
Fixes
Fix 1: tRPC deleteTown must call TownDO.destroy()
In src/trpc/router.ts:156-162, after verifyTownOwnership, call townDOStub.destroy() before (or alongside) userStub.deleteTown(). Match the HTTP handler's behavior.
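A minimal sketch of the shape of that change (the standalone function form, stub interfaces, and parameter names below are assumptions for illustration, not the actual router code):

```ts
// Sketch only — stub interfaces and wiring are assumptions, not the real router.
interface TownDOStub { destroy(): Promise<void> }
interface GastownUserDOStub { deleteTown(townId: string): Promise<void> }

async function deleteTownFixed(opts: {
  verifyTownOwnership: () => Promise<void>;
  townDOStub: TownDOStub;
  userStub: GastownUserDOStub;
  townId: string;
}): Promise<void> {
  await opts.verifyTownOwnership();

  // NEW: tear down the TownDO first so its alarm and storage are cleared even if
  // the user-table cleanup below fails. Per Fix 2, this should also cascade to the
  // TownContainerDO and AgentDOs.
  await opts.townDOStub.destroy();

  // Existing behavior: remove the rows from user_towns / user_rigs.
  await opts.userStub.deleteTown(opts.townId);
}
```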
Fix 2: TownDO.destroy() must destroy ALL sub-DOs and their storage
TownDO.destroy() at Town.do.ts:4044-4058 currently iterates AgentDOs and calls their destroy() (which clears their rig_agent_events tables and alarms), but it never touches the TownContainerDO, the other sub-DO with independent state that must be cleaned up. Add to TownDO.destroy():
```ts
// Kill the container process and clear TownContainerDO state
try {
  const containerStub = getTownContainerStub(this.env, this.townId);
  await containerStub.destroy(); // Sends SIGKILL to container process
} catch {
  // Best-effort — container may already be stopped
}
```
When AgentIdentityDO gains real storage (Phase 3, #224), add cleanup for those instances as well.
Fix 3: TownDO alarm must have an exit condition
Add a deleted flag to TownDO storage (or check if the town has zero config/zero rigs). At the top of alarm(), check the flag and skip re-arming:
```ts
if (this.isDeleted()) {
  await this.ctx.storage.deleteAlarm();
  return; // Do NOT re-arm
}
```
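One possible wiring for that flag, shown as a partial sketch of the class (the 'deleted' key, the field name, and the base-class details are assumptions, not the real Town.do.ts):

```ts
import { DurableObject } from 'cloudflare:workers';

// Partial sketch only — the real TownDO is far larger and may differ in structure.
export class TownDO extends DurableObject<Env> {
  private deleted = false;

  constructor(ctx: DurableObjectState, env: Env) {
    super(ctx, env);
    // Load the sentinel once so isDeleted() stays synchronous inside alarm().
    ctx.blockConcurrencyWhile(async () => {
      this.deleted = (await ctx.storage.get<boolean>('deleted')) === true;
    });
  }

  isDeleted(): boolean {
    return this.deleted;
  }

  async markDeleted(): Promise<void> {
    // Call at the very end of destroy(), after deleteAll(), so the key survives the wipe.
    await this.ctx.storage.put('deleted', true);
    this.deleted = true;
  }
}
```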
Fix 4: Prevent alarm resurrection after destroy
In initializeDatabase(), check for a deleted sentinel before calling armAlarmIfNeeded(). If destroy() was called, it should write a sentinel value that survives deleteAll() (or use a different mechanism like a flag on the DO class instance).
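Continuing the partial sketch from Fix 3, the guard could sit at the top of initializeDatabase() (method names are the ones the issue cites; the body here is assumed):

```ts
// Inside the same sketch class — illustrative only, not the real initializeDatabase().
private async initializeDatabase(): Promise<void> {
  if (this.isDeleted()) {
    // destroy() already ran for this DO: don't recreate tables, don't re-arm the alarm.
    return;
  }
  // ... existing table creation ...
  await this.armAlarmIfNeeded();
}
```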
Alternative: update the compatibility date to 2026-02-24 so deleteAll() also clears alarms, making resurrection less likely.
Fix 5: ensureContainerReady() should not ping containers for idle/deleted towns
Only call ensureContainerReady() if hasActiveWork() returns true. Idle towns don't need a running container. This lets the sleepAfter timer expire naturally for towns with no active work.
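A sketch of that gating at the top of alarm() (hasActiveWork() and ensureContainerReady() are the names the issue uses; the rest is assumed):

```ts
// Inside the same sketch class — illustrative only.
async alarm(): Promise<void> {
  if (this.isDeleted()) {
    await this.ctx.storage.deleteAlarm();
    return; // Fix 3: deleted towns never re-arm
  }

  if (await this.hasActiveWork()) {
    // Only towns with real work keep the container warm.
    await this.ensureContainerReady();
  }
  // Idle towns skip the health ping, so the TownContainerDO's 30-minute
  // sleepAfter timer can expire and release the container instance.

  // ... process work, then re-arm via armAlarmIfNeeded() ...
}
```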
Fix 6: Orphan sweep (safety net)
Add a mechanism to detect and clean up orphaned DOs (see the sketch after this list):
A Cron Trigger or admin endpoint that lists all active towns (from all GastownUserDOs) and compares against TownDOs that still have active alarms
Any TownDO not referenced by a GastownUserDO and with no active beads gets destroy() called
This is the catch-all for any leaks the primary fixes miss
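A rough sketch of such a sweep, suitable for an admin endpoint or Cron Trigger handler (every helper below is a hypothetical placeholder, not the project's real API):

```ts
// All helpers here (listAllUserIds, getGastownUserStub, listKnownTownIds,
// getTownDOStub) are hypothetical placeholders for illustration.
interface TownStub { hasActiveBeads(): Promise<boolean>; destroy(): Promise<void> }
interface UserStub { listTowns(): Promise<{ id: string }[]> }

declare function listAllUserIds(env: unknown): Promise<string[]>;
declare function getGastownUserStub(env: unknown, userId: string): UserStub;
declare function listKnownTownIds(env: unknown): Promise<string[]>;
declare function getTownDOStub(env: unknown, townId: string): TownStub;

export async function sweepOrphanedTowns(env: unknown): Promise<void> {
  // 1. Every town still referenced by some GastownUserDO is not an orphan.
  const referenced = new Set<string>();
  for (const userId of await listAllUserIds(env)) {
    for (const town of await getGastownUserStub(env, userId).listTowns()) {
      referenced.add(town.id);
    }
  }

  // 2. Destroy any known TownDO that no user references and that has no active beads.
  for (const townId of await listKnownTownIds(env)) {
    if (referenced.has(townId)) continue;
    const townStub = getTownDOStub(env, townId);
    if (await townStub.hasActiveBeads()) continue; // don't interrupt in-flight work
    await townStub.destroy(); // cascades to container + agents once Fix 2 lands
  }
}
```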
Acceptance Criteria
tRPC deleteTown calls TownDO.destroy() and TownContainerDO.destroy()
TownDO.destroy() destroys the TownContainerDO (kills container process + clears state) and all AgentDOs (clears rig_agent_events tables)
ensureContainerReady() only pings containers for towns with active work
Compatibility date updated to 2026-02-24 or later
Notes
Cloudflare DOs cannot be truly "deleted" — they exist forever once created. But we can ensure they stop consuming resources by clearing storage, deleting alarms, and preventing resurrection.
The minimal fix is Fix 1: adding townDOStub.destroy() to the tRPC mutation. The other fixes are defense-in-depth.
After fixing, we should manually clean up the existing orphaned DOs. An admin script that calls destroy() on each known orphaned TownDO/TownContainerDO would suffice.
Parent
Part of #204 (Phase 3: Multi-Rig + Scaling)