Parent
Sub-issue of #204 (Phase 4: Hardening)
Summary
Two in_progress issue beads (d93f38f3, f4ac6ea2) have been stuck for 1.5+ hours in town 8a6f9375. Their assigned agents (Toast, Maple) are idle but still hooked to the beads, with dispatch_attempts=2. The alarm is running (5s intervals, nextFireAt advancing). reconcileBeads Rule 3 (STALE_IN_PROGRESS_TIMEOUT_MS = 5 min) should be resetting these beads to open, but the beads remain in_progress.
Observed State
- Beads: in_progress, updated_at = 2026-03-21T05:51:10 (1.5h stale)
- Agents: idle, hooked to the beads, dispatch_attempts=2, last_activity_at = 2026-03-21T05:55:50
- Alarm: active (5s), nextFireAt advancing normally
- Recent events: last event at 05:56, no events for 1.5 hours despite alarm running
- Workers observability: 0 exceptions, 0 reconciler log messages
- No triage requests or GUPP escalations created for this problem
Diagnosis
The reconciler (reconcile() at reconciler.ts:299) appears to be silently failing or returning 0 actions every tick. If it were running correctly, Rule 3 would match both beads (stale in_progress, no working/stalled agent hooked, last_activity_at older than 90s).
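As a sanity check on the claim that Rule 3 should match, here is a minimal sketch of the predicate as described above. It is not the code in reconciler.ts: the BeadLike and AgentLike shapes, the function name isStaleInProgress, and the constant name AGENT_ACTIVITY_GRACE_MS are all hypothetical, and the timestamps from this issue are assumed to be UTC.

```typescript
// Hypothetical row shapes; the real rows in reconciler.ts may differ.
interface BeadLike {
  status: "open" | "in_progress" | "closed";
  updatedAt: string; // ISO-8601
}

interface AgentLike {
  status: "idle" | "working" | "stalled";
  lastActivityAt: string | null; // ISO-8601
}

const STALE_IN_PROGRESS_TIMEOUT_MS = 5 * 60 * 1000; // 5 min, as stated in this issue
const AGENT_ACTIVITY_GRACE_MS = 90 * 1000; // assumed name for the 90s threshold

// True when a bead looks abandoned: in_progress for too long, with no
// working/stalled agent hooked and no recent agent activity.
function isStaleInProgress(
  bead: BeadLike,
  hookedAgent: AgentLike | null,
  nowMs: number,
): boolean {
  if (bead.status !== "in_progress") return false;
  if (nowMs - Date.parse(bead.updatedAt) < STALE_IN_PROGRESS_TIMEOUT_MS) return false;
  if (hookedAgent === null) return true;
  if (hookedAgent.status === "working" || hookedAgent.status === "stalled") return false;
  if (hookedAgent.lastActivityAt === null) return true;
  return nowMs - Date.parse(hookedAgent.lastActivityAt) > AGENT_ACTIVITY_GRACE_MS;
}

// Observed state from this issue, with an illustrative "now" about 1.5h later.
const observedNowMs = Date.parse("2026-03-21T07:25:00Z");
const stuckBead: BeadLike = { status: "in_progress", updatedAt: "2026-03-21T05:51:10Z" };
const idleAgent: AgentLike = { status: "idle", lastActivityAt: "2026-03-21T05:55:50Z" };
const shouldReset = isStaleInProgress(stuckBead, idleAgent, observedNowMs);
```

Under these assumptions shouldReset is true for both beads, which is consistent with the reconciler never reaching Rule 3 rather than Rule 3 declining to match.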
Possible causes:
- Zod parse failure in an earlier rule: reconcileAgents() runs before reconcileBeads(). If AgentRow.array().parse() throws (e.g., a new column added to AgentMetadataRecord.pick() but not selected in the query), the entire reconcile() throws. The catch at Town.do.ts:2977 logs to console.error, which is invisible (DO alarm events aren't captured by Workers observability).
- Phase 0 event drain blocking: events.drainEvents() at line 2927 runs before reconcile(). If it throws or hangs, Phase 1 never runs. The catch at line 2947 handles individual events but not a failure in drainEvents() itself.
- Rule 3 fires but dispatch immediately re-sets in_progress: Rule 3 emits transition_bead(open) + clear_bead_assignee. But if Rule 1 or Rule 2 also matches in the same pass (idle+hooked agent with now-open bead), dispatch_agent immediately sets it back to in_progress. The net effect is a no-op with events logged. However, we see NO events for 1.5h, ruling this out.
Most likely: cause 1 (Zod parse failure)
The AgentRow schema picks last_event_type, last_event_at, and active_tools from AgentMetadataRecord. These columns were recently added via ALTER TABLE. If a town's DO was created before the migration ran, the columns might not exist in its SQLite schema. The AgentRow fields are .nullable().optional(), so missing values parse fine, but if the columns are absent from the table entirely, the SQL query itself would throw SqlStorageError: no such column before Zod ever runs. Either way the exception escapes reconcileAgents(), so reconcileBeads() and Rule 3 never execute.
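One way to confirm or rule out the missing-column theory is to compare the table's actual columns against the ones the query selects. The sketch below is pure logic over rows shaped like the output of SQLite's PRAGMA table_info, which returns one row per column with a name field. The helper name findMissingColumns is hypothetical, and whether the PRAGMA can be run from the DO's SQLite storage in this codebase is an assumption to verify.

```typescript
// Minimal subset of a PRAGMA table_info(...) row; other fields are ignored here.
interface TableInfoRow {
  name: string;
}

// Columns named in this issue as recently added via ALTER TABLE.
const RECENTLY_ADDED_AGENT_COLUMNS: readonly string[] = [
  "last_event_type",
  "last_event_at",
  "active_tools",
];

// Returns the required columns that are absent from the table, in input order.
function findMissingColumns(
  tableInfo: readonly TableInfoRow[],
  required: readonly string[],
): string[] {
  const present = new Set(tableInfo.map((row) => row.name));
  return required.filter((column) => !present.has(column));
}

// Illustrative input only: a table created before the migration ran.
const preMigrationTable: TableInfoRow[] = [
  { name: "bead_id" },
  { name: "status" },
  { name: "last_activity_at" },
];
const missingColumns = findMissingColumns(preMigrationTable, RECENTLY_ADDED_AGENT_COLUMNS);
```

A non-empty result for town 8a6f9375 would point at the migration rather than at Zod. The column names in preMigrationTable are made up for the example and do not reflect the real schema.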
Recommended Fix
- Add try/catch per reconcile sub-function: instead of one big reconcile(), catch errors in each reconcileAgents(), reconcileBeads(), etc. so a failure in one doesn't block all others.
- Emit a health event from the reconciler: write a reconciler_tick town_event every N ticks (e.g. every 60s) with metrics. This makes reconciler health visible via the debug endpoint.
- Investigate the actual error: add a temporary debug field to the /debug/towns/:id/status response that runs reconcile() in a try/catch and returns the error if it throws.
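The first two recommendations could be combined roughly as follows. This is a sketch under stated assumptions, not the existing reconciler: the Action and TickReport shapes and the names runStep, reconcileIsolated, and shouldEmitHealth are hypothetical, and the sub-functions are assumed to be synchronous and to return a list of actions.

```typescript
// Hypothetical action shape; the real reconciler emits richer actions.
interface Action {
  type: string;
  beadId?: string;
}

interface StepOutcome {
  actions: Action[];
  error: string | null;
}

interface TickReport {
  actions: Action[];
  failedSteps: { step: string; error: string }[];
}

// Run one sub-function and turn a throw into data instead of aborting the tick.
function runStep(fn: () => Action[]): StepOutcome {
  try {
    return { actions: fn(), error: null };
  } catch (err) {
    const error = err instanceof Error ? `${err.name}: ${err.message}` : String(err);
    return { actions: [], error };
  }
}

// Run every step in order; a failure in one step does not skip the later ones.
function reconcileIsolated(
  steps: ReadonlyArray<[string, () => Action[]]>,
): TickReport {
  const report: TickReport = { actions: [], failedSteps: [] };
  for (const [step, fn] of steps) {
    const outcome = runStep(fn);
    report.actions.push(...outcome.actions);
    if (outcome.error !== null) {
      report.failedSteps.push({ step, error: outcome.error });
    }
  }
  return report;
}

// With a 5s alarm and a 60s health interval, this is true once every 12 ticks.
function shouldEmitHealth(tickCount: number, tickMs = 5000, everyMs = 60000): boolean {
  const ticksPerEmit = Math.max(1, Math.round(everyMs / tickMs));
  return tickCount % ticksPerEmit === 0;
}

// Illustrative tick: the agents step throws, the beads step still runs.
const tickReport = reconcileIsolated([
  ["reconcileAgents", () => { throw new Error("no such column: last_event_type"); }],
  ["reconcileBeads", () => [
    { type: "transition_bead", beadId: "d93f38f3" },
    { type: "clear_bead_assignee", beadId: "d93f38f3" },
  ]],
]);
```

Here tickReport carries both the two bead actions and the captured reconcileAgents error, so the failure can be written into a reconciler_tick event or returned from the debug endpoint instead of disappearing into console.error. Whether it is safe for reconcileBeads() to act on state that reconcileAgents() failed to reconcile is a design question this sketch does not answer.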
Affected Code
cloudflare-gastown/src/dos/Town.do.ts:2954-2979 — reconciler Phase 1
cloudflare-gastown/src/dos/town/reconciler.ts:299-306 — reconcile() top-level
cloudflare-gastown/src/dos/town/reconciler.ts:326-361 — reconcileAgents() query + parse