Problem

When beads transition to `failed`, users often have no idea why. Of the 16 code paths that can fail a bead, 8 record no reason or message — the bead silently goes to `failed` with zero diagnostic context. Even when the container provides a `reason` parameter (e.g., agent crash details), it's discarded and never persisted.
This makes failed beads opaque and unrecoverable. Users can't distinguish "dispatch timed out 5 times" from "agent crashed" from "merge conflict" from "missing rig config." The only option is to delete the bead and try again, losing any context about what went wrong.
Current State: Every Failure Path
Good visibility (reason recorded)

| Path | Trigger | How reason is recorded |
| --- | --- | --- |
| Refinery exit without merge | Refinery exits without closing MR bead | `review_completed` event with message |
| Invalid PR URL | Non-HTTPS `pr_url` from refinery | `review_completed` event + `pr_creation_failed` event |
| Review failed/conflict | Refinery reports failure or merge conflict | `review_completed` event with message |
| Triage CLOSE_BEAD | Triage agent decides to close bead | `triage_resolved` event with `resolution_notes` |
| PR closed without merge | External PR closed on GitHub/GitLab | `review_completed` event with "PR closed without merge" |
Poor or no visibility (no reason recorded)

| Path | Trigger | File:Line | What's missing |
| --- | --- | --- | --- |
| Dispatch exhaustion | 5 failed container dispatches | `Town.do.ts:2953` | No indication of why dispatch failed. Most common automated failure path. |
| Orphaned work | Safety net for dispatch exhaustion | `patrol.ts:337` | Bare status change, no explanation |
| Bead timeout | `metadata.timeout_ms` exceeded | `patrol.ts:430` | Timeout duration logged to console only, not persisted |
| Agent completed (failed) | Container reports agent crash/failure | `review-queue.ts:624` | The `reason` parameter from the container is discarded — never written to any event |
| No rig_id on MR bead | MR bead missing rig_id | `Town.do.ts:3284` | Uses raw SQL `completeReview()` — no bead_event logged at all |
| No rig config | Rig config missing from storage | `Town.do.ts:3289` | Uses raw SQL `completeReview()` — no bead_event logged at all |
| Container start fail | Refinery container fails to start | `Town.do.ts:3406` | Uses raw SQL `completeReview()` — no bead_event logged at all |
| Admin force-fail | Admin manually fails a bead | `router.ts:1032` | No reason or message recorded |
Additional problems

- `completeReview()` bypasses `updateBeadStatus()`: the three raw-SQL paths (`Town.do.ts:3284`, `3289`, `3406`) set `status = 'failed'` directly, which means no `status_changed` bead_event is logged and no convoy progress update is triggered. These beads fail silently with zero audit trail.
- `agentCompleted()` discards the `reason` parameter: the container sends crash/failure details, but `review-queue.ts:624` only uses `reason` for refinery rework messages, not for polecat failures. The reason is lost.
Solution
Two-part fix: (1) persist a structured failure reason on every bead failure, and (2) render failure reasons in the UI with rich context.
1. Add failure_reason to bead events
Every status_changed event that transitions to failed should include a structured failure_reason in the event metadata:
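For concreteness, a sketch of the metadata shape in TypeScript; the field names (`code`, `message`, `source`, `details`) are illustrative rather than a settled schema:

```typescript
// Illustrative shape for the proposed failure_reason metadata.
// Field names are assumptions, not the final schema.
interface FailureReason {
  code: string;                      // machine-readable, e.g. "dispatch_exhausted"
  message: string;                   // human-readable explanation for the UI
  source?: string;                   // code path that failed the bead, e.g. "patrol.ts:430"
  details?: Record<string, unknown>; // raw error output, attempt counts, timings
}

// Example status_changed event metadata for a timed-out bead:
const metadata = {
  from: 'in_progress',
  to: 'failed',
  failure_reason: {
    code: 'timeout',
    message: 'Bead exceeded its timeout of 60 minutes (in_progress for 72 minutes)',
    source: 'patrol.ts:430',
    details: { timeout_ms: 3_600_000, elapsed_ms: 4_320_000 },
  } satisfies FailureReason,
};
```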
Each failure path gets a specific code and message:
| Path | `code` | `message` example |
| --- | --- | --- |
| Dispatch exhaustion | `dispatch_exhausted` | "Agent failed to start 5 times. Last error: container timeout after 30s" |
| Orphaned work | `orphaned_work` | "Agent was idle with a hooked bead for over 30 minutes with no dispatch activity" |
| Bead timeout | `timeout` | "Bead exceeded its timeout of 60 minutes (in_progress for 72 minutes)" |
| Agent crash | `agent_crashed` | "Agent process exited with status: failed. Reason: out of memory" |
| No rig_id | `missing_rig_id` | "Merge request bead has no rig_id — cannot determine which rig to review" |
| No rig config | `missing_rig_config` | "Rig configuration not found in storage for rig xyz" |
| Container start fail | `container_start_failed` | "Failed to start refinery container: timeout waiting for server after 30s" |
| Admin force-fail | `admin_force_fail` | "Manually failed by admin" |
2. Fix completeReview() callers to use updateBeadStatus()
The three paths that use raw SQL (`Town.do.ts:3284`, `3289`, `3406`) should call `updateBeadStatus()` instead (see the sketch after this list), so that:

- A `status_changed` bead_event is logged with the failure reason
- Convoy progress is updated
- The audit trail is complete
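A minimal sketch of the change at the `Town.do.ts:3284` call site, assuming `updateBeadStatus()` accepts event metadata; the real signature may differ:

```typescript
// Before (per the issue): raw SQL sets the status directly, so no
// status_changed bead_event is logged and convoy progress never updates.
//   sql`UPDATE beads SET status = 'failed' WHERE id = ${beadId}`
//
// After: route the failure through updateBeadStatus() so the event
// carries a structured failure_reason. Signature assumed for illustration.
await updateBeadStatus(beadId, 'failed', {
  failure_reason: {
    code: 'missing_rig_id',
    message: 'Merge request bead has no rig_id — cannot determine which rig to review',
    source: 'Town.do.ts:3284',
  },
});
```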
3. Persist reason from agentCompleted()
In review-queue.ts:624, when a polecat agent completes with status: 'failed', write input.reason into the bead_event metadata. Currently only the refinery path uses the reason.
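A sketch of the change; the `input` field names follow the issue's wording and are assumptions about the real handler:

```typescript
// In agentCompleted(): when a polecat agent reports failure, persist the
// container-provided reason instead of dropping it.
if (input.status === 'failed') {
  await updateBeadStatus(input.bead_id, 'failed', {
    failure_reason: {
      code: 'agent_crashed',
      message: `Agent process exited with status: failed. Reason: ${input.reason ?? 'unknown'}`,
      source: 'review-queue.ts:624',
      details: { reason: input.reason },
    },
  });
}
```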
4. Render failure reasons in the UI
The bead detail panel should display failure reasons prominently (a sketch of the extraction logic follows this list):

- A red banner at the top of a failed bead showing the human-readable message
- An expandable details section with the failure code, source, and any additional context (error output, container logs)
- A timeline entry in the bead events with the full failure context
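A framework-agnostic sketch of pulling the banner text from a bead's events; the event and field names mirror the assumed schema above:

```typescript
interface BeadEvent {
  type: string;
  metadata?: {
    to?: string;
    failure_reason?: { code: string; message: string };
  };
}

// Surface the most recent failure reason for the red banner, if any.
function failureBannerText(events: BeadEvent[]): string | null {
  for (let i = events.length - 1; i >= 0; i--) {
    const e = events[i];
    if (e.type === 'status_changed' && e.metadata?.to === 'failed' && e.metadata.failure_reason) {
      return e.metadata.failure_reason.message;
    }
  }
  return null; // beads failed before this fix may carry no recorded reason
}
```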
5. LLM-written failure summaries (stretch)
For complex failures (agent crashes with long error output, repeated dispatch failures with different errors each time), dispatch a short LLM call to summarize the failure in plain language. Store the summary alongside the raw failure data. This turns "agent_crashed: Error: Cannot find module '@/lib/auth' at Object. (/workspace/rigs/abc/...)" into "The agent crashed because it couldn't find the auth module — this usually means the dependency wasn't installed or the import path is wrong."
This could piggyback on the triage agent infrastructure (#442) or be a dedicated lightweight call.
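A rough sketch of the gating; `callLLM` is a stand-in for whatever completion helper the triage infrastructure exposes, not a real API in this codebase:

```typescript
declare function callLLM(prompt: string): Promise<string>; // hypothetical helper

const SUMMARY_THRESHOLD = 500; // chars of raw detail before a summary adds value

// Only summarize failures with long or noisy raw output; simple codes like
// dispatch_exhausted or admin_force_fail already read fine on their own.
async function summarizeFailure(reason: {
  code: string;
  message: string;
  details?: unknown;
}): Promise<string | null> {
  const raw = JSON.stringify(reason.details ?? '');
  if (raw.length < SUMMARY_THRESHOLD) return null;
  return callLLM(
    'Summarize this bead failure in two plain-language sentences:\n' +
      `code=${reason.code}\nmessage=${reason.message}\ndetails=${raw}`,
  );
}
```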
Acceptance Criteria
- Every code path that transitions a bead to `failed` includes a structured `failure_reason` in the bead_event metadata
- `completeReview()` raw SQL paths replaced with `updateBeadStatus()` for proper event logging
- `agentCompleted()` persists the container's `reason` parameter for polecat failures (not just refinery)
- Failure reasons visible in the activity feed timeline
- Dispatch exhaustion captures the last dispatch error for context
Notes
- No data migration needed — cloud Gastown hasn't deployed to production
- The `bead_events` table already supports a `metadata` JSON column — the failure reason fits naturally there
- The LLM summary (stretch goal) should only fire for failures with complex error output, not for simple cases like "dispatch exhausted" or "admin force-fail" where the message is already clear
Parent
Part of #204 (Phase 4: Hardening)