Bead failure reasons — explain why beads failed, not just that they failed #1172

@jrf0110

Description


Parent

Part of #204 (Phase 4: Hardening)

Problem

When beads transition to failed, users often have no idea why. Of the 16 code paths that can fail a bead, 8 record no reason or message — the bead silently goes to failed with zero diagnostic context. Even when the container provides a reason parameter (e.g., agent crash details), it's discarded and never persisted.

This makes failed beads opaque and unrecoverable. Users can't distinguish "dispatch timed out 5 times" from "agent crashed" from "merge conflict" from "missing rig config." The only option is to delete the bead and try again, losing any context about what went wrong.

Current State: Every Failure Path

Good visibility (reason recorded)

| Path | Trigger | How reason is recorded |
| --- | --- | --- |
| Refinery exit without merge | Refinery exits without closing MR bead | review_completed event with message |
| Invalid PR URL | Non-HTTPS pr_url from refinery | review_completed event + pr_creation_failed event |
| Review failed/conflict | Refinery reports failure or merge conflict | review_completed event with message |
| Triage CLOSE_BEAD | Triage agent decides to close bead | triage_resolved event with resolution_notes |
| PR closed without merge | External PR closed on GitHub/GitLab | review_completed event with "PR closed without merge" |

Poor or no visibility (no reason recorded)

| Path | Trigger | File:Line | What's missing |
| --- | --- | --- | --- |
| Dispatch exhaustion | 5 failed container dispatches | Town.do.ts:2953 | No indication of why dispatch failed. Most common automated failure path. |
| Orphaned work | Safety net for dispatch exhaustion | patrol.ts:337 | Bare status change, no explanation |
| Bead timeout | metadata.timeout_ms exceeded | patrol.ts:430 | Timeout duration logged to console only, not persisted |
| Agent completed (failed) | Container reports agent crash/failure | review-queue.ts:624 | The reason parameter from the container is discarded — never written to any event |
| No rig_id on MR bead | MR bead missing rig_id | Town.do.ts:3284 | Uses raw SQL completeReview() — no bead_event logged at all |
| No rig config | Rig config missing from KV | Town.do.ts:3289 | Same — raw SQL, no event, no message |
| Refinery container fail | Container failed to start for refinery | Town.do.ts:3406 | Same — raw SQL, no event |
| Admin force-fail | Manual admin action | router.ts:1032 | Agent='admin' but no reason text |


Solution

Two-part fix: (1) persist a structured failure reason on every bead failure (items 1–3 below), and (2) render failure reasons in the UI with rich context (item 4). Item 5, LLM-written summaries, is a stretch goal.

1. Add failure_reason to bead events

Every status_changed event that transitions to failed should include a structured failure_reason in the event metadata:

```ts
type FailureReason = {
  code: string;           // machine-readable: 'dispatch_exhausted', 'agent_crashed', 'timeout', 'merge_conflict', etc.
  message: string;        // human-readable summary
  details?: string;       // optional: stack trace, error output, container logs
  source: string;         // what triggered it: 'scheduler', 'patrol', 'refinery', 'triage', 'admin', 'container'
};
```
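
A minimal sketch of how a failure path would attach this, assuming a hypothetical logBeadEvent(beadId, type, metadata) event writer (the real event-logging function may have a different shape):

```ts
// Hedged sketch: logBeadEvent is a stand-in for the real event writer.
async function failBeadWithReason(
  logBeadEvent: (beadId: string, type: string, metadata: unknown) => Promise<void>,
  beadId: string,
  reason: FailureReason,
): Promise<void> {
  // Every transition to failed carries the structured reason in metadata.
  await logBeadEvent(beadId, 'status_changed', {
    from: 'in_progress',
    to: 'failed',
    failure_reason: reason,
  });
}
```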

Each failure path gets a specific code and message:

| Path | Code | Example message |
| --- | --- | --- |
| Dispatch exhaustion | dispatch_exhausted | "Agent failed to start 5 times. Last error: container timeout after 30s" |
| Orphaned work | orphaned_work | "Agent was idle with a hooked bead for over 30 minutes with no dispatch activity" |
| Bead timeout | timeout | "Bead exceeded its timeout of 60 minutes (in_progress for 72 minutes)" |
| Agent crash | agent_crashed | "Agent process exited with status: failed. Reason: out of memory" |
| No rig_id | missing_rig_id | "Merge request bead has no rig_id — cannot determine which rig to review" |
| No rig config | missing_rig_config | "Rig configuration not found in storage for rig xyz" |
| Container start fail | container_start_failed | "Failed to start refinery container: timeout waiting for server after 30s" |
| Admin force-fail | admin_force_fail | "Manually failed by admin" |

2. Fix completeReview() callers to use updateBeadStatus()

The three paths that use raw SQL (Town.do.ts:3284, 3289, 3406) should use updateBeadStatus() instead (see the sketch after this list), so that:

  • A status_changed bead_event is logged with the failure reason
  • Convoy progress is updated
  • The audit trail is complete
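
A sketch of the shape this change could take at one of the call sites (Town.do.ts:3289, the missing rig config path); the updateBeadStatus signature and the source value are assumptions:

```ts
// Sketch only: updateBeadStatus's real signature may differ.
async function failBeadMissingRigConfig(
  updateBeadStatus: (
    beadId: string,
    status: 'failed',
    opts: { failure_reason: FailureReason },
  ) => Promise<void>,
  beadId: string,
  rigId: string,
): Promise<void> {
  // Replaces the raw-SQL completeReview() write, so a status_changed
  // bead_event is logged and convoy progress is updated.
  await updateBeadStatus(beadId, 'failed', {
    failure_reason: {
      code: 'missing_rig_config',
      message: `Rig configuration not found in storage for rig ${rigId}`,
      source: 'refinery', // illustrative; could also be 'scheduler'
    },
  });
}
```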

3. Persist reason from agentCompleted()

In review-queue.ts:624, when a polecat agent completes with status: 'failed', write input.reason into the bead_event metadata. Currently only the refinery path uses the reason.
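
A sketch of that change; apart from input.reason, the field and helper names here are assumptions about the surrounding code:

```ts
// Sketch: AgentCompletedInput and logBeadEvent are assumed shapes.
type AgentCompletedInput = {
  beadId: string;
  status: 'completed' | 'failed';
  reason?: string; // provided by the container, currently discarded
};

async function persistPolecatFailure(
  input: AgentCompletedInput,
  logBeadEvent: (beadId: string, type: string, metadata: unknown) => Promise<void>,
): Promise<void> {
  if (input.status !== 'failed') return;
  await logBeadEvent(input.beadId, 'status_changed', {
    from: 'in_progress',
    to: 'failed',
    failure_reason: {
      code: 'agent_crashed',
      message: `Agent process exited with status: failed. Reason: ${input.reason ?? 'unknown'}`,
      details: input.reason, // persist instead of dropping
      source: 'container',
    },
  });
}
```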

4. Render failure reasons in the UI

The bead detail panel should display failure reasons prominently (see the sketch after this list):

  • A red banner at the top of a failed bead showing the human-readable message
  • Expandable details section with the failure code, source, and any additional context (error output, container logs)
  • Timeline entry in the bead events with the full failure context
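
A minimal component sketch, assuming the dashboard is React and reusing the FailureReason type from above; names and markup are illustrative:

```tsx
// Sketch assuming a React dashboard; component, prop, and class names are
// illustrative, and FailureReason is the type defined earlier in this issue.
import { useState } from 'react';

export function FailureBanner({ reason }: { reason: FailureReason }) {
  const [expanded, setExpanded] = useState(false);
  return (
    <div role="alert" className="failure-banner">
      <strong>{reason.message}</strong>
      <button onClick={() => setExpanded((v) => !v)}>
        {expanded ? 'Hide details' : 'Show details'}
      </button>
      {expanded && (
        <dl>
          <dt>Code</dt>
          <dd>{reason.code}</dd>
          <dt>Source</dt>
          <dd>{reason.source}</dd>
          {reason.details && (
            <>
              <dt>Details</dt>
              <dd>
                <pre>{reason.details}</pre>
              </dd>
            </>
          )}
        </dl>
      )}
    </div>
  );
}
```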

5. LLM-written failure summaries (stretch)

For complex failures (agent crashes with long error output, repeated dispatch failures with different errors each time), dispatch a short LLM call to summarize the failure in plain language. Store the summary alongside the raw failure data. This turns "agent_crashed: Error: Cannot find module '@/lib/auth' at Object. (/workspace/rigs/abc/...)" into "The agent crashed because it couldn't find the auth module — this usually means the dependency wasn't installed or the import path is wrong."

This could piggyback on the triage agent infrastructure (#442) or be a dedicated lightweight call.
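
A sketch of how the gating could work, in line with the note below that simple failures should skip the LLM call; the threshold and code list are assumptions:

```ts
// Assumed heuristic: skip codes whose message is already self-explanatory,
// and only summarize when there is substantial raw error output.
const SELF_EXPLANATORY = new Set(['admin_force_fail', 'missing_rig_id', 'missing_rig_config']);

function shouldSummarize(reason: FailureReason): boolean {
  if (SELF_EXPLANATORY.has(reason.code)) return false;
  // dispatch_exhausted falls through: repeated dispatches with varying
  // errors may still benefit from a summary if the details are long.
  return (reason.details?.length ?? 0) > 500;
}
```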

Acceptance Criteria

  • Every code path that transitions a bead to failed includes a structured failure_reason in the bead_event metadata
  • completeReview() raw SQL paths replaced with updateBeadStatus() for proper event logging
  • agentCompleted() persists the container's reason parameter for polecat failures (not just refinery)
  • Bead detail panel renders failure reason prominently (banner + expandable details)
  • Failure reasons visible in the activity feed timeline
  • Dispatch exhaustion captures the last dispatch error for context

Notes

  • No data migration needed — cloud Gastown hasn't deployed to production
  • The bead_events table already supports a metadata JSON column — the failure reason fits naturally there
  • The LLM summary (stretch goal) should only fire for failures with complex error output, not for simple cases like "dispatch exhausted" or "admin force-fail" where the message is already clear
  • This overlaps with #228 ([Gastown] PR 22: Observability) — structured failure reasons are a form of structured logging that both the admin panel (#897, Gastown User-Scoped Admin Panel — Inspect & Intervene via GastownUserDO → TownDO) and the user dashboard can consume
