Bead failure reasons — explain why beads failed, not just that they failed #1172

@jrf0110

Description


Parent

Part of #204 (Phase 4: Hardening)

Problem

When beads transition to failed, users often have no idea why. Of the 16 code paths that can fail a bead, 8 record no reason or message — the bead silently goes to failed with zero diagnostic context. Even when the container provides a reason parameter (e.g., agent crash details), it's discarded and never persisted.

This makes failed beads opaque and unrecoverable. Users can't distinguish "dispatch timed out 5 times" from "agent crashed" from "merge conflict" from "missing rig config." The only option is to delete the bead and try again, losing any context about what went wrong.

Current State: Every Failure Path

Good visibility (reason recorded)

| Path | Trigger | How reason is recorded |
| --- | --- | --- |
| Refinery exit without merge | Refinery exits without closing MR bead | review_completed event with message |
| Invalid PR URL | Non-HTTPS pr_url from refinery | review_completed event + pr_creation_failed event |
| Review failed/conflict | Refinery reports failure or merge conflict | review_completed event with message |
| Triage CLOSE_BEAD | Triage agent decides to close bead | triage_resolved event with resolution_notes |
| PR closed without merge | External PR closed on GitHub/GitLab | review_completed event with "PR closed without merge" |

Poor or no visibility (no reason recorded)

| Path | Trigger | File:Line | What's missing |
| --- | --- | --- | --- |
| Dispatch exhaustion | 5 failed container dispatches | Town.do.ts:2953 | No indication of why dispatch failed. Most common automated failure path. |
| Orphaned work | Safety net for dispatch exhaustion | patrol.ts:337 | Bare status change, no explanation |
| Bead timeout | metadata.timeout_ms exceeded | patrol.ts:430 | Timeout duration logged to console only, not persisted |
| Agent completed (failed) | Container reports agent crash/failure | review-queue.ts:624 | The reason parameter from the container is discarded — never written to any event |
| No rig_id on MR bead | MR bead missing rig_id | Town.do.ts:3284 | Uses raw SQL completeReview() — no bead_event logged at all |
| No rig config | Rig config missing from KV | Town.do.ts:3289 | Same — raw SQL, no event, no message |
| Refinery container fail | Container failed to start for refinery | Town.do.ts:3406 | Same — raw SQL, no event |
| Admin force-fail | Manual admin action | router.ts:1032 | Agent='admin' but no reason text |


Solution

Two-part fix: (1) persist a structured failure reason on every bead failure (items 1–3 below), and (2) render failure reasons in the UI with rich context (item 4). Item 5, LLM-written summaries, is a stretch goal.

1. Add failure_reason to bead events

Every status_changed event that transitions to failed should include a structured failure_reason in the event metadata:

```ts
type FailureReason = {
  code: string;           // machine-readable: 'dispatch_exhausted', 'agent_crashed', 'timeout', 'merge_conflict', etc.
  message: string;        // human-readable summary
  details?: string;       // optional: stack trace, error output, container logs
  source: string;         // what triggered it: 'scheduler', 'patrol', 'refinery', 'triage', 'admin', 'container'
};
```
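
A minimal sketch of how a failure path would attach this, assuming a hypothetical logBeadEvent(beadId, type, metadata) event writer (the real event-logging function may have a different shape):

```ts
// Hedged sketch: logBeadEvent is a stand-in for the real event writer.
async function failBeadWithReason(
  logBeadEvent: (beadId: string, type: string, metadata: unknown) => Promise<void>,
  beadId: string,
  reason: FailureReason,
): Promise<void> {
  // Every transition to failed carries the structured reason in metadata.
  await logBeadEvent(beadId, 'status_changed', {
    from: 'in_progress',
    to: 'failed',
    failure_reason: reason,
  });
}
```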

Each failure path gets a specific code and message:

| Path | Code | Example message |
| --- | --- | --- |
| Dispatch exhaustion | dispatch_exhausted | "Agent failed to start 5 times. Last error: container timeout after 30s" |
| Orphaned work | orphaned_work | "Agent was idle with a hooked bead for over 30 minutes with no dispatch activity" |
| Bead timeout | timeout | "Bead exceeded its timeout of 60 minutes (in_progress for 72 minutes)" |
| Agent crash | agent_crashed | "Agent process exited with status: failed. Reason: out of memory" |
| No rig_id | missing_rig_id | "Merge request bead has no rig_id — cannot determine which rig to review" |
| No rig config | missing_rig_config | "Rig configuration not found in storage for rig xyz" |
| Container start fail | container_start_failed | "Failed to start refinery container: timeout waiting for server after 30s" |
| Admin force-fail | admin_force_fail | "Manually failed by admin" |

2. Fix completeReview() callers to use updateBeadStatus()

The three paths that use raw SQL (Town.do.ts:3284, 3289, 3406) should use updateBeadStatus() instead (see the sketch after this list), so that:

  • A status_changed bead_event is logged with the failure reason
  • Convoy progress is updated
  • The audit trail is complete
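
A sketch of the shape this change could take at one of the call sites (Town.do.ts:3289, the missing rig config path); the updateBeadStatus signature and the source value are assumptions:

```ts
// Sketch only: updateBeadStatus's real signature may differ.
async function failBeadMissingRigConfig(
  updateBeadStatus: (
    beadId: string,
    status: 'failed',
    opts: { failure_reason: FailureReason },
  ) => Promise<void>,
  beadId: string,
  rigId: string,
): Promise<void> {
  // Replaces the raw-SQL completeReview() write, so a status_changed
  // bead_event is logged and convoy progress is updated.
  await updateBeadStatus(beadId, 'failed', {
    failure_reason: {
      code: 'missing_rig_config',
      message: `Rig configuration not found in storage for rig ${rigId}`,
      source: 'refinery', // illustrative; could also be 'scheduler'
    },
  });
}
```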

3. Persist reason from agentCompleted()

In review-queue.ts:624, when a polecat agent completes with status: 'failed', write input.reason into the bead_event metadata. Currently only the refinery path uses the reason.
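
A sketch of that change; apart from input.reason, the field and helper names here are assumptions about the surrounding code:

```ts
// Sketch: AgentCompletedInput and logBeadEvent are assumed shapes.
type AgentCompletedInput = {
  beadId: string;
  status: 'completed' | 'failed';
  reason?: string; // provided by the container, currently discarded
};

async function persistPolecatFailure(
  input: AgentCompletedInput,
  logBeadEvent: (beadId: string, type: string, metadata: unknown) => Promise<void>,
): Promise<void> {
  if (input.status !== 'failed') return;
  await logBeadEvent(input.beadId, 'status_changed', {
    from: 'in_progress',
    to: 'failed',
    failure_reason: {
      code: 'agent_crashed',
      message: `Agent process exited with status: failed. Reason: ${input.reason ?? 'unknown'}`,
      details: input.reason, // persist instead of dropping
      source: 'container',
    },
  });
}
```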

4. Render failure reasons in the UI

The bead detail panel should display failure reasons prominently (see the sketch after this list):

  • A red banner at the top of a failed bead showing the human-readable message
  • Expandable details section with the failure code, source, and any additional context (error output, container logs)
  • Timeline entry in the bead events with the full failure context
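
A minimal component sketch, assuming the dashboard is React and reusing the FailureReason type from above; names and markup are illustrative:

```tsx
// Sketch assuming a React dashboard; component, prop, and class names are
// illustrative, and FailureReason is the type defined earlier in this issue.
import { useState } from 'react';

export function FailureBanner({ reason }: { reason: FailureReason }) {
  const [expanded, setExpanded] = useState(false);
  return (
    <div role="alert" className="failure-banner">
      <strong>{reason.message}</strong>
      <button onClick={() => setExpanded((v) => !v)}>
        {expanded ? 'Hide details' : 'Show details'}
      </button>
      {expanded && (
        <dl>
          <dt>Code</dt>
          <dd>{reason.code}</dd>
          <dt>Source</dt>
          <dd>{reason.source}</dd>
          {reason.details && (
            <>
              <dt>Details</dt>
              <dd>
                <pre>{reason.details}</pre>
              </dd>
            </>
          )}
        </dl>
      )}
    </div>
  );
}
```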

5. LLM-written failure summaries (stretch)

For complex failures (agent crashes with long error output, repeated dispatch failures with different errors each time), dispatch a short LLM call to summarize the failure in plain language. Store the summary alongside the raw failure data. This turns "agent_crashed: Error: Cannot find module '@/lib/auth' at Object. (/workspace/rigs/abc/...)" into "The agent crashed because it couldn't find the auth module — this usually means the dependency wasn't installed or the import path is wrong."

This could piggyback on the triage agent infrastructure (#442) or be a dedicated lightweight call.
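
A sketch of how the gating could work, in line with the note below that simple failures should skip the LLM call; the threshold and code list are assumptions:

```ts
// Assumed heuristic: skip codes whose message is already self-explanatory,
// and only summarize when there is substantial raw error output.
const SELF_EXPLANATORY = new Set(['admin_force_fail', 'missing_rig_id', 'missing_rig_config']);

function shouldSummarize(reason: FailureReason): boolean {
  if (SELF_EXPLANATORY.has(reason.code)) return false;
  // dispatch_exhausted falls through: repeated dispatches with varying
  // errors may still benefit from a summary if the details are long.
  return (reason.details?.length ?? 0) > 500;
}
```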

Acceptance Criteria

  • Every code path that transitions a bead to failed includes a structured failure_reason in the bead_event metadata
  • completeReview() raw SQL paths replaced with updateBeadStatus() for proper event logging
  • agentCompleted() persists the container's reason parameter for polecat failures (not just refinery)
  • Bead detail panel renders failure reason prominently (banner + expandable details)
  • Failure reasons visible in the activity feed timeline
  • Dispatch exhaustion captures the last dispatch error for context

Notes

  • No data migration needed — cloud Gastown hasn't deployed to production
  • The bead_events table already supports a metadata JSON column — the failure reason fits naturally there
  • The LLM summary (stretch goal) should only fire for failures with complex error output, not for simple cases like "dispatch exhausted" or "admin force-fail" where the message is already clear
  • This overlaps with #228 ([Gastown] PR 22: Observability) — structured failure reasons are a form of structured logging that both the admin panel (#897, Gastown User-Scoped Admin Panel — Inspect & Intervene via GastownUserDO → TownDO) and the user dashboard can consume
