Add diagnostics framework — squad doctor, blame attribution, troubleshooting

## Problem

When Squad misbehaves, it's extremely difficult to determine **where** the problem originates. There are at least four distinct failure layers, and their symptoms often look identical:

| Layer | Example Failure | What It Looks Like |
|-------|----------------|-------------------|
| **Squad governance** (`squad.agent.md`) | Bad routing rule, missing spawn template field, stale ceremony config | Agent does wrong thing or skips steps |
| **Copilot CLI / platform** | Tool unavailable, background agent silent success, context overflow | Agent appears to do nothing, partial results |
| **LLM behavior** | Model ignores instructions, hallucinates file paths, skips archival step | Agent does work but incorrectly or incompletely |
| **User `.squad/` state** | Bloated `decisions.md`, corrupt `registry.json`, stale history, missing charter | Agents slow, confused, or fail to spawn |

### Why This Matters

Users (and the coordinator itself) currently have **no systematic way** to diagnose issues. When something goes wrong, the debugging process is:
1. User notices bad behavior
2. User guesses which layer caused it
3. User manually inspects files, logs, or re-runs with different prompts
4. Repeat until fixed (or give up)

This is especially painful because:
- **LLM non-determinism** means the same input can succeed or fail across runs
- **Silent success bugs** (~7-10% of background spawns) mean agents complete their file writes but return no text
- **Context overflow** after multi-agent fan-out causes server-error retry loops
- **Bloated state files** (e.g., `decisions.md` at 145KB) cause subtle degradation rather than hard failures

## Proposed: Diagnostic Framework

### 1. `squad doctor` CLI Command

A health check command that inspects `.squad/` state and reports issues:

```bash
npx squad doctor
```

**Checks:**
- [ ] `.squad/team.md` exists and has `## Members` section with entries
- [ ] All agents in roster have matching `agents/{name}/charter.md`
- [ ] All agents in roster have matching `agents/{name}/history.md`
- [ ] `decisions.md` size check (warn >20KB, error >50KB)
- [ ] `decisions-archive.md` exists if decisions.md has been archived before
- [ ] Each `history.md` size check (warn >12KB, error >30KB)
- [ ] `casting/registry.json` is valid JSON with all roster agents
- [ ] `casting/policy.json` exists and is valid
- [ ] `decisions/inbox/` is empty (non-empty = Scribe didn't merge)
- [ ] `config.json` is valid JSON (if exists)
- [ ] No orphaned agent directories (in `agents/` but not in roster)
- [ ] No missing agent directories (in roster but not in `agents/`)
- [ ] `routing.md` references only agents that exist in roster
- [ ] `.gitattributes` has union merge drivers for append-only files
- [ ] Total `.squad/` directory size (warn >1MB, error >5MB)
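
A minimal sketch of how the size and roster checks above might work. It is written in Python purely for illustration and is not Squad's implementation: the function names `classify_size`, `diff_roster`, and `check_decisions` are hypothetical, and "KB" is assumed to mean 1024 bytes.

```python
from __future__ import annotations

from pathlib import Path

KB = 1024  # assumption: the thresholds in the checklist are in units of 1024 bytes


def classify_size(size_bytes: int, warn_kb: int, error_kb: int) -> str:
    """Return "ok", "warn", or "error" for a size against two thresholds."""
    if size_bytes > error_kb * KB:
        return "error"
    if size_bytes > warn_kb * KB:
        return "warn"
    return "ok"


def diff_roster(roster: set[str], agent_dirs: set[str]) -> dict[str, list[str]]:
    """Compare roster names with the directory names found under agents/."""
    return {
        "orphaned": sorted(agent_dirs - roster),  # directory exists, not in roster
        "missing": sorted(roster - agent_dirs),   # in roster, no directory
    }


def check_decisions(squad_dir: Path) -> str:
    """Apply the decisions.md thresholds (warn >20KB, error >50KB)."""
    path = squad_dir / "decisions.md"
    if not path.exists():
        return "ok"  # assumption: a missing file is not a size problem
    return classify_size(path.stat().st_size, warn_kb=20, error_kb=50)
```

The same `classify_size` helper would cover the `history.md` and total-directory checks by passing their own thresholds.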

**Output:**
```
🏥 Squad Doctor — Health Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✅ Team roster: 6 members found
✅ Agent directories: all present
⚠️  decisions.md: 45KB (recommended: <20KB) — run "squad nap" to archive
✅ History files: all under 12KB
❌ decisions/inbox: 3 unmerged files — Scribe may have failed
✅ Casting state: valid
✅ .gitattributes: merge drivers configured
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Result: 1 error, 1 warning — run "squad fix" to auto-repair
```
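
One possible way to produce a report like the one above from a list of check results, sketched in Python. The `(status, message)` tuple shape, the `render_report` and `exit_code` names, and the non-zero exit code on errors are all assumptions made for illustration.

```python
from __future__ import annotations

from collections import Counter

ICONS = {"ok": "✅", "warn": "⚠️ ", "error": "❌"}


def render_report(findings: list[tuple[str, str]]) -> str:
    """Render (status, message) pairs followed by a one-line summary."""
    lines = [f"{ICONS[status]} {message}" for status, message in findings]
    counts = Counter(status for status, _ in findings)
    lines.append(f"Result: {counts['error']} error(s), {counts['warn']} warning(s)")
    return "\n".join(lines)


def exit_code(findings: list[tuple[str, str]]) -> int:
    """Return 1 when any check reported an error (assumed convention for CI use)."""
    return 1 if any(status == "error" for status, _ in findings) else 0
```

Returning a non-zero exit code on errors would let the command gate a CI job, although the proposal does not specify that behavior.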

### 2. `squad doctor --layer` Focused Diagnostics

Target a specific layer:

```bash
npx squad doctor --layer state       # Check .squad/ file health
npx squad doctor --layer governance  # Validate squad.agent.md structure
npx squad doctor --layer platform    # Check tool availability, model access
npx squad doctor --layer all         # Full diagnostic (default)
```
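
The flag handling could look roughly like this. This is a Python `argparse` sketch for illustration only; the `parse_layers` helper is hypothetical, and only the layer names and the `all` default come from the commands above.

```python
from __future__ import annotations

import argparse

LAYERS = ("state", "governance", "platform")


def parse_layers(argv: list[str]) -> list[str]:
    """Resolve --layer into the concrete list of layers to check."""
    parser = argparse.ArgumentParser(prog="squad doctor")
    parser.add_argument("--layer", choices=(*LAYERS, "all"), default="all")
    args = parser.parse_args(argv)
    # "all" expands to every checkable layer; otherwise run just the one named.
    return list(LAYERS) if args.layer == "all" else [args.layer]
```

With `choices` set, an unrecognized layer name makes `argparse` print a usage error and exit, so a typo never silently skips the checks.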

### 3. `squad fix` Auto-Repair

For issues that have safe, deterministic fixes:

```bash
npx squad fix
```

**Auto-fixable:**
- [ ] Clear out `decisions/inbox/` by running the Scribe merge
- [ ] Archive decisions.md entries >30 days old
- [ ] Summarize history.md files >12KB
- [ ] Create missing agent directories from roster
- [ ] Initialize missing `casting/` state from existing agents
- [ ] Add missing `.gitattributes` merge drivers

**NOT auto-fixable (report only):**
- Corrupt JSON files (need manual inspection)
- Roster/routing mismatches (need human decision)
- LLM behavioral issues (need prompt tuning)
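
A sketch of how `squad fix` might separate the two categories above. The issue codes and the `plan_fixes` function are hypothetical names invented for this example; the proposal does not define an issue taxonomy.

```python
from __future__ import annotations

# Hypothetical issue codes for the auto-fixable items listed above.
AUTO_FIXABLE = {
    "inbox_not_empty",
    "decisions_old_entries",
    "history_too_large",
    "missing_agent_dir",
    "missing_casting_state",
    "missing_gitattributes_drivers",
}


def plan_fixes(issue_codes: list[str]) -> tuple[list[str], list[str]]:
    """Split detected issues into (auto-fix, report-only), preserving order."""
    to_fix = [code for code in issue_codes if code in AUTO_FIXABLE]
    report_only = [code for code in issue_codes if code not in AUTO_FIXABLE]
    return to_fix, report_only
```

Using an allowlist means any issue the tool does not recognize falls into the report-only group, which seems the safer default for a command that modifies files.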

### 4. Diagnostic Logging in Coordinator

Add structured diagnostic breadcrumbs to the orchestration log:

```markdown
## Diagnostic Context
- **decisions.md size:** 23KB (⚠️ over 20KB threshold)
- **Spawn model:** claude-sonnet-4.5 (Layer 3: task-aware)
- **Agent charter:** 2.1KB ✅
- **Agent history:** 8.4KB ✅
- **Estimated spawn context:** ~18K tokens
- **Background mode:** yes
- **MCP tools available:** github ✅, azure ❌
```

This gives post-mortem investigators a snapshot of the state at spawn time.
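
As a rough illustration, the coordinator could assemble such a block from values it already has at spawn time. The helper names `format_size` and `render_diagnostic_context` below are hypothetical, and the sketch is in Python only for readability.

```python
from __future__ import annotations


def format_size(size_kb: float, warn_kb: float | None = None) -> str:
    """Format a size in KB, flagging it when it exceeds warn_kb."""
    text = f"{size_kb:g}KB"
    if warn_kb is not None and size_kb > warn_kb:
        return f"{text} (⚠️ over {warn_kb:g}KB threshold)"
    return f"{text} ✅"


def render_diagnostic_context(fields: dict[str, str]) -> str:
    """Render label/value pairs as a "Diagnostic Context" markdown section."""
    lines = ["## Diagnostic Context"]
    lines += [f"- **{label}:** {value}" for label, value in fields.items()]
    return "\n".join(lines)
```

For example, `render_diagnostic_context({"decisions.md size": format_size(23, 20), "Background mode": "yes"})` yields a section in the same shape as the sample above.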

### 5. Blame Attribution Heuristics

When something goes wrong, the coordinator (or a diagnostic agent) can use these signals:

| Symptom | Likely Layer | Diagnostic |
|---------|-------------|------------|
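| Agent does nothing, no files written | **Platform** | Possibly the silent success bug. Check `read_agent` status and confirm on disk that no files were written. |
| Agent writes files but returns no text | **Platform** | Known silent success bug (~7-10% of background spawns). The written files confirm the work succeeded. |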
| Agent does wrong task or skips steps | **Governance** | Check charter.md, spawn prompt, routing.md |
| Agent hallucinates file paths | **LLM** | Model-specific. Try different model. |
| Agent is slow or runs out of context | **State** | Check decisions.md + history.md sizes |
| Scribe skips archival | **LLM + State** | Haiku model + large file = skipped steps |
| Agent reads stale decisions | **State** | Scribe merge didn't run; check inbox/ |
| Ceremony doesn't trigger | **Governance** | Check ceremonies.md conditions |
| Wrong agent gets the work | **Governance** | Check routing.md rules |
| `read_agent` returns server error | **Platform** | Context overflow after fan-out |
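
The table above could be encoded as a simple lookup that a diagnostic agent consults. This Python sketch covers only some rows; the symptom keys, the `Attribution` type, and the `attribute` function are hypothetical, and a real coordinator would first have to classify free-form symptoms into such keys.

```python
from __future__ import annotations

from typing import NamedTuple


class Attribution(NamedTuple):
    layer: str
    diagnostic: str


# Hypothetical symptom keys mapped to rows of the heuristics table.
HEURISTICS = {
    "no_output_no_files": Attribution("Platform", "Check read_agent status."),
    "files_but_no_text": Attribution("Platform", "Known silent success bug; verify the files."),
    "wrong_task": Attribution("Governance", "Check charter.md, spawn prompt, routing.md."),
    "hallucinated_paths": Attribution("LLM", "Model-specific; try a different model."),
    "slow_or_out_of_context": Attribution("State", "Check decisions.md and history.md sizes."),
    "stale_decisions": Attribution("State", "Scribe merge didn't run; check inbox/."),
    "wrong_agent": Attribution("Governance", "Check routing.md rules."),
    "read_agent_server_error": Attribution("Platform", "Context overflow after fan-out."),
}


def attribute(symptom: str) -> Attribution:
    """Look up the likely layer; unknown symptoms fall back to a full check."""
    return HEURISTICS.get(symptom, Attribution("Unknown", "Run squad doctor."))
```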

### 6. User-Facing Troubleshooting Guide

A doc page or skill that helps users self-diagnose:

```markdown
## Quick Troubleshooting

### "My agent did nothing"
1. Run `squad doctor` — check for state issues
2. Check orchestration log for the spawn entry
3. Was it a background spawn? Check `read_agent` result
4. Try re-running with `--verbose` logging

### "Decisions keep growing"
1. Run `squad doctor --layer state` — check decisions.md size
2. Run `squad nap` to archive old decisions
3. Check if Scribe is being spawned after work batches
4. Add a "Decisions Cleanup" ceremony for periodic maintenance

### "Agent gave wrong answer"
1. Check the agent's charter — does it cover this domain?
2. Check routing.md — was the right agent selected?
3. Check decisions.md — are there conflicting decisions?
4. Try a different model (bump to sonnet/opus for complex tasks)
```

## Implementation Plan

1. **Phase 1: `squad doctor` command** — state health checks (most value, least risk)
2. **Phase 2: Diagnostic logging** — add context breadcrumbs to orchestration log entries
3. **Phase 3: `squad fix` command** — auto-repair for safe issues
4. **Phase 4: Blame heuristics** — teach the coordinator to self-diagnose
5. **Phase 5: Troubleshooting guide** — user-facing docs

## Labels

`squad`, `enhancement`, `squad:lead`
