Add diagnostics framework — squad doctor, blame attribution, troubleshooting #21

@diberry

Problem

When Squad misbehaves, it's extremely difficult to determine where the problem originates. There are at least 4 distinct failure layers, and symptoms often look identical across them:

| Layer | Example Failure | What It Looks Like |
|---|---|---|
| Squad governance (squad.agent.md) | Bad routing rule, missing spawn template field, stale ceremony config | Agent does wrong thing or skips steps |
| Copilot CLI / platform | Tool unavailable, background agent silent success, context overflow | Agent appears to do nothing, partial results |
| LLM behavior | Model ignores instructions, hallucinates file paths, skips archival step | Agent does work but incorrectly or incompletely |
| User .squad/ state | Bloated decisions.md, corrupt registry.json, stale history, missing charter | Agents slow, confused, or fail to spawn |

Why This Matters

Users (and the coordinator itself) currently have no systematic way to diagnose issues. When something goes wrong, the debugging process is:

  1. User notices bad behavior
  2. User guesses which layer caused it
  3. User manually inspects files, logs, or re-runs with different prompts
  4. Repeat until fixed (or give up)

This is especially painful because:

  • LLM non-determinism means the same input can succeed or fail across runs
  • Silent success bugs (~7-10% of background spawns) mean agents complete file writes but return no text
  • Context overflow after multi-agent fan-out causes server error retry loops
  • Bloated state files (decisions.md at 145KB) cause subtle degradation, not hard failures

Proposed: Diagnostic Framework

1. squad doctor CLI Command

A health check command that inspects .squad/ state and reports issues:

npx squad doctor

Checks:

  • .squad/team.md exists and has ## Members section with entries
  • All agents in roster have matching agents/{name}/charter.md
  • All agents in roster have matching agents/{name}/history.md
  • decisions.md size check (warn >20KB, error >50KB)
  • decisions-archive.md exists if decisions.md has been archived before
  • Each history.md size check (warn >12KB, error >30KB)
  • casting/registry.json is valid JSON with all roster agents
  • casting/policy.json exists and is valid
  • decisions/inbox/ is empty (non-empty = Scribe didn't merge)
  • config.json is valid JSON (if exists)
  • No orphaned agent directories (in agents/ but not in roster)
  • No missing agent directories (in roster but not in agents/)
  • routing.md references only agents that exist in roster
  • .gitattributes has union merge drivers for append-only files
  • Total .squad/ directory size (warn >1MB, error >5MB)
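The size and roster checks above could be sketched roughly as follows. This is a minimal illustration, not the Squad implementation: `Finding`, `check_size`, and `check_agent_dirs` are hypothetical names, and the severity assigned to orphaned versus missing directories is an assumption, since the list does not specify one.

```python
from dataclasses import dataclass

KB = 1024


@dataclass(frozen=True)
class Finding:
    check: str   # which check produced this finding
    level: str   # "ok", "warn", or "error"
    detail: str  # human-readable explanation


def check_size(check: str, size_bytes: int, warn_kb: int, error_kb: int) -> Finding:
    """Classify a file size against warn/error thresholds given in KB."""
    size_kb = size_bytes / KB
    if size_kb > error_kb:
        level = "error"
    elif size_kb > warn_kb:
        level = "warn"
    else:
        level = "ok"
    return Finding(check, level, f"{size_kb:.0f}KB")


def check_agent_dirs(roster: set[str], agent_dirs: set[str]) -> list[Finding]:
    """Compare roster names with directory names found under agents/."""
    findings = []
    # Assumption: an orphaned directory is a warning, a missing one is an error.
    for name in sorted(agent_dirs - roster):
        findings.append(Finding("agent-dirs", "warn", f"orphaned: agents/{name}"))
    for name in sorted(roster - agent_dirs):
        findings.append(Finding("agent-dirs", "error", f"missing: agents/{name}"))
    return findings or [Finding("agent-dirs", "ok", "all present")]
```

With the thresholds listed above, `check_size("decisions.md", 45 * KB, 20, 50)` would be classified as a warning, which matches the 45KB case in the sample output.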

Output:

🏥 Squad Doctor — Health Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✅ Team roster: 6 members found
✅ Agent directories: all present
⚠️  decisions.md: 45KB (recommended: <20KB) — run "squad nap" to archive
✅ History files: all under 12KB
❌ decisions/inbox: 3 unmerged files — Scribe may have failed
✅ Casting state: valid
✅ .gitattributes: merge drivers configured
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Result: 1 error, 1 warning — run "squad fix" to auto-repair
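One possible way to derive the final result line from the individual check levels is sketched below. The function names, the "all checks passed" wording, and the non-zero exit code on errors are assumptions for illustration; the proposal only shows the error/warning case.

```python
from collections import Counter


def _plural(count: int, word: str) -> str:
    """Return e.g. '1 error' or '2 errors'."""
    return f"{count} {word}" if count == 1 else f"{count} {word}s"


def summarize(levels: list[str]) -> str:
    """Build the result line from a list of 'ok' / 'warn' / 'error' levels."""
    counts = Counter(levels)
    errors, warnings = counts["error"], counts["warn"]
    if errors == 0 and warnings == 0:
        return "Result: all checks passed"  # assumed wording
    return f"Result: {_plural(errors, 'error')}, {_plural(warnings, 'warning')}"


def exit_code(levels: list[str]) -> int:
    """Assumption: exit non-zero only when at least one check is an error."""
    return 1 if "error" in levels else 0
```

Returning a non-zero exit code only on errors would let the command gate CI runs without failing on warnings, though that policy is a design choice the proposal leaves open.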

2. squad doctor --layer Focused Diagnostics

Target a specific layer:

npx squad doctor --layer state     # Check .squad/ file health
npx squad doctor --layer governance # Validate squad.agent.md structure
npx squad doctor --layer platform   # Check tool availability, model access
npx squad doctor --layer all        # Full diagnostic (default)
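The flag handling above might be sketched like this, using Python's `argparse` purely for illustration (the real CLI runs under `npx`, so its parser would differ). `parse_layers` and `LAYERS` are hypothetical names. The LLM layer from the Problem table has no `--layer` value in the proposal, so it is left out here as well.

```python
import argparse

# Layers that have a --layer value in the proposal.
LAYERS = ("state", "governance", "platform")


def parse_layers(argv: list[str]) -> list[str]:
    """Return the layers to check; 'all' (the default) expands to every layer."""
    parser = argparse.ArgumentParser(prog="squad doctor")
    parser.add_argument("--layer", choices=(*LAYERS, "all"), default="all")
    args = parser.parse_args(argv)
    return list(LAYERS) if args.layer == "all" else [args.layer]
```

For example, `parse_layers(["--layer", "state"])` selects only the state checks, while `parse_layers([])` selects all three.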

3. squad fix Auto-Repair

For issues that have safe, deterministic fixes:

npx squad fix

Auto-fixable:

  • Clear decisions/inbox/ by running the Scribe merge
  • Archive decisions.md entries >30 days old
  • Summarize history.md files >12KB
  • Create missing agent directories from roster
  • Initialize missing casting/ state from existing agents
  • Add missing .gitattributes merge drivers

NOT auto-fixable (report only):

  • Corrupt JSON files (need manual inspection)
  • Roster/routing mismatches (need human decision)
  • LLM behavioral issues (need prompt tuning)
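As an illustration of why archival counts as a safe, deterministic fix, the 30-day rule can be expressed as a pure partition over dated entries. This sketch assumes each decision entry carries a date; the actual entry format in decisions.md is not specified here, and `split_for_archive` is a hypothetical name.

```python
from datetime import date, timedelta


def split_for_archive(entries, today, max_age_days=30):
    """Partition (entry_date, text) pairs into (keep, archive).

    An entry is archived only when it is strictly older than max_age_days.
    """
    cutoff = today - timedelta(days=max_age_days)
    keep, archive = [], []
    for entry_date, text in entries:
        if entry_date < cutoff:
            archive.append((entry_date, text))
        else:
            keep.append((entry_date, text))
    return keep, archive
```

Because the split depends only on the entry dates and the current date, running it twice on the same day produces the same result, which is the property that makes it a reasonable candidate for auto-repair.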

4. Diagnostic Logging in Coordinator

Add structured diagnostic breadcrumbs to the orchestration log:

## Diagnostic Context
- **decisions.md size:** 23KB (⚠️ over 20KB threshold)
- **Spawn model:** claude-sonnet-4.5 (Layer 3: task-aware)
- **Agent charter:** 2.1KB ✅
- **Agent history:** 8.4KB ✅
- **Estimated spawn context:** ~18K tokens
- **Background mode:** yes
- **MCP tools available:** github ✅, azure ❌

This gives post-mortem investigators a snapshot of the state at spawn time.
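A breadcrumb block like the one above could be produced by a small formatter. The following is a sketch under assumptions: `describe_size` and `render_diagnostic_context` are hypothetical helpers, and sizes are printed with one decimal place for simplicity.

```python
KB = 1024


def describe_size(size_bytes: int, warn_kb: int) -> str:
    """Format a size and flag it when it exceeds the warning threshold."""
    size_kb = size_bytes / KB
    if size_kb > warn_kb:
        return f"{size_kb:.1f}KB (⚠️ over {warn_kb}KB threshold)"
    return f"{size_kb:.1f}KB ✅"


def render_diagnostic_context(fields: dict[str, str]) -> str:
    """Render label/value pairs as a markdown diagnostic block."""
    lines = ["## Diagnostic Context"]
    lines.extend(f"- **{label}:** {value}" for label, value in fields.items())
    return "\n".join(lines)
```

A coordinator might call `render_diagnostic_context({"decisions.md size": describe_size(23 * KB, 20), "Background mode": "yes"})` just before spawning and append the result to the orchestration log entry.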

5. Blame Attribution Heuristics

When something goes wrong, the coordinator (or a diagnostic agent) can use these signals:

| Symptom | Likely Layer | Diagnostic |
|---|---|---|
| Agent does nothing, no files written | Platform | Silent success bug. Check read_agent status. |
| Agent writes files but returns no text | Platform | Known ~7-10% bug. Files verify success. |
| Agent does wrong task or skips steps | Governance | Check charter.md, spawn prompt, routing.md |
| Agent hallucinates file paths | LLM | Model-specific. Try different model. |
| Agent is slow or runs out of context | State | Check decisions.md + history.md sizes |
| Scribe skips archival | LLM + State | Haiku model + large file = skipped steps |
| Agent reads stale decisions | State | Scribe merge didn't run; check inbox/ |
| Ceremony doesn't trigger | Governance | Check ceremonies.md conditions |
| Wrong agent gets the work | Governance | Check routing.md rules |
| read_agent returns server error | Platform | Context overflow after fan-out |
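The heuristics above amount to a lookup from an observed symptom to a likely layer and a next diagnostic step. A minimal sketch, assuming hypothetical symptom keys (the proposal does not define identifiers) and covering only a few rows:

```python
# Hypothetical symptom keys mapped to (likely layer, diagnostic hint).
HEURISTICS = {
    "files_written_no_text": ("Platform", "Known silent success bug; files verify success."),
    "wrong_task_or_skipped_steps": ("Governance", "Check charter.md, spawn prompt, routing.md."),
    "hallucinated_paths": ("LLM", "Model-specific; try a different model."),
    "slow_or_out_of_context": ("State", "Check decisions.md and history.md sizes."),
    "stale_decisions": ("State", "Scribe merge did not run; check inbox/."),
    "read_agent_server_error": ("Platform", "Context overflow after fan-out."),
}


def attribute(symptom: str) -> tuple[str, str]:
    """Return (layer, hint) for a symptom, or a fallback when none is recorded."""
    # Assumption: unknown symptoms fall back to a full diagnostic.
    return HEURISTICS.get(symptom, ("Unknown", "No heuristic recorded; run a full diagnostic."))
```

Keeping the mapping as data rather than branching logic would make it easy to extend as new failure patterns are observed, and the same table could back the user-facing troubleshooting guide.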

6. User-Facing Troubleshooting Guide

A doc page or skill that helps users self-diagnose:

## Quick Troubleshooting

### "My agent did nothing"
1. Run `squad doctor` — check for state issues
2. Check orchestration log for the spawn entry
3. Was it a background spawn? Check `read_agent` result
4. Try re-running with `--verbose` logging

### "Decisions keep growing"  
1. Run `squad doctor --layer state` — check decisions.md size
2. Run `squad nap` to archive old decisions
3. Check if Scribe is being spawned after work batches
4. Add a "Decisions Cleanup" ceremony for periodic maintenance

### "Agent gave wrong answer"
1. Check the agent's charter — does it cover this domain?
2. Check routing.md — was the right agent selected?
3. Check decisions.md — are there conflicting decisions?
4. Try a different model (bump to sonnet/opus for complex tasks)

Implementation Plan

  1. Phase 1: squad doctor command — state health checks (most value, least risk)
  2. Phase 2: Diagnostic logging — add context breadcrumbs to orchestration log entries
  3. Phase 3: squad fix command — auto-repair for safe issues
  4. Phase 4: Blame heuristics — teach the coordinator to self-diagnose
  5. Phase 5: Troubleshooting guide — user-facing docs

Labels

squad, enhancement, squad:lead

Applied labels: enhancement, go:needs-research, squad, squad:fido
