Problem
When Squad misbehaves, it's extremely difficult to determine where the problem originates. There are at least 4 distinct failure layers, and symptoms often look identical across them:
| Layer | Example Failure | What It Looks Like |
| --- | --- | --- |
| Squad governance (squad.agent.md) | Bad routing rule, missing spawn template field, stale ceremony config | Agent does wrong thing or skips steps |
| Copilot CLI / platform | Tool unavailable, background agent silent success, context overflow | Agent appears to do nothing, partial results |
| LLM behavior | Model ignores instructions, hallucinates file paths, skips archival step | Agent does work but incorrectly or incompletely |
| User .squad/ state | Bloated decisions.md, corrupt registry.json, stale history, missing charter | Agents slow, confused, or fail to spawn |
Why This Matters
Users (and the coordinator itself) currently have no systematic way to diagnose issues. When something goes wrong, the debugging process is:
- User notices bad behavior
- User guesses which layer caused it
- User manually inspects files, logs, or re-runs with different prompts
- Repeat until fixed (or give up)
This is especially painful because:
- LLM non-determinism means the same input can succeed or fail across runs
- Silent success bugs (~7-10% of background spawns) mean agents complete file writes but return no text
- Context overflow after multi-agent fan-out causes server error retry loops
- Bloated state files (decisions.md at 145KB) cause subtle degradation, not hard failures
Proposed: Diagnostic Framework
1. `squad doctor` CLI Command
A health check command that inspects .squad/ state and reports issues:
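Checks:
- `.squad/team.md` exists and has a `## Members` section with entries
- Each roster member has `agents/{name}/charter.md`
- Each roster member has `agents/{name}/history.md`
- `decisions.md` size check (warn >20KB, error >50KB)
- `decisions-archive.md` exists if decisions.md has been archived before
- `history.md` size check (warn >12KB, error >30KB)
- `casting/registry.json` is valid JSON with all roster agents
- `casting/policy.json` exists and is valid
- `decisions/inbox/` is empty (non-empty = Scribe didn't merge)
- `config.json` is valid JSON (if exists)
- No orphan agent directories (in `agents/` but not in roster)
- No roster members without a directory (in roster but not in `agents/`)
- `routing.md` references only agents that exist in roster
- `.gitattributes` has union merge drivers for append-only files
- `.squad/` directory size (warn >1MB, error >5MB)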
Output:
🏥 Squad Doctor — Health Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✅ Team roster: 6 members found
✅ Agent directories: all present
⚠️ decisions.md: 45KB (recommended: <20KB) — run "squad nap" to archive
✅ History files: all under 12KB
❌ decisions/inbox: 3 unmerged files — Scribe may have failed
✅ Casting state: valid
✅ .gitattributes: merge drivers configured
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Result: 1 error, 1 warning — run "squad fix" to auto-repair
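As a rough illustration of how two of these checks might work, the sketch below classifies file sizes against the proposed warn/error thresholds (decisions.md: 20KB/50KB, history.md: 12KB/30KB) and flags a non-empty `decisions/inbox/`. The names `Finding`, `classify_size`, and `check_inbox` are hypothetical and not part of any existing Squad API.

```python
from dataclasses import dataclass
from pathlib import Path

KB = 1024

# Thresholds from the proposal, as (warn_bytes, error_bytes).
SIZE_LIMITS = {
    "decisions.md": (20 * KB, 50 * KB),
    "history.md": (12 * KB, 30 * KB),
}


@dataclass
class Finding:
    level: str    # "ok", "warn", or "error"
    message: str


def classify_size(name: str, size_bytes: int) -> Finding:
    """Classify a file size against the warn/error thresholds for its name."""
    warn, error = SIZE_LIMITS[name]
    size_kb = size_bytes / KB
    if size_bytes > error:
        return Finding("error", f"{name}: {size_kb:.0f}KB (limit: {error // KB}KB)")
    if size_bytes > warn:
        return Finding("warn", f"{name}: {size_kb:.0f}KB (recommended: <{warn // KB}KB)")
    return Finding("ok", f"{name}: {size_kb:.0f}KB")


def check_inbox(squad_dir: Path) -> Finding:
    """A non-empty decisions/inbox/ suggests Scribe did not merge."""
    inbox = squad_dir / "decisions" / "inbox"
    pending = [p for p in inbox.iterdir() if p.is_file()] if inbox.is_dir() else []
    if pending:
        return Finding("error", f"decisions/inbox: {len(pending)} unmerged files")
    return Finding("ok", "decisions/inbox: empty")
```

With these thresholds, a 45KB decisions.md is a warning rather than an error, which matches the sample output above.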
2. `squad doctor --layer` Focused Diagnostics
Target a specific layer:
npx squad doctor --layer state # Check .squad/ file health
npx squad doctor --layer governance # Validate squad.agent.md structure
npx squad doctor --layer platform # Check tool availability, model access
npx squad doctor --layer all # Full diagnostic (default)
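One possible way to structure the `--layer` option is a registry that maps each layer name to a check function, with `all` running every registered layer. This is only a sketch: the function names are hypothetical and the check bodies are placeholders.

```python
from typing import Callable, Dict, List

# Each check returns a list of human-readable findings (empty means healthy).
Check = Callable[[], List[str]]


def check_state() -> List[str]:
    return []  # placeholder: .squad/ file health


def check_governance() -> List[str]:
    return []  # placeholder: squad.agent.md structure


def check_platform() -> List[str]:
    return []  # placeholder: tool availability, model access


LAYERS: Dict[str, Check] = {
    "state": check_state,
    "governance": check_governance,
    "platform": check_platform,
}


def run_doctor(layer: str = "all") -> Dict[str, List[str]]:
    """Run the checks for one layer, or for every layer when layer is 'all'."""
    if layer == "all":
        selected = list(LAYERS)
    elif layer in LAYERS:
        selected = [layer]
    else:
        raise ValueError(
            f"unknown layer {layer!r}; expected one of {sorted(LAYERS)} or 'all'"
        )
    return {name: LAYERS[name]() for name in selected}
```

Keeping the layers in a single registry means a new layer can be added without touching the dispatch logic.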
3. `squad fix` Auto-Repair
For issues that have safe, deterministic fixes:
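Auto-fixable:
- Drain `decisions/inbox/` by running Scribe merge
- Rebuild `casting/` state from existing agents
- Add `.gitattributes` merge drivers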
NOT auto-fixable (report only):
- Corrupt JSON files (need manual inspection)
- Roster/routing mismatches (need human decision)
- LLM behavioral issues (need prompt tuning)
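The split between auto-fixable and report-only issues could be enforced with an explicit allow-list, so that anything unrecognized is only reported. The sketch below assumes hypothetical issue identifiers; the proposal does not define a naming scheme for issues.

```python
from typing import Dict, Iterable, List

# Hypothetical issue identifiers for the safe, deterministic repairs.
AUTO_FIXABLE = {
    "inbox_unmerged",         # fix: run Scribe merge on decisions/inbox/
    "casting_state_missing",  # fix: rebuild casting/ state from existing agents
    "gitattributes_missing",  # fix: add merge drivers to .gitattributes
}


def plan_fixes(issues: Iterable[str]) -> Dict[str, List[str]]:
    """Split detected issues into those to repair and those to only report.

    Anything not explicitly listed as auto-fixable is reported, so an
    unrecognized issue never triggers an automatic change.
    """
    plan: Dict[str, List[str]] = {"fix": [], "report": []}
    for issue in issues:
        plan["fix" if issue in AUTO_FIXABLE else "report"].append(issue)
    return plan
```

Defaulting to report-only keeps `squad fix` conservative: corrupt JSON, roster/routing mismatches, and LLM behavioral issues all fall through to the report list.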
4. Diagnostic Logging in Coordinator
Add structured diagnostic breadcrumbs to the orchestration log:
## Diagnostic Context
- **decisions.md size:** 23KB (⚠️ over 20KB threshold)
- **Spawn model:** claude-sonnet-4.5 (Layer 3: task-aware)
- **Agent charter:** 2.1KB ✅
- **Agent history:** 8.4KB ✅
- **Estimated spawn context:** ~18K tokens
- **Background mode:** yes
- **MCP tools available:** github ✅, azure ❌
This gives post-mortem investigators a snapshot of the state at spawn time.
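A breadcrumb block like the one above could be rendered from a small record captured at spawn time. The following is a minimal sketch covering a subset of the fields; `SpawnContext` and `render_diagnostic_context` are hypothetical names, and the 20KB and 12KB thresholds are the warn levels proposed for decisions.md and history.md.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SpawnContext:
    decisions_kb: float
    model: str
    history_kb: float
    est_tokens: int
    background: bool
    mcp_tools: Dict[str, bool] = field(default_factory=dict)


def _size_line(label: str, kb: float, warn_kb: int) -> str:
    """Format one size entry, flagging it when it exceeds the warn threshold."""
    if kb > warn_kb:
        return f"- **{label}:** {kb:g}KB (⚠️ over {warn_kb}KB threshold)"
    return f"- **{label}:** {kb:g}KB ✅"


def render_diagnostic_context(ctx: SpawnContext) -> str:
    """Render the spawn-time snapshot as a markdown section for the log."""
    tools = ", ".join(
        f"{name} {'✅' if ok else '❌'}" for name, ok in ctx.mcp_tools.items()
    )
    lines = [
        "## Diagnostic Context",
        _size_line("decisions.md size", ctx.decisions_kb, 20),
        f"- **Spawn model:** {ctx.model}",
        _size_line("Agent history", ctx.history_kb, 12),
        f"- **Estimated spawn context:** ~{round(ctx.est_tokens / 1000)}K tokens",
        f"- **Background mode:** {'yes' if ctx.background else 'no'}",
        f"- **MCP tools available:** {tools}",
    ]
    return "\n".join(lines)
```

Because the snapshot is plain data, the same record could also be written as JSON if the orchestration log ever needs to be machine-readable.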
5. Blame Attribution Heuristics
When something goes wrong, the coordinator (or a diagnostic agent) can use these signals:
| Symptom | Likely Layer | Diagnostic |
| --- | --- | --- |
| Agent does nothing, no files written | Platform | Silent success bug. Check read_agent status. |
| Agent writes files but returns no text | Platform | Known ~7-10% bug. Files verify success. |
| Agent does wrong task or skips steps | Governance | Check charter.md, spawn prompt, routing.md |
| Agent hallucinates file paths | LLM | Model-specific. Try different model. |
| Agent is slow or runs out of context | State | Check decisions.md + history.md sizes |
| Scribe skips archival | LLM + State | Haiku model + large file = skipped steps |
| Agent reads stale decisions | State | Scribe merge didn't run; check inbox/ |
| Ceremony doesn't trigger | Governance | Check ceremonies.md conditions |
| Wrong agent gets the work | Governance | Check routing.md rules |
| read_agent returns server error | Platform | Context overflow after fan-out |
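These heuristics lend themselves to a simple lookup table that the coordinator or a diagnostic agent could consult. The sketch below encodes a subset of the rows; the symptom keys are hypothetical identifiers, and how a symptom would actually be detected is left open.

```python
from typing import Dict, NamedTuple


class Attribution(NamedTuple):
    layer: str
    diagnostic: str


# Subset of the symptom table, keyed by hypothetical symptom identifiers.
HEURISTICS: Dict[str, Attribution] = {
    "no_output_no_files": Attribution(
        "Platform", "Silent success bug. Check read_agent status."),
    "files_but_no_text": Attribution(
        "Platform", "Known ~7-10% bug. Files verify success."),
    "wrong_task_or_skipped_steps": Attribution(
        "Governance", "Check charter.md, spawn prompt, routing.md"),
    "hallucinated_paths": Attribution(
        "LLM", "Model-specific. Try different model."),
    "slow_or_out_of_context": Attribution(
        "State", "Check decisions.md + history.md sizes"),
    "stale_decisions": Attribution(
        "State", "Scribe merge didn't run; check inbox/"),
    "read_agent_server_error": Attribution(
        "Platform", "Context overflow after fan-out"),
}

UNKNOWN = Attribution(
    "Unknown", "No heuristic matched; inspect the orchestration log manually.")


def attribute(symptom: str) -> Attribution:
    """Return the likely failure layer and a suggested diagnostic step."""
    return HEURISTICS.get(symptom, UNKNOWN)
```

An explicit `Unknown` fallback avoids guessing a layer when no heuristic applies.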
6. User-Facing Troubleshooting Guide
A doc page or skill that helps users self-diagnose:
## Quick Troubleshooting
### "My agent did nothing"
1. Run `squad doctor` — check for state issues
2. Check orchestration log for the spawn entry
3. Was it a background spawn? Check `read_agent` result
4. Try re-running with `--verbose` logging
### "Decisions keep growing"
1. Run `squad doctor --layer state` — check decisions.md size
2. Run `squad nap` to archive old decisions
3. Check if Scribe is being spawned after work batches
4. Add a "Decisions Cleanup" ceremony for periodic maintenance
### "Agent gave wrong answer"
1. Check the agent's charter — does it cover this domain?
2. Check routing.md — was the right agent selected?
3. Check decisions.md — are there conflicting decisions?
4. Try a different model (bump to sonnet/opus for complex tasks)
Implementation Plan
- Phase 1: `squad doctor` command (state health checks; most value, least risk)
- Phase 2: Diagnostic logging (add context breadcrumbs to orchestration log entries)
- Phase 3: `squad fix` command (auto-repair for safe issues)
- Phase 4: Blame heuristics (teach the coordinator to self-diagnose)
- Phase 5: Troubleshooting guide (user-facing docs)
Labels
squad, enhancement, squad:lead