
feat(skills): add error-recovery skill -- standard agent failure recovery patterns#623

Merged
bradygaster merged 5 commits into bradygaster:dev from tamirdresher:feat/error-recovery-skill
Mar 28, 2026


@tamirdresher
Collaborator

Summary

Adds the `error-recovery` skill: 5 standard recovery patterns every squad agent should apply when something fails.

Battle-tested in a 13-agent production deployment across 6 months of daily use.

What it does

Defines a clear playbook so agents adapt instead of just reporting failures:

| Pattern | When |
| --- | --- |
| Retry with Backoff | Rate limits, API timeouts, transient errors |
| Fallback Alternatives | Primary tool/approach unavailable |
| Diagnose-and-Fix | Build, test, lint failures (max 3 cycles) |
| Escalate with Context | Recovery exhausted, human judgment required |
| Graceful Degradation | Non-critical step fails, task can still deliver value |
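As an illustration of the Retry with Backoff row, here is a minimal TypeScript sketch. It is not taken from the skill file: the helper names (`backoffDelayMs`, `retryWithBackoff`), the 500 ms base delay, the 8 s cap, and the default of 3 attempts are all assumptions chosen for the example.

```typescript
// Hypothetical helpers for illustration; none of these names come from SKILL.md.
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

/** Exponential delay for a 0-based attempt index, capped at maxMs (assumed values). */
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

/** Retries `operation` on transient errors, waiting longer between attempts. */
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  isTransient: (error: unknown) => boolean,
  maxAttempts = 3,
  sleep: Sleep = realSleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Non-transient failures belong to another pattern, so rethrow at once.
      if (!isTransient(error)) throw error;
      // No point waiting after the final attempt.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  // All attempts failed with transient errors; surface the last one.
  throw lastError;
}
```

The `sleep` parameter is injectable so the loop can be exercised without real delays; a caller would normally leave it at the default.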

Also includes a selection guide mapping failure types to the right pattern.
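A selection guide of this kind could be modeled as a simple lookup. The sketch below is an assumption about shape only: the `FailureKind` labels paraphrase the "When" column above, and the rule that an exhausted Diagnose-and-Fix budget turns into an escalation is inferred from the "max 3 cycles" and "Recovery exhausted" entries rather than quoted from the skill.

```typescript
type RecoveryPattern = "retry" | "fallback" | "diagnose" | "escalate" | "degrade";

// Hypothetical failure categories, paraphrased from the pattern table.
type FailureKind =
  | "rate-limit"
  | "api-timeout"
  | "transient-error"
  | "tool-unavailable"
  | "build-failure"
  | "test-failure"
  | "lint-failure"
  | "needs-human-judgment"
  | "non-critical-step-failed";

const selectionGuide: Record<FailureKind, RecoveryPattern> = {
  "rate-limit": "retry",
  "api-timeout": "retry",
  "transient-error": "retry",
  "tool-unavailable": "fallback",
  "build-failure": "diagnose",
  "test-failure": "diagnose",
  "lint-failure": "diagnose",
  "needs-human-judgment": "escalate",
  "non-critical-step-failed": "degrade",
};

// The table caps Diagnose-and-Fix at 3 cycles.
const MAX_DIAGNOSE_CYCLES = 3;

/** Picks a pattern for a failure, escalating once the diagnose budget is spent. */
function selectPattern(kind: FailureKind, diagnoseCyclesUsed = 0): RecoveryPattern {
  const pattern = selectionGuide[kind];
  if (pattern === "diagnose" && diagnoseCyclesUsed >= MAX_DIAGNOSE_CYCLES) {
    // Assumed rule: exhausted recovery becomes "Escalate with Context".
    return "escalate";
  }
  return pattern;
}
```

For example, `selectPattern("build-failure", 1)` yields `"diagnose"`, while the same failure after three spent cycles yields `"escalate"`.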

Design

  • Provider-agnostic — no external services required
  • Meant to be referenced in each agent's charter `## Error Recovery` section
  • Complements `iterative-retrieval` (cycle caps) and `session-recovery` (restart scenarios)

Zero breaking changes.

bradygaster and others added 2 commits March 23, 2026 10:08
chore: promote dev to main (docs dark mode fix)
…ery patterns

5 recovery patterns for any squad agent:
- Retry with Backoff (transient failures, rate limits)
- Fallback Alternatives (tool/approach fails)
- Diagnose-and-Fix (build/test/lint errors)
- Escalate with Context (exhausted recovery)
- Graceful Degradation (non-critical step fails)

Includes selection guide mapping failure types to patterns.
Battle-tested in production multi-agent deployment.
Zero breaking changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bradygaster
Owner

🔍 Squad Team Review

Reviewed by: Procedures (Prompt Architect) — Pass 3 of consensus ceremony
Action: Hold — proposal-first policy applies
Team Consensus: 5/5 AGREE

Review Summary

This PR adds error-recovery mechanisms that enable skills to gracefully handle and recover from failures.

Process Note

Per our team's documented proposal-first policy (.squad/decisions.md): "Meaningful changes require a proposal in docs/proposals/ before execution." This is a feature-class change that needs a proposal document before merge.

What's Needed

  1. Create a proposal in docs/proposals/ describing the skill's purpose, integration points, and expected behavior
  2. Retarget this PR from main to dev (our staging branch)
  3. Once proposal is reviewed and approved, this PR can proceed

Thank you for the contribution! The implementation looks solid — we just need the process alignment. 🙏

⏸️ Action: HOLD — awaiting proposal + retarget to dev

Owner

@bradygaster bradygaster left a comment

FIDO Quality Review — PR #623: error-recovery skill

Verdict: ⚠️ NEEDS CHANGES

Blocking Issue

❌ Wrong package names in changeset: The .changeset/error-recovery-skill.md uses squad-cli and squad-sdk instead of @bradygaster/squad-cli and @bradygaster/squad-sdk. This will cause the changeset to fail — it won't match any workspace packages. All other existing changesets in this repo use the scoped @bradygaster/ prefix.

Fix: Change the changeset frontmatter to:
```yaml
---
"@bradygaster/squad-cli": minor
"@bradygaster/squad-sdk": minor
---
```
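A mismatch like this can be caught before review with a small check. The following is a rough sketch, not part of the repo's tooling: `unknownChangesetPackages` is a hypothetical helper, the set of workspace package names is hard-coded from this comment, and the parser only handles the simple `"name": bump` lines shown above.

```typescript
// Assumed workspace package names, copied from the review comment.
const workspacePackages = new Set(["@bradygaster/squad-cli", "@bradygaster/squad-sdk"]);

/** Returns package names in a changeset's frontmatter that are not in `known`. */
function unknownChangesetPackages(changeset: string, known: Set<string>): string[] {
  // Frontmatter is the text between the first pair of '---' lines.
  const frontmatter = /^---\r?\n([\s\S]*?)\r?\n---/.exec(changeset);
  if (!frontmatter) return [];
  const unknown: string[] = [];
  for (const line of frontmatter[1].split(/\r?\n/)) {
    // Each entry looks like: "package-name": bump (quotes optional).
    const entry = /^\s*["']?([^"']+?)["']?\s*:\s*\w+\s*$/.exec(line);
    if (entry && !known.has(entry[1])) unknown.push(entry[1]);
  }
  return unknown;
}

const original = '---\n"squad-cli": minor\n"squad-sdk": minor\n---\n\nAdd error-recovery skill\n';
console.log(unknownChangesetPackages(original, workspacePackages));
```

An empty result means every name in the changeset matches a known workspace package.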

Non-Blocking Notes

  • Target branch: Targets main — other skill PRs in this batch target main too, but #625 targets dev. Consider aligning to dev.
  • Frontmatter deviations: Has license: MIT field (not in template) and missing source field. Non-breaking but inconsistent with other skills.
  • Content quality: Excellent. The 5 recovery patterns (retry, fallback, diagnose, escalate, degrade) are practical and well-structured. The selection guide table is a nice touch.

Once the changeset package names are fixed, this is ready to merge.

chrislomonico pushed a commit to clomonico/squad that referenced this pull request Mar 26, 2026
…ster#605, bradygaster#606, bradygaster#607) (bradygaster#623)

* test: add first-run gating tests (issue bradygaster#607)

25 tests across 6 categories enforcing Init Mode gating:
- Banner renders exactly once (not duplicated)
- First-run hint appears on initial session only
- Console output contains no raw Node warnings
- 'Your squad is assembled' requires non-empty roster
- Session-scoped Static keys prevent cross-session collisions
- Terminal clear runs before Ink render

Closes bradygaster#607

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: standardize elapsed time placement inline after agent messages (bradygaster#605)

Elapsed time annotations were inconsistent — the Static scrollback block in
App.tsx rendered completed agent messages without any duration, while
MessageStream had duration code that was dead (receives messages=[]).

Changes:
- Export formatDuration from MessageStream.tsx for shared use
- Add inline (duration) display to App.tsx Static block for agent messages
- Duration computed by finding preceding user message timestamp
- Format: inline dimColor after message content, e.g. (4.4s)

Closes bradygaster#605

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs: update cheritto history and decision for bradygaster#605

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(shell): remove duplicate prompt hint that conflicts with header (bradygaster#606)

The InputPrompt placeholder showed 'Type @agent or /help' for the first 5
messages, duplicating guidance already present in the header banner. This
created visual noise with two competing prompt elements.

Consolidated to two hint tiers: tab/history (messageCount < 10) and
command hints (>= 10). The header banner remains the single source for
@agent and /help guidance.

Closes bradygaster#606

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs: update kovash history and decision for bradygaster#606 prompt consolidation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tamirdresher tamirdresher changed the base branch from main to dev March 26, 2026 08:16
@tamirdresher
Collaborator Author

Fixed per FIDO review: corrected changeset package names from squad-cli/squad-sdk to @bradygaster/squad-cli/@bradygaster/squad-sdk. Also retargeting to dev branch.

@bradygaster
Owner

Thank you for this work, Tamir! 🚀

To get this PR ready for team review and merge, we need two quick adjustments:

  1. Retarget to `dev`: All feature work merges to the `dev` branch first (not `main`). You can change the base branch in the PR settings.

  2. Add a proposal document: Feature-class skills need a brief proposal in `docs/proposals/` before merging. Include:

    • What the skill does
    • Why it's needed
    • How it fits into our existing skill architecture

Once those are in place, the team will take another look and we can get this merged!

Thanks for your continued contributions to Squad 👍

@tamirdresher tamirdresher force-pushed the feat/error-recovery-skill branch from f8b8b92 to 868810b Compare March 26, 2026 08:29
@tamirdresher
Collaborator Author

Fixed per FIDO review: corrected changeset package names from squad-cli/squad-sdk to @bradygaster/squad-cli/@bradygaster/squad-sdk. Also retargeted this PR to the dev branch.

@tamirdresher
Collaborator Author

Addressed both review items: (1) Fixed changeset package names to @bradygaster/squad-cli and @bradygaster/squad-sdk. (2) Added docs/proposals/error-recovery.md proposal document. PR already retargeted to dev.

Copilot AI review requested due to automatic review settings March 28, 2026 13:49
Contributor

Copilot AI left a comment

Pull request overview

Adds a new “error-recovery” skill to the Squad skill template library, defining standard patterns agents should apply when tasks/tools fail, plus a supporting proposal doc and release changeset.

Changes:

  • Add error-recovery skill template to both SDK and CLI template trees.
  • Add a proposal document describing the motivation/approach for the new skill.
  • Add a changeset bumping @bradygaster/squad-cli and @bradygaster/squad-sdk minor versions.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| packages/squad-sdk/templates/skills/error-recovery/SKILL.md | New SDK-distributed skill content for error recovery patterns. |
| packages/squad-cli/templates/skills/error-recovery/SKILL.md | New CLI-distributed skill content for error recovery patterns. |
| docs/proposals/error-recovery.md | Proposal write-up for the new skill (motivation, approach, risks). |
| .changeset/error-recovery-skill.md | Changeset to release the new skill in CLI/SDK packages. |

description: "Standard recovery patterns for all squad agents. When something fails, adapt — don't just report the failure."
domain: "reliability, agent-coordination"
confidence: "high"
license: MIT

Copilot AI Mar 28, 2026

Skill frontmatter deviates from the convention used by other skills: it includes license: MIT but is missing the source: field that’s present across existing skill templates. Consider adding a source: entry (e.g., earned/manual/extracted) and dropping license if it’s not consumed anywhere, to keep metadata consistent and avoid unused fields.

Suggested change
license: MIT
source: manual
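A consistency check along these lines could flag both deviations automatically. This is only a sketch: `checkSkillFrontmatter` is a hypothetical helper, and the `allowed` list used in the example is an assumption built from the fields visible in this diff plus `source`, not the repo's actual template.

```typescript
interface FrontmatterReport {
  missing: string[]; // required keys that are absent
  unexpected: string[]; // keys not in the allowed list
}

/** Compares top-level `key: value` lines against required and allowed key lists. */
function checkSkillFrontmatter(
  frontmatter: string,
  required: string[],
  allowed: string[],
): FrontmatterReport {
  const keys = frontmatter
    .split(/\r?\n/)
    .map((line) => /^([A-Za-z_][\w-]*)\s*:/.exec(line))
    .filter((match): match is RegExpExecArray => match !== null)
    .map((match) => match[1]);
  return {
    missing: required.filter((key) => !keys.includes(key)),
    unexpected: keys.filter((key) => !allowed.includes(key)),
  };
}

// The four frontmatter lines shown in the diff context above.
const skillFrontmatter = [
  'description: "Standard recovery patterns for all squad agents."',
  'domain: "reliability, agent-coordination"',
  'confidence: "high"',
  "license: MIT",
].join("\n");

// Assumed template fields; adjust to the real skill template.
const report = checkSkillFrontmatter(
  skillFrontmatter,
  ["source"],
  ["description", "domain", "confidence", "source"],
);
console.log(report);
```

With these assumed lists, the report lists `source` as missing and `license` as unexpected, matching the two points raised in the comment.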

description: "Standard recovery patterns for all squad agents. When something fails, adapt — don't just report the failure."
domain: "reliability, agent-coordination"
confidence: "high"
license: MIT

Copilot AI Mar 28, 2026

Skill frontmatter deviates from the convention used by other skills: it includes license: MIT but is missing the source: field that’s present across existing skill templates. Consider adding a source: entry (e.g., earned/manual/extracted) and dropping license if it’s not consumed anywhere, to keep metadata consistent and avoid unused fields.

Suggested change
license: MIT
source: manual


## Fit with Existing Architecture

- **Complements** existing gent-conduct skill (which covers behavior) — this skill covers failure states

Copilot AI Mar 28, 2026

There’s a non-printable/control character before gent-conduct in this line, which will render incorrectly in Markdown and makes the skill reference harder to read/search. Replace it with plain text agent-conduct.

Suggested change
- **Complements** existing gent-conduct skill (which covers behavior) — this skill covers failure states
- **Complements** existing agent-conduct skill (which covers behavior) — this skill covers failure states
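Stray control characters like this are easy to miss by eye, so a scan can help. The sketch below is illustrative only: `findControlChars` is a hypothetical helper, and the sample uses U+0007 as a stand-in because the comment does not say which control character was actually present.

```typescript
interface ControlCharHit {
  line: number; // 1-based line number
  column: number; // 1-based column within the line
  codePoint: string; // e.g. "U+0007"
}

/** Reports C0 control characters (other than tab, LF, CR) and DEL in a text. */
function findControlChars(text: string): ControlCharHit[] {
  const hits: ControlCharHit[] = [];
  text.split("\n").forEach((content, index) => {
    const pattern = /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]/g;
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(content)) !== null) {
      const hex = match[0].charCodeAt(0).toString(16).toUpperCase().padStart(4, "0");
      hits.push({ line: index + 1, column: match.index + 1, codePoint: "U+" + hex });
    }
  });
  return hits;
}

// Stand-in sample: U+0007 where the leading "a" of agent-conduct should be.
const sample = "## Fit with Existing Architecture\n- existing \u0007gent-conduct skill";
console.log(findControlChars(sample));
```

Running a scan like this over the proposal and skill files before committing would surface the position of any such character.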

| **degrade** | Full functionality unavailable — deliver partial result with caveat |

The skill provides a selection guide table mapping error symptoms to the appropriate pattern,
plus prompt templates for each pattern that agents can use in their reasoning.

Copilot AI Mar 28, 2026

This proposal says the skill includes “prompt templates for each pattern,” but the new SKILL.md content only contains narrative guidance/examples (no reusable prompt templates). Either add the prompt-template sections to the skill files or adjust this proposal text so it matches what’s actually being shipped.

Suggested change
plus prompt templates for each pattern that agents can use in their reasoning.
plus narrative guidance and example prompts for each pattern that agents can adapt in their reasoning.

@bradygaster bradygaster merged commit 9c3156a into bradygaster:dev Mar 28, 2026
7 checks passed

3 participants