feat(watch): port ralph-watch.ps1 resilience features (#743) #744

Merged
tamirdresher merged 14 commits into bradygaster:insider from tamirdresher:squad/743-watch-parity
Apr 2, 2026
Conversation

@tamirdresher
Collaborator

Closes #743

Summary

Ports battle-tested resilience, observability, and execution quality features from ralph-watch.ps1 into the squad-cli watch TypeScript command. All features are org-agnostic, config-driven, and opt-in: without the new flags, behavior is identical to the current watch command.

New Modules (10 files in capabilities/)

P0 — Reliability

  • circuit-breaker.ts: ModelCircuitBreaker class. Tracks model failures, auto-fallback through a configurable chain, cooldown timer, state in .squad/ralph-circuit-breaker.json
  • health-check.ts: Pre-round watchdog that verifies gh auth, disk space, circuit breaker state, and branch drift. Includes PAT extraction from the git remote URL as an auth fallback.
  • post-failure.ts: Tiered self-healing with Tier 1 (reset CB), Tier 2 (re-auth), Tier 3 (git pull), Tier 4 (extended pause + alert)
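
The fallback behavior described for circuit-breaker.ts could look roughly like the sketch below. It is an in-memory illustration only, not the code from this PR: the class name SketchCircuitBreaker, the option names chain and maxFailures, the default values, and the injectable clock are all assumptions, and persistence to the JSON state file is left out.

```typescript
// Hypothetical sketch of a model-level circuit breaker with a fallback chain.
// Names and defaults are illustrative, not taken from circuit-breaker.ts.
interface BreakerOptions {
  chain: string[];          // models in order of preference
  maxFailures?: number;     // failures before a model's circuit opens
  cooldownMinutes?: number; // how long an open circuit stays open
  now?: () => number;       // injectable clock (ms), useful in tests
}

class SketchCircuitBreaker {
  private failures = new Map<string, number>();
  private openedAt = new Map<string, number>();
  private readonly chain: string[];
  private readonly maxFailures: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(options: BreakerOptions) {
    this.chain = options.chain;
    this.maxFailures = options.maxFailures ?? 3;
    this.cooldownMs = (options.cooldownMinutes ?? 5) * 60000;
    this.now = options.now ?? (() => Date.now());
  }

  private isOpen(model: string): boolean {
    const opened = this.openedAt.get(model);
    if (opened === undefined) return false;
    if (this.now() - opened >= this.cooldownMs) {
      // Cooldown elapsed: close the circuit and forget old failures.
      this.openedAt.delete(model);
      this.failures.delete(model);
      return false;
    }
    return true;
  }

  /** First model in the chain whose circuit is closed, or undefined. */
  currentModel(): string | undefined {
    return this.chain.find((m) => !this.isOpen(m));
  }

  recordFailure(model: string): void {
    const count = (this.failures.get(model) ?? 0) + 1;
    this.failures.set(model, count);
    if (count >= this.maxFailures) this.openedAt.set(model, this.now());
  }

  recordSuccess(model: string): void {
    this.failures.delete(model);
  }
}
```

A caller would ask currentModel() before each round, then report the outcome with recordFailure() or recordSuccess(). When every model in the chain is open, currentModel() returns undefined and the caller has to decide whether to wait or escalate.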

P1 — Execution Quality

  • priority.ts: Issue scoring by P0-P3 labels, age bonus, staleness, and bug/size modifiers. rankIssues() returns a sorted list.
  • machine-capabilities.ts: Match needs:* labels to the local machine. Auto-detect GPU, Docker, Playwright, etc.
  • stale-reclaim.ts: Reclaim issues assigned for more than 24h with no activity, then unassign and re-queue them.
  • budget-check.ts: Configurable max-issues-per-round gate.
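
To make the scoring idea concrete, here is a minimal sketch of label- and age-based ranking. The weights, the IssueLike shape, and the function names scoreIssue and rankIssuesSketch are assumptions chosen for illustration; the actual values and signatures in priority.ts are not shown in this PR description, and the staleness and size modifiers are omitted.

```typescript
// Hypothetical scoring sketch; the weights below are illustrative assumptions,
// not the values used in priority.ts.
interface IssueLike {
  number: number;
  labels: string[];
  createdAt: string; // ISO timestamp
}

const LABEL_SCORES: Record<string, number> = { P0: 100, P1: 75, P2: 50, P3: 25 };
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function scoreIssue(issue: IssueLike, now: Date): number {
  let score = 0;
  // Use the highest priority label present on the issue.
  for (const label of issue.labels) {
    score = Math.max(score, LABEL_SCORES[label] ?? 0);
  }
  // Bug modifier: a flat bonus.
  if (issue.labels.indexOf("bug") !== -1) score += 10;
  // Age bonus: one point per day, capped at 20 so old issues cannot outrank P0s.
  const ageDays = (now.getTime() - new Date(issue.createdAt).getTime()) / MS_PER_DAY;
  score += Math.min(Math.max(ageDays, 0), 20);
  return score;
}

function rankIssuesSketch(issues: IssueLike[], now: Date = new Date()): IssueLike[] {
  // Copy before sorting so the caller's array is left untouched.
  return [...issues].sort((a, b) => scoreIssue(b, now) - scoreIssue(a, now));
}
```

Capping the age bonus is one possible design choice: it lets old low-priority issues rise within their tier without overtaking a fresh P0.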

P2 — Observability

  • heartbeat.ts: Write .squad/ralph-heartbeat.json every round + structured log with rotation.
  • lockfile.ts: Per-repo lock with PID + timestamp + stale detection. Prevents duplicate watch instances.
  • webhook-alerts.ts: POST to any webhook (Slack, Discord, Teams) on consecutive failures above threshold.
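
The stale detection mentioned for lockfile.ts can be sketched as a pure decision function. The LockRecord field names, the 10-minute default, and the injected isProcessAlive probe are assumptions for illustration and may not match the module in this PR.

```typescript
// Hypothetical stale-lock check; field names and the 10-minute default are
// assumptions for illustration, not the contents of lockfile.ts.
interface LockRecord {
  pid: number;
  timestamp: string; // ISO time the lock was last written
}

function isLockStale(
  lock: LockRecord,
  isProcessAlive: (pid: number) => boolean,
  now: Date = new Date(),
  maxAgeMs: number = 10 * 60 * 1000,
): boolean {
  // A lock whose owner has exited can never be released by that owner.
  if (!isProcessAlive(lock.pid)) return true;
  // A live PID may belong to an unrelated, recycled process, so old locks expire too.
  const ageMs = now.getTime() - new Date(lock.timestamp).getTime();
  return ageMs > maxAgeMs;
}
```

In Node.js the probe could be built on process.kill(pid, 0), which sends no signal and throws if the process does not exist. Keeping the probe injectable makes the decision logic testable without real processes.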

New CLI Flags

| Flag | Description | Default |
| --- | --- | --- |
| --webhook-url | Webhook URL for failure alerts | |
| --alert-threshold | Consecutive failures before alert | 3 |
| --max-budget | Max issues per round | 5 |
| --capabilities | Machine caps (e.g., gpu,docker) | auto-detect |
| --health-check | Enable pre-round health check | off |
| --stale-reclaim | Enable stale work reclaim | off |
| --heartbeat | Enable heartbeat + structured log | off |
| --webhook-alerts | Enable webhook alerts | off |
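
As an illustration of how --alert-threshold might gate alerts, the sketch below counts consecutive failed rounds and signals when an alert is due. The class name and the choice to fire once the streak reaches the threshold (rather than exceeds it) are assumptions; the PR description does not specify the exact comparison.

```typescript
// Hypothetical consecutive-failure tracker; the >= comparison is an assumption.
class ConsecutiveFailureTracker {
  private consecutive = 0;
  private readonly threshold: number;

  constructor(threshold: number = 3) {
    this.threshold = threshold;
  }

  /** Records one round's outcome; returns true when an alert is due. */
  recordRound(succeeded: boolean): boolean {
    if (succeeded) {
      this.consecutive = 0; // any success resets the streak
      return false;
    }
    this.consecutive += 1;
    return this.consecutive >= this.threshold;
  }
}
```

The watch loop would call recordRound() after every round and POST to the configured webhook URL only when it returns true.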

Backward Compatibility

  • Zero new type errors introduced (baseline: 61 pre-existing FSStorageProvider errors)
  • Without any new flags, behavior is identical to current watch command
  • All features plug into the existing capability registry system

diberry and others added 8 commits March 31, 2026 16:19
Add concurrency blocks with cancel-in-progress to squad-ci, squad-heartbeat,
squad-triage, squad-label-enforce, and squad-issue-assign workflows.

Scope: .github/workflows/ only (squad repo CI).
Template workflows for customer repos are a separate product concern.

Test: 20 assertions covering all 5 workflows.

Refs: diberry#122

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Audited all contributor-facing files on dev branch and identified 8 gaps
in external contributor experience. Proposes 7 deliverables prioritized
by maintainer time savings:

P1: Issue templates, good-first-issue curation, .squad/ explainer
P2: CODE_OF_CONDUCT.md, contributor FAQ, README contributing section
P3: SECURITY.md typo fix

Goal: contributors self-serve from docs instead of asking Brady/Tamir
the same questions repeatedly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-guide-proposal

docs: contributor guide improvements proposal
…e3-item-3-1

devops(ci): add concurrency controls to 5 workflows (Phase 3 item A1)
Fixes off-by-N error in nap command's decision archival where newline
separators between entries weren't counted in the byte budget, causing
archives to exceed the target size.

Closes bradygaster#123

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-size-calc

fix(nap): account for separator newlines in decision archival budget
Port battle-tested features from ralph-watch.ps1 into squad-cli watch.
All features are org-agnostic, config-driven, and backward compatible.

P0 Reliability:
- ModelCircuitBreaker: model-level fallback with cooldown + state persistence
- Rate limit detection from API headers, predictive circuit opening
- Pre-round health check (auth, disk space, branch drift, CB validation)
- Post-failure remediation with tiered self-healing (reset CB, re-auth, git pull)

P1 Execution Quality:
- Issue priority scoring (P0-P3 labels, age, staleness, size, bug bonus)
- Machine capability checking (needs:* labels vs local probes)
- Stale work reclaim (unassign issues idle >24h)
- Budget check (max issues per round)

P2 Observability:
- Heartbeat file (.squad/ralph-heartbeat.json) written every round
- Structured log (.squad/ralph-watch.log) with rotation
- Per-repo lockfile with PID, stale detection
- Webhook alerts on consecutive failures (--webhook-url, --alert-threshold)

CLI flags: --webhook-url, --alert-threshold, --max-budget, --capabilities

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tamirdresher tamirdresher changed the base branch from dev to insider April 2, 2026 13:42
Copilot and others added 2 commits April 2, 2026 16:45
bradygaster#743)

Ported from ralph-watch.ps1 New-RatePool / Read-RatePool / Write-RatePool /
Update-RatePool budget coordination logic.

- New rate-pool.ts: tracks API call budget per interval window with
  file-based advisory locking (atomic temp+rename writes, retry on
  contention). Multiple Ralph instances share .squad/ralph-rate-pool.json.
- execute.ts: acquireSlot() gates each issue before agent spawn;
  releaseSlot() fires in finally block. Budget-exhausted issues are
  skipped with a log line, not failed.
- config.ts: watch.ratePool.maxCallsPerInterval (default 50) and
  watch.ratePool.intervalSeconds (default 600) wired through the
  three-tier merge (defaults < file < CLI).
- index.ts: re-exports RatePool and types from the capability barrel.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
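
The per-interval budget in the rate-pool commit above can be sketched with a fixed window counter. This is a single-process, in-memory illustration under stated assumptions: the class name RatePoolSketch and the injectable clock are hypothetical, the file-based locking and shared .squad/ralph-rate-pool.json state are omitted, and releaseSlot() is left out because its semantics are not described here. The defaults mirror the documented values (50 calls per 600 seconds).

```typescript
// Hypothetical fixed-window budget; not the implementation in rate-pool.ts.
class RatePoolSketch {
  private windowStart: number;
  private used = 0;
  private readonly maxCalls: number;
  private readonly intervalMs: number;
  private readonly now: () => number;

  constructor(
    maxCallsPerInterval: number = 50,
    intervalSeconds: number = 600,
    now: () => number = () => Date.now(),
  ) {
    this.maxCalls = maxCallsPerInterval;
    this.intervalMs = intervalSeconds * 1000;
    this.now = now;
    this.windowStart = now();
  }

  /** Returns true and consumes one call if budget remains in the current window. */
  acquireSlot(): boolean {
    const t = this.now();
    if (t - this.windowStart >= this.intervalMs) {
      // The interval has elapsed: start a new window with a full budget.
      this.windowStart = t;
      this.used = 0;
    }
    if (this.used >= this.maxCalls) return false;
    this.used += 1;
    return true;
  }
}
```

A caller that gets false from acquireSlot() would skip the issue with a log line rather than mark it failed, matching the behavior described for execute.ts.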
@tamirdresher tamirdresher added the skip-version-check Skip prerelease version guard for this PR label Apr 2, 2026
The stale package-lock.json resolved @bradygaster/squad-sdk from the
npm registry instead of the workspace link, causing workspace-integrity
and test (rollup native binary) CI failures.

Delete and regenerate the lockfile so npm resolves squad-sdk via the
workspace symlink.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tamirdresher
Collaborator Author

Bug Found During E2E Testing (Issue tamirdresher/tamresearch1#2034)

Circuit Breaker: cooldownMinutes: 0 silently ignored

The constructor uses || defaultValue which treats 0 as falsy:

// Current (buggy):
this.cooldownMinutes = options.cooldownMinutes || 5;

// Fix:
this.cooldownMinutes = options.cooldownMinutes ?? 5;

Same pattern likely affects other numeric config values that accept 0 as valid input (e.g., maxFailures, healthCheckIntervalMinutes).

Impact: Users who explicitly set cooldownMinutes: 0 (immediate retry after circuit reset) silently get the default (5 min) instead.

Severity: Low — edge case, but a correctness bug. One-line fix per affected field.
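
The difference between the two operators is easy to reproduce in isolation. The helper names below are hypothetical and exist only to contrast the behavior; they are not part of the circuit breaker.

```typescript
// || falls back on any falsy value (0, "", NaN, false, null, undefined).
function resolveWithOr(cooldownMinutes?: number): number {
  return cooldownMinutes || 5;
}

// ?? falls back only on null or undefined, so an explicit 0 is kept.
function resolveWithNullish(cooldownMinutes?: number): number {
  return cooldownMinutes ?? 5;
}

console.log(resolveWithOr(0));              // 5 (explicit 0 is lost)
console.log(resolveWithNullish(0));         // 0 (explicit 0 is kept)
console.log(resolveWithNullish(undefined)); // 5 (default applies)
```

Note that ?? still lets NaN through, so config values parsed from strings may need a separate Number.isFinite check.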


Found by B'Elanna (Tamir's Squad) — 47/47 other resilience tests pass
Full results: https://github.com/tamirdresher/tamresearch1/issues/2034

Copilot and others added 3 commits April 2, 2026 17:34
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…inks

- Regenerated package-lock.json from clean state to fix two CI failures:
  1. workspace-integrity: stale registry entry for @bradygaster/squad-sdk
     resolved to npmjs.org instead of local workspace link
  2. test: lockfile missing @rollup/rollup-linux-x64-gnu (only had win32
     platform entries)

- Fresh npm install produces lockfile with all platform optional deps
  and correct workspace symlinks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…gaster#739)

The watch/triage command in cli-entry.ts previously only parsed --interval,
silently ignoring all other flags (--monitor-teams, --execute, --board, etc.).

Our bradygaster#743 watch-parity work already added parsing for all registered capability
flags and new resilience flags. This commit completes the fix by adding unknown
flag detection: any --flag not in the known set now prints a warning instead of
being silently dropped.

Closes bradygaster#739
Refs bradygaster#743

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
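
The unknown-flag detection described in the commit above might be implemented along these lines. This is a sketch under assumptions: the function name findUnknownFlags and the way the known set is built are hypothetical, and the real parser in cli-entry.ts may differ.

```typescript
// Hypothetical unknown-flag scan; not the code in cli-entry.ts.
function findUnknownFlags(argv: string[], known: ReadonlySet<string>): string[] {
  const unknown: string[] = [];
  for (const arg of argv) {
    if (!arg.startsWith("--")) continue; // positional args and flag values
    // Support both "--flag value" and "--flag=value" forms.
    const eq = arg.indexOf("=");
    const name = eq === -1 ? arg : arg.slice(0, eq);
    if (!known.has(name)) unknown.push(name);
  }
  return unknown;
}

// A subset of the flags named in this PR, for demonstration only.
const KNOWN_FLAGS = new Set(["--interval", "--execute", "--board", "--monitor-teams", "--webhook-url"]);

for (const flag of findUnknownFlags(["--interval", "5", "--bogus"], KNOWN_FLAGS)) {
  console.warn(`Warning: unknown flag ${flag} ignored`);
}
```

Warning instead of exiting keeps existing scripts working while still surfacing typos that were previously dropped silently.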
@tamirdresher
Collaborator Author

Status Update

CI: 10/10 green

All feedback from @diberry has been addressed.

Ready for review. @diberry @bradygaster, please approve when ready and we will do an insider release.

— Tamir's Squad 🤖

@tamirdresher tamirdresher marked this pull request as ready for review April 2, 2026 15:35
@tamirdresher tamirdresher merged commit b81b20d into bradygaster:insider Apr 2, 2026
10 checks passed
@diberry diberry self-requested a review April 2, 2026 15:44

Labels

skip-version-check Skip prerelease version guard for this PR

Development

Successfully merging this pull request may close these issues.

feat(watch): Port ralph-watch.ps1 resilience features into squad-cli watch (org-agnostic)

3 participants