You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Phase 3: Deep Analysis — 30 Items Across 6 Categories
Category A: Reliability & Resilience
A1. Concurrency controls missing on most workflows
Problem: Only squad-docs.yml has concurrency settings. All other workflows can run multiple instances simultaneously — if 3 PRs push in quick succession, 3 squad-ci runs compete for resources.
Impact: HIGH — resource waste, potential race conditions on label/triage workflows
Problem:squad-promote.yml uses --no-commit --no-ff -X theirs to merge dev→preview. This auto-resolves ALL conflicts by taking dev's version without human review.
Impact: HIGH — bad code can silently override preview/main
Effort: MEDIUM
Fix: Switch to --no-ff without -X theirs. Fail on conflicts and require manual resolution. Add dry-run diff output showing exactly what would change.
Priority: P1
Owner: Procedures + Flight
A3. No rollback mechanism for docs deployments
Problem:squad-docs.yml deploys directly to GitHub Pages with no build output validation and no rollback. Broken docs go live instantly.
Impact: MEDIUM — user-facing docs broken until next push to main
Effort: MEDIUM
Fix: Add post-deploy smoke check (HTTP status on key pages). Store previous deployment artifact for rollback. Consider deployment environments with approval gates.
Priority: P2
Owner: PAO + Booster
A4. Insider publish creates inconsistent state on partial failure
Problem:squad-insider-publish.yml publishes SDK and CLI sequentially. If CLI publish fails after SDK succeeds, npm has mismatched SDK/CLI versions on the insider tag.
Impact: MEDIUM — insider users get broken dependency graph
Effort: MEDIUM
Fix: Add post-publish verification step. If CLI fails, emit clear error with manual recovery instructions. Consider publish-both-or-neither pattern with npm unpublish fallback (risky but atomic).
Priority: P2
Owner: Surgeon + Booster
A5. Release/publish workflow coordination gap
Problem:squad-release.yml (push to main) and squad-npm-publish.yml (release event) can trigger near-simultaneously. Release creates tag + GitHub Release, which triggers npm publish. No explicit handoff or dependency check.
Impact: MEDIUM — timing-sensitive; if release workflow slow, npm publish may start before tag exists
Effort: LOW
Fix: Add workflow_run trigger on squad-npm-publish.yml that waits for squad-release.yml completion instead of relying on release event timing.
Priority: P2
Owner: Booster
A6. Tag deduplication gaps in release workflows
Problem: Both squad-release.yml and squad-insider-release.yml create tags. If a tag already exists, git push fails. squad-release.yml checks for existing tags, but squad-insider-release.yml does not.
Impact: LOW — insider releases would fail on re-push; not catastrophic
Effort: LOW
Fix: Add tag existence check to squad-insider-release.yml matching the pattern in squad-release.yml.
Priority: P3
Owner: Booster
Category B: Testing Depth
B1. No cross-platform CI (Windows/macOS)
Problem: All CI runs on ubuntu-latest only. SDK/CLI published on npm for all platforms but never tested on Windows or macOS. Known platform-specific skips exist in tests (storage-provider, resolution, doctor).
Impact: HIGH — flakes erode trust in CI, developers ignore failures
Effort: HIGH (4-6 hours)
Fix: Add vitest --reporter=json output + post-test analysis script. Track per-test pass/fail across last N runs. Quarantine tests with >20% but <100% failure rate. Post flake report to PR comment.
B4. Acceptance tests (Gherkin BDD) not visible as separate CI gate
Problem: 8 Gherkin feature files exist in test/acceptance/ with full step definitions and terminal harness. But they run as part of the monolithic npm test — no visibility into acceptance test health vs unit test health.
Impact: LOW — informational gap, not a failure gap
Effort: LOW
Fix: Add separate vitest workspace config for acceptance tests. Report as distinct CI step in squad-ci.yml.
Priority: P3
Owner: FIDO
B5. Sample tests only run when SDK source changes
Problem:samples-build job in squad-ci.yml is gated on packages/squad-sdk/src/ changes. If a sample's own code breaks (e.g., bad import), CI won't catch it unless SDK is also changed.
Impact: MEDIUM — sample rot goes undetected
Effort: LOW
Fix: Expand path filter to include samples/** in trigger condition. Also add weekly cron run of samples-build.
Priority: P2
Owner: Booster
B6. No test sharding for faster execution
Problem: All 196 test files run in a single job (~4-5 min). As test count grows, this becomes a bottleneck.
Impact: LOW (currently fast enough)
Effort: MEDIUM
Fix: Use vitest --shard=1/N flag with matrix strategy when test count exceeds threshold (~300 files). Not needed now.
Priority: P3 (future)
Owner: Booster
B7. E2E CLI smoke test expansion
Problem:squad-npm-publish.yml has CLI packaging smoke test, but it only validates npm pack. No test of actual global install + command execution.
Impact: MEDIUM — global install breakage ships undetected
Effort: LOW (1-2 hours)
Fix: Extend smoke test to: npm install -g @bradygaster/squad-cli, squad --version, squad init --help, squad status.
Priority: P2
Owner: Booster + EECOM
Category C: Security & Supply Chain
C1. Third-party actions not pinned to SHAs
Problem: All first-party actions (actions/checkout, actions/setup-node) use @v4 tags. Third-party actions in squad-docs-links.yml use tags: lycheeverse/lychee-action@v2, peter-evans/create-issue-from-file@v5. Tags can be force-pushed by upstream maintainers.
Impact: HIGH — supply chain attack vector
Effort: LOW
Fix: Pin ALL actions (first-party and third-party) to full SHA commits. Use npx pin-github-action or Dependabot's action pinning feature. Example: actions/checkout@b4ffde65f... instead of @v4.
Priority: P1
Owner: Booster
C2. No dependency review on PRs
Problem: No Dependabot, Renovate, or dependency-review-action configured. New dependencies added via PR are not automatically audited for vulnerabilities or license issues.
Impact: HIGH — malicious/vulnerable transitive deps go unreviewed
Effort: LOW
Fix: Add .github/dependabot.yml for automated PR creation on dependency updates. Add actions/dependency-review-action@v4 to PR CI pipeline. Enable GitHub's built-in dependency graph.
Priority: P1
Owner: Booster + FIDO
C3. No SAST/security scanning in CI
Problem: No CodeQL, Snyk, SonarQube, or npm audit in CI. TypeScript strict mode and ESLint are the only static analysis.
Impact: HIGH — security regressions, known CVEs in deps go undetected
Effort: MEDIUM (3-4 hours)
Fix: Add npm audit --audit-level=moderate as CI gate. Add CodeQL analysis workflow for JavaScript/TypeScript. Consider github/codeql-action@v3 with auto-build.
Priority: P1
Owner: FIDO + Booster
C4. No CODEOWNERS file
Problem: No .github/CODEOWNERS. PRs don't auto-assign reviewers based on file paths. Anyone can modify workflow files without required review from CI owners.
Impact: MEDIUM — workflow changes not automatically routed to right reviewer
C5. Team.md parsing is a fragile injection surface
Problem: 6 workflows parse .squad/team.md or .ai-team/team.md using string splitting on |. The parsed values are used in GitHub API calls (label names, comments, assignments). A malicious team.md edit could inject unexpected values.
Impact: MEDIUM — limited to repos with write access, but violates defense-in-depth
Effort: MEDIUM
Fix: Extract team.md parsing into a single composite action with input validation (A7 recommendation). Sanitize all parsed values before use in API calls. Add regex validation for member names.
Priority: P2
Owner: Booster + RETRO
C6. No license compliance scanning
Problem: No automated check for license compatibility of npm dependencies. SDK/CLI are published packages — transitive dependency licenses matter.
Impact: MEDIUM — legal compliance risk for downstream consumers
Effort: LOW
Fix: Add license-checker or licensee check to CI. Maintain allowlist of approved licenses (MIT, Apache-2.0, ISC, BSD-*).
Priority: P3
Owner: Procedures
Category D: Performance & Cost
D1. No build artifact caching between jobs
Problem: Multiple jobs in squad-ci.yml run npm ci and npm run build independently. The test, exports-map-check, export-smoke-test, and samples-build jobs all rebuild from scratch.
Impact: MEDIUM — ~2 min wasted per run on redundant builds
Effort: MEDIUM
Fix: Use actions/cache or actions/upload-artifact + actions/download-artifact to share build output between jobs. Build once in a build job, download in dependent jobs.
Priority: P2
Owner: Booster
D2. Playwright browser install is unconditional
Problem:squad-ci.yml always installs Playwright browsers (~30s) even if no Playwright tests would run.
Impact: LOW — 30 seconds per run
Effort: LOW
Fix: Already identified in original analysis (item 2.4). Cache Playwright browsers. Skip install if only doc changes.
Priority: P3
Owner: Booster
D3. action_required conclusions need investigation
Problem: 4 out of 22 recent Squad CI runs show action_required conclusion. These are likely first-time fork contributors needing workflow approval. But the pattern creates noise in CI history.
Impact: LOW — informational
Effort: LOW
Fix: Document that action_required is expected for fork PRs. Consider auto-approve for known contributors. Add to CI runbook.
Priority: P3
Owner: Procedures
D4. Expand Node version matrix for insider builds
Problem:squad-insider-publish.yml uses a matrix strategy but only tests Node 20. Main CI only tests Node 22. Node 18 (LTS until April 2025) is untested.
Impact: MEDIUM — Node version incompatibilities ship undetected
Effort: LOW
Fix: Expand insider matrix to [20, 22]. Add Node 22 validation to insider builds (currently only Node 20). Align with engines field (>=22.5.0).
Priority: P2
Owner: Booster
Category E: Developer Experience
E1. No .nvmrc or .node-version file
Problem:package.json specifies engines.node: ">=22.5.0" but there's no .nvmrc or .node-version file. Contributors must read package.json to know which Node version to use.
Impact: LOW — friction for new contributors
Effort: LOW (1 min)
Fix: Create .nvmrc with 22 or .node-version with 22.5.0.
Priority: P3
Owner: Booster
E2. No local CI simulation (act config)
Problem: Workflows have excellent inline documentation for running individual gates locally, but there's no way to run the full CI pipeline locally. No act config, no Makefile.
Impact: MEDIUM — contributors push to test, wasting CI minutes and time
Effort: MEDIUM (2-3 hours)
Fix: Add Makefile or scripts/ci-local.sh that runs all CI gates in sequence: build, test, lint, changelog check, exports check, samples build. Optional: add .actrc for GitHub Actions local runner.
Priority: P2
Owner: Booster + EECOM
E3. CI failure messages could be more actionable
Problem: Some CI gates output generic failure messages. For example, the large deletion guard just says "PR deletes more than 50 files" without showing which files. The exports map check doesn't show which export is missing.
Impact: LOW — developers can dig into logs, but friction is real
Effort: LOW
Fix: Improve error messages in each gate to include: what failed, what the expected state is, and how to fix it. Use ::error file=... annotations for file-specific failures.
Priority: P3
Owner: Booster
E4. Status check names not consistently prefixed
Problem: CI jobs have names like test, docs-quality, changelog-gate — but in the GitHub Checks UI, they appear under the workflow name "Squad CI". The individual job names aren't immediately recognizable in the PR status list.
Impact: LOW — cosmetic
Effort: LOW
Fix: Prefix job names consistently: "Squad CI / Build & Test", "Squad CI / Changelog Gate", etc. This makes the Checks list scannable.
Priority: P3
Owner: Booster
Category F: Observability & Monitoring
F1. No CI metrics collection
Problem: No tracking of CI pass rate, duration trends, or flake rate over time. Can't answer "is CI getting slower?" or "what's our flake rate this month?"
Impact: MEDIUM — can't detect gradual degradation
Effort: MEDIUM (3-4 hours)
Fix: Add post-run step that writes metrics to a JSON artifact. Create weekly cron workflow that aggregates metrics via gh run list API and posts summary to a discussion or issue.
Priority: P2
Owner: Booster + Flight
F2. No alerts for CI degradation
Problem: If CI starts failing consistently, no one is notified. The team discovers failures when they check GitHub.
Impact: MEDIUM — delayed response to CI breakage
Effort: LOW
Fix: Add a daily/weekly cron workflow that checks recent CI pass rate via gh run list. If pass rate drops below 80%, create an issue or send notification. Could integrate with existing Ralph heartbeat.
Priority: P2
Owner: Booster + Ralph
F3. No scheduled link checking
Problem:squad-docs-links.yml is manual-only. External links rot silently.
Problem: No tracking of build/test duration. If a PR adds 60 seconds to test time, no one notices until it accumulates.
Impact: LOW (CI is fast now)
Effort: MEDIUM
Fix: Record build/test duration in CI artifacts. Add threshold alert if total CI time exceeds 8 minutes (currently ~4-5 min). Post warning to PR if duration increases >20%.
Priority: P3 (future)
Owner: Booster
Summary Table
#
Category
Item
Impact
Effort
Priority
Owner
A1
Reliability
Concurrency controls
HIGH
LOW
P1
Booster
A2
Reliability
Promote merge strategy fix
HIGH
MEDIUM
P1
Procedures + Flight
A3
Reliability
Docs deployment rollback
MEDIUM
MEDIUM
P2
PAO + Booster
A4
Reliability
Insider publish atomicity
MEDIUM
MEDIUM
P2
Surgeon + Booster
A5
Reliability
Release/publish coordination
MEDIUM
LOW
P2
Booster
A6
Reliability
Insider tag deduplication
LOW
LOW
P3
Booster
B1
Testing
Cross-platform CI
HIGH
MEDIUM
P1
Booster + FIDO
B2
Testing
Docker skip guards
MEDIUM
LOW
P1
FIDO
B3
Testing
Flaky test quarantine
HIGH
HIGH
P2
FIDO + Booster
B4
Testing
Acceptance test visibility
LOW
LOW
P3
FIDO
B5
Testing
Sample test trigger expansion
MEDIUM
LOW
P2
Booster
B6
Testing
Test sharding
LOW
MEDIUM
P3
Booster
B7
Testing
E2E CLI smoke expansion
MEDIUM
LOW
P2
Booster + EECOM
C1
Security
SHA-pin all actions
HIGH
LOW
P1
Booster
C2
Security
Dependency review on PRs
HIGH
LOW
P1
Booster + FIDO
C3
Security
SAST/security scanning
HIGH
MEDIUM
P1
FIDO + Booster
C4
Security
CODEOWNERS file
MEDIUM
LOW
P2
Flight + Procedures
C5
Security
Team.md parsing hardening
MEDIUM
MEDIUM
P2
Booster + RETRO
C6
Security
License compliance
MEDIUM
LOW
P3
Procedures
D1
Performance
Build artifact caching
MEDIUM
MEDIUM
P2
Booster
D2
Performance
Conditional Playwright install
LOW
LOW
P3
Booster
D3
Performance
Document action_required
LOW
LOW
P3
Procedures
D4
Performance
Node version matrix expansion
MEDIUM
LOW
P2
Booster
E1
DX
.nvmrc file
LOW
LOW
P3
Booster
E2
DX
Local CI simulation
MEDIUM
MEDIUM
P2
Booster + EECOM
E3
DX
Actionable failure messages
LOW
LOW
P3
Booster
E4
DX
Status check name consistency
LOW
LOW
P3
Booster
F1
Observability
CI metrics collection
MEDIUM
MEDIUM
P2
Booster + Flight
F2
Observability
CI degradation alerts
MEDIUM
LOW
P2
Booster + Ralph
F3
Observability
Scheduled link checking
LOW
LOW
P3
Booster
F4
Observability
Build time regression detection
LOW
MEDIUM
P3
Booster
Recommended Implementation Order
Wave 1: Security + Reliability Foundation (P1, ~6-8 hours)
Ship these first — they prevent real incidents:
C1 SHA-pin all actions (30 min) — supply chain protection
Cost is not a concern — 145 min/week is well within GitHub's free tier. Cross-platform CI (Wave 2) will add ~50-100 min/week, still manageable.
The original 19-item analysis lives at docs/proposals/ci-hardening-opportunities.md in the bradygaster/squad repo. This Phase 3 goes deeper with 30 new items.
All items marked "Booster" — this is the CI/DevOps agent role. If no Booster exists, route to whoever owns .github/workflows/.
Filed by Flight. Dina: route from Wave 1 down. Each wave is independently shippable.
CI Hardening Phase 3: Deep Analysis — Post bradygaster#699/bradygaster#700/bradygaster#701
Author: Flight (Lead)
Date: 2026-03-30
Scope: Every workflow, action, test, and operational pattern in bradygaster/squad CI
Refs: #121 (original 19-item analysis), bradygaster#699 (workarounds), bradygaster#700 (Phase 1), bradygaster#701 (Phase 2)
What's Been Done (Phases 1-2)
Phase 1 — PR bradygaster#700 (MERGED) ✅
5 quick wins implemented:
Phase 2 — PR bradygaster#701 (OPEN, pending merge) 🔄
5 short-term items + security hardening:
6. Lockfile stability — drift detection in preflight (prevents v0.9.1-style incidents)
7. Composite action — DRY setup-node + npm install wrapper (
.github/actions/setup-squad-node/)8. Ralph cron audit — documented "no cron" as intentional design decision
9. API rate limit monitoring — warns <100, fails <10 remaining (state-only logging for security)
10. npm registry health check —
npm ping+ namespace check before publishSecurity fixes: shell injection prevention in composite action, lockfile integrity hash validation, rate limit count redaction from public logs.
PR bradygaster#699 (MERGED) ✅
Combined CI workarounds for @github/copilot-sdk ESM bug (vscode-jsonrpc/node module resolution).
Current CI Health Snapshot (2026-03-30)
145 min/week ($0.35/month)Phase 3: Deep Analysis — 30 Items Across 6 Categories
Category A: Reliability & Resilience
A1. Concurrency controls missing on most workflows
squad-docs.ymlhas concurrency settings. All other workflows can run multiple instances simultaneously — if 3 PRs push in quick succession, 3squad-ciruns compete for resources.concurrency: { group: '${{ github.workflow }}-${{ github.ref }}', cancel-in-progress: true }tosquad-ci.yml,squad-heartbeat.yml,squad-triage.yml,squad-label-enforce.yml,squad-issue-assign.ymlA2. Promote workflow uses dangerous merge strategy
squad-promote.ymluses--no-commit --no-ff -X theirsto merge dev→preview. This auto-resolves ALL conflicts by taking dev's version without human review.--no-ffwithout-X theirs. Fail on conflicts and require manual resolution. Add dry-run diff output showing exactly what would change.A3. No rollback mechanism for docs deployments
squad-docs.ymldeploys directly to GitHub Pages with no build output validation and no rollback. Broken docs go live instantly.A4. Insider publish creates inconsistent state on partial failure
squad-insider-publish.ymlpublishes SDK and CLI sequentially. If CLI publish fails after SDK succeeds, npm has mismatched SDK/CLI versions on theinsidertag.A5. Release/publish workflow coordination gap
squad-release.yml(push to main) andsquad-npm-publish.yml(release event) can trigger near-simultaneously. Release creates tag + GitHub Release, which triggers npm publish. No explicit handoff or dependency check.workflow_runtrigger onsquad-npm-publish.ymlthat waits forsquad-release.ymlcompletion instead of relying on release event timing.A6. Tag deduplication gaps in release workflows
squad-release.ymlandsquad-insider-release.ymlcreate tags. If a tag already exists,git pushfails.squad-release.ymlchecks for existing tags, butsquad-insider-release.ymldoes not.squad-insider-release.ymlmatching the pattern insquad-release.yml.Category B: Testing Depth
B1. No cross-platform CI (Windows/macOS)
ubuntu-latestonly. SDK/CLI published on npm for all platforms but never tested on Windows or macOS. Known platform-specific skips exist in tests (storage-provider, resolution, doctor).squad-cross-platform.ymlwith matrix:{os: [ubuntu-latest, windows-latest, macos-latest], node: [20, 22]}. Gate behindworkflow_dispatch+ weekly cron. Opt-in via repo variable.B2. Docker-dependent tests need skip guards
describe.skipIf(!hasDocker)guards. Must ship before B1.B3. Flaky test quarantine system
--reporter=jsonoutput + post-test analysis script. Track per-test pass/fail across last N runs. Quarantine tests with >20% but <100% failure rate. Post flake report to PR comment.B4. Acceptance tests (Gherkin BDD) not visible as separate CI gate
test/acceptance/with full step definitions and terminal harness. But they run as part of the monolithicnpm test— no visibility into acceptance test health vs unit test health.squad-ci.yml.B5. Sample tests only run when SDK source changes
samples-buildjob insquad-ci.ymlis gated onpackages/squad-sdk/src/changes. If a sample's own code breaks (e.g., bad import), CI won't catch it unless SDK is also changed.samples/**in trigger condition. Also add weekly cron run of samples-build.B6. No test sharding for faster execution
--shard=1/Nflag with matrix strategy when test count exceeds threshold (~300 files). Not needed now.B7. E2E CLI smoke test expansion
squad-npm-publish.ymlhas CLI packaging smoke test, but it only validatesnpm pack. No test of actual global install + command execution.npm install -g @bradygaster/squad-cli,squad --version,squad init --help,squad status.Category C: Security & Supply Chain
C1. Third-party actions not pinned to SHAs
actions/checkout,actions/setup-node) use@v4tags. Third-party actions insquad-docs-links.ymluse tags:lycheeverse/lychee-action@v2,peter-evans/create-issue-from-file@v5. Tags can be force-pushed by upstream maintainers.npx pin-github-actionor Dependabot's action pinning feature. Example:actions/checkout@b4ffde65f...instead of@v4.C2. No dependency review on PRs
dependency-review-actionconfigured. New dependencies added via PR are not automatically audited for vulnerabilities or license issues..github/dependabot.ymlfor automated PR creation on dependency updates. Addactions/dependency-review-action@v4to PR CI pipeline. Enable GitHub's built-in dependency graph.C3. No SAST/security scanning in CI
npm auditin CI. TypeScript strict mode and ESLint are the only static analysis.npm audit --audit-level=moderateas CI gate. Add CodeQL analysis workflow for JavaScript/TypeScript. Considergithub/codeql-action@v3with auto-build.C4. No CODEOWNERS file
.github/CODEOWNERS. PRs don't auto-assign reviewers based on file paths. Anyone can modify workflow files without required review from CI owners..github/CODEOWNERSwith:.github/workflows/* @bradygaster,packages/squad-sdk/* @eecom-owner,packages/squad-cli/* @eecom-owner,docs/* @pao-owner.C5. Team.md parsing is a fragile injection surface
.squad/team.mdor.ai-team/team.mdusing string splitting on|. The parsed values are used in GitHub API calls (label names, comments, assignments). A malicious team.md edit could inject unexpected values.C6. No license compliance scanning
license-checkerorlicenseecheck to CI. Maintain allowlist of approved licenses (MIT, Apache-2.0, ISC, BSD-*).Category D: Performance & Cost
D1. No build artifact caching between jobs
squad-ci.ymlrunnpm ciandnpm run buildindependently. Thetest,exports-map-check,export-smoke-test, andsamples-buildjobs all rebuild from scratch.actions/cacheoractions/upload-artifact+actions/download-artifactto share build output between jobs. Build once in abuildjob, download in dependent jobs.D2. Playwright browser install is unconditional
squad-ci.ymlalways installs Playwright browsers (~30s) even if no Playwright tests would run.D3.
action_requiredconclusions need investigationaction_requiredconclusion. These are likely first-time fork contributors needing workflow approval. But the pattern creates noise in CI history.action_requiredis expected for fork PRs. Consider auto-approve for known contributors. Add to CI runbook.D4. Expand Node version matrix for insider builds
squad-insider-publish.ymluses a matrix strategy but only tests Node 20. Main CI only tests Node 22. Node 18 (LTS until April 2025) is untested.[20, 22]. Add Node 22 validation to insider builds (currently only Node 20). Align withenginesfield (>=22.5.0).Category E: Developer Experience
E1. No
.nvmrcor.node-versionfilepackage.jsonspecifiesengines.node: ">=22.5.0"but there's no.nvmrcor.node-versionfile. Contributors must readpackage.jsonto know which Node version to use..nvmrcwith22or.node-versionwith22.5.0.E2. No local CI simulation (act config)
actconfig, no Makefile.Makefileorscripts/ci-local.shthat runs all CI gates in sequence: build, test, lint, changelog check, exports check, samples build. Optional: add.actrcfor GitHub Actions local runner.E3. CI failure messages could be more actionable
::error file=...annotations for file-specific failures.E4. Status check names not consistently prefixed
test,docs-quality,changelog-gate— but in the GitHub Checks UI, they appear under the workflow name "Squad CI". The individual job names aren't immediately recognizable in the PR status list.Category F: Observability & Monitoring
F1. No CI metrics collection
gh run listAPI and posts summary to a discussion or issue.F2. No alerts for CI degradation
gh run list. If pass rate drops below 80%, create an issue or send notification. Could integrate with existing Ralph heartbeat.F3. No scheduled link checking
squad-docs-links.ymlis manual-only. External links rot silently.squad-docs-links.yml:schedule: [{cron: '0 9 * * 1'}].F4. No build time regression detection
Summary Table
Recommended Implementation Order
Wave 1: Security + Reliability Foundation (P1, ~6-8 hours)
Ship these first — they prevent real incidents:
Wave 2: Testing Coverage + DX (P1-P2, ~8-10 hours)
Ship after Wave 1 — expands what CI catches:
Wave 3: Observability + Polish (P2-P3, ~6-8 hours)
Ship when bandwidth allows:
Backlog (P3, defer until needed):
18-30. All P3 items — implement when relevant or as quick wins during other work.
Agent Routing Summary
Notes
docs/proposals/ci-hardening-opportunities.mdin the bradygaster/squad repo. This Phase 3 goes deeper with 30 new items..github/workflows/.Filed by Flight. Dina: route from Wave 1 down. Each wave is independently shippable.