Skip to content

feat(identity): retry resilience + PR22 nits (H-03)#23

Merged
sabbour merged 2 commits intodevfrom
squad/identity-retry-resilience
Apr 21, 2026
Merged

feat(identity): retry resilience + PR22 nits (H-03)#23
sabbour merged 2 commits intodevfrom
squad/identity-retry-resilience

Conversation

@sabbour
Copy link
Copy Markdown
Owner

@sabbour sabbour commented Apr 21, 2026

Summary

Implements H-03 from the identity hardening roadmap and cleans up the 3 nits Flight flagged in PR #22 review.


H-03: Retry with exponential backoff

What changed

  • resolveTokenWithDiagnostics and resolveToken accept an optional options?: { retryPolicy?: RetryPolicy } parameter.
  • When retryPolicy is omitted, a single attempt runs — fully backward-compatible.
  • When provided (e.g. retryPolicy: {}), defaults apply: maxRetries: 2, initialDelayMs: 500ms, maxDelayMs: 4000ms.

Retry policy table

Condition Retries? Notes
Network error (fetch rejection) ECONNRESET, connection refused, etc.
5xx response Transient server error
429 with Retry-After Uses header delay exactly
429 without Retry-After Uses backoff
4xx except 429 Bad creds don't benefit from retry
AbortError / timeout Per-attempt budget already expired
not-configured Configuration problem, not transient

Timeout semantics

Each attempt has its own 10-second AbortController (created inside getInstallationToken). The timeout budget is not shared across retries. Worst-case wall time: (maxRetries + 1) × 10s + backoff delays ≈ 31.5s with defaults.

New SDK exports

  • GitHubApiError — typed error with status and retryAfterMs fields
  • RetryExhaustedError — marker for all-attempts-failed
  • RetryPolicy interface — maxRetries, initialDelayMs, maxDelayMs, onRetry, random
  • TokenResolveError.retriesExhausted: boolean — new field

Sample retry trace (via onRetry hook)

await resolveTokenWithDiagnostics(dir, 'lead', {
  retryPolicy: {
    onRetry: (attempt, reason, delayMs) =>
      console.log(`[retry] attempt ${attempt}: ${reason} (wait ${delayMs}ms)`),
  },
});
// [retry] attempt 1: GitHub API error 503: Service Unavailable (wait 487ms)
// [retry] attempt 2: GitHub API error 503: Service Unavailable (wait 1023ms)

PR #22 Nit Fixes (Flight's review — doc)

N-1 (tokens.ts:230-250): getInstallationPermissions previously made a redundant GET /installation/repositories?per_page=1 before the actual GET /installation call. The first response was checked for .ok but otherwise unused. Now: single GET /installation call only.

N-2 (tokens.ts:225-262): Shared AbortController across two sequential fetches meant worst-case combined latency was ~20s. Fixed: one controller per fetch. (N-1 fix also removes the second fetch, making N-2 moot for the shared-controller issue.)

N-3 (identity.ts:1280-1285): WSL NTFS-mounted paths (e.g. /mnt/c/) return stat.mode & 0o777 === 0o777 for all files — drvfs doesn't track POSIX permissions. The mode-0o600 doctor check now detects this and returns ⚠ skipped (drvfs): mode bits unreliable on NTFS-mounted paths instead of failing.


Test coverage

Test file Before After
test/identity/retry.test.ts 0 12 (new)
test/identity/doctor.test.ts 10 11 (+1 drvfs case)
All identity tests 164 177

New retry tests cover: success (no retry), 500 retry, 429 + Retry-After, exhaustion → retriesExhausted:true, AbortError no-retry, not-configured no-retry, onRetry callback args, jitter seam (injectable random), 4xx no-retry, network error retry, GitHubApiError fields, RetryExhaustedError fields.


Docs

  • docs/identity/retry-policy.md — retry contract, policy table, timeout semantics, testing patterns
  • .copilot/skills/injectable-random/SKILL.md — reusable pattern for injectable randomness seams in tests

Working as EECOM (Core Dev)

- Add RetryPolicy interface with maxRetries/initialDelayMs/maxDelayMs/onRetry/random
- Add GitHubApiError class (carries status + retryAfterMs for Retry-After support)
- Add RetryExhaustedError marker class for caller diagnosis
- resolveTokenWithDiagnostics/resolveToken accept optional retryPolicy — opt-in,
  backward-compatible. Each retry gets its own 10s AbortController budget.
- TokenResolveError gains retriesExhausted: boolean field
- Export GitHubApiError, RetryExhaustedError, RetryPolicy from SDK public API

N-1: getInstallationPermissions — single GET /installation call (removed redundant
     /installation/repositories preflight)
N-2: getInstallationPermissions — dedicated AbortController per fetch
N-3: doctor mode-0o600 check — detect drvfs quirk (mode=0o777 on NTFS-mounted WSL
     paths) and skip assertion with ⚠ skipped (drvfs) detail

Tests: 12 new retry cases + 1 drvfs doctor case (177 total, was 164)
Docs: docs/identity/retry-policy.md
Skill: .copilot/skills/injectable-random/SKILL.md

Closes #H-03

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Owner Author

@sabbour sabbour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Flight Verdict: Approve

Reviewer: Flight (Lead) · 2026-04-21

Self-review constraint: GitHub blocks --approve on own PR. Verdict is APPROVE — requesting Ahmed or another maintainer to apply the formal approval.

All 10 hard checks pass. All 3 PR #22 nits verified fixed. Clean scope (H-03 + nits only), no protected file touches, no scope creep.

Hard Check Summary

# Check Result
1 Changeset names @bradygaster/squad-sdk: minor, @bradygaster/squad-cli: patch
2 Opt-in retry (zero change without policy)
3 Retry filter correctness ✅ All 7 conditions verified + tested
4 Fresh AbortController per attempt
5 Jitter determinism (injectable random)
6 Retry-After honoring on 429
7 retriesExhausted field correctness
8 Token leakage ✅ None
9 Protected files (resolve-token.mjs) ✅ Untouched
10 Scope creep ✅ H-03 + 3 nits only

PR #22 Nit Fixes

  • N-1 (redundant preflight GET): ✅ Single GET /installation now
  • N-2 (shared AbortController): ✅ Moot with N-1, but correct
  • N-3 (drvfs false-fail): ✅ Checks mode === 0o777 specifically, returns ⚠ skipped (drvfs)

New Nit (non-blocking)

N-1: test/identity/retry.test.ts:33 — dead import withRetry as _withRetry. withRetry is module-private (not exported from barrel). Variable never used. Remove the line.

Positive Call-Outs

  1. Opt-in design — zero overhead for existing callers
  2. GitHubApiError with typed status + retryAfterMs fields — no string parsing
  3. Injectable random seam — deterministic tests, codified as reusable skill
  4. Retry-After honoring — correct 429 behavior vs pure exponential
  5. 12 focused tests covering every filter branch
  6. Clean docs (docs/identity/retry-policy.md) + thorough changeset

Full review: docs/reviews/pr-23-retry-review-2026-04-21.md

Merge action: ✅ Merge when ready. Nit can be cleaned in follow-up.

Review artifacts for PR #23 (identity retry resilience + PR #22 nit cleanup).
Verdict: Approve. All 10 hard checks pass, all 3 PR #22 nits verified fixed.
One non-blocking nit flagged (dead import in test file).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@sabbour sabbour merged commit 7829f6e into dev Apr 21, 2026
@sabbour sabbour deleted the squad/identity-retry-resilience branch April 21, 2026 09:30
sabbour added a commit that referenced this pull request Apr 21, 2026
… cleanup (#24)

* docs(identity): GITHUB_TOKEN vs GH_TOKEN precedence (H-13) + test nit cleanup

- Add docs/identity/token-precedence.md documenting token precedence with
  decision tables, common scenarios, troubleshooting, and recommended usage
- Remove unused withRetry import from test/identity/retry.test.ts (PR #23 nit)
- Add changeset for docs-only patch

Closes H-13

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs: add Flight review for PR #25 (resolve-token canonicalization)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(docs): correct gh CLI precedence order in scenario 3 (PR #24 review)

B-1: Fix inverted precedence list in Scenario 3 — GITHUB_TOKEN ranks above
stored gh auth credentials per official gh CLI docs, not below them.

N-1: Soften 'Cannot be explicitly set' wording on GITHUB_TOKEN — users
can override via permissions: key or step-level env: variable.

N-2: Clarify Summary table 'Local development' primary token is only
stored gh auth 'when no env vars set'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Leela Lead Bot <bot@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant