Skip to content

fix: lazy Keychain reads to stop hourly password prompts#456

Merged
PureWeen merged 5 commits intomainfrom
fix/auth-keychain-lazy
Mar 30, 2026
Merged

fix: lazy Keychain reads to stop hourly password prompts#456
PureWeen merged 5 commits intomainfrom
fix/auth-keychain-lazy

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

Problem

After PR #446 merged, a user reported: "PP now restarts every hour or so, and asks for my login password to authenticate to copilot-cli"

Root Cause Analysis

Three interacting bugs create a recurring password prompt cycle:

Bug 1: Preemptive Keychain read at startup

InitializeAsync (line 932) called ResolveGitHubTokenForServer() unconditionally in Persistent mode. This runs security find-generic-password -s "copilot-cli" -w, which triggers a macOS Keychain ACL password dialog because PolyPilot is not in the ACL for the copilot-cli entry (only the terminal copilot binary that created it is). Every user saw this dialog on every app launch — even users whose server could self-authenticate fine.

Bug 2: Polling loop re-reads Keychain every 10 seconds

The auth polling loop at Utilities.cs:916 called ResolveGitHubTokenForServer() whenever it detected auth went from false → true. Since polling runs every 10s, this means:

  • If the Keychain dialog times out (3s RunProcessWithTimeout), token comes back null → server starts without auth → polling detects no auth → another dialog in 10 seconds (tight loop)
  • If the user clicks Allow, the token is static and will expire

Bug 3: Static token expiration creates hourly cycle

The gho_* OAuth token forwarded as COPILOT_GITHUB_TOKEN env var is a static snapshot. The copilot CLI normally refreshes tokens via keytar.node, but an env var bypasses that refresh mechanism. When the token expires (~1h, aligned with WatchdogMaxProcessingTimeSeconds = 3600):

  1. All sessions fail → watchdog fires after 2 consecutive timeouts
  2. TryRecoverPersistentServerAsync restarts server with same stale cached token
  3. Server still unauthenticated → CheckAuthStatusAsync starts polling
  4. Polling detects auth → calls ResolveGitHubTokenForServer()Keychain dialog again

Additionally, the ResolveGitHubTokenForServer() call in the polling loop was the one call site not wrapped in Task.Run (the R5/R7 fixes covered InitializeAsync and ReauthenticateAsync but missed this third site).

Design Decisions

Why not just remove Keychain reads entirely?

Because it would regress the original PR #446 fix. The original issue was users getting "Session was not created with authentication info or custom provider" because the headless server binary cannot access the Keychain (macOS ACL restriction — different binary path). If we remove Keychain reads, those users have no way to authenticate — copilot login writes to Keychain under the terminal binary's ACL, so the headless server still can't read it. The Keychain read is the only fix for users without gh CLI or env vars set.

Why split into ResolveGitHubTokenFromEnv vs ResolveGitHubTokenForServer?

To make the safety boundary explicit in the API:

  • ResolveGitHubTokenFromEnv()always safe, no subprocess, no prompt, instant. Used at startup.
  • ResolveGitHubTokenForServer()dangerous, may trigger Keychain dialog. Only called on explicit user action or after confirmed auth failure.

The doc comments on ResolveGitHubTokenForServer now include a ⚠️ warning and reference the skill invariants (INV-A1, INV-A2) to prevent future agents from calling it preemptively.

Why lazy resolution in CheckAuthStatusAsync instead of just showing the banner?

For users whose server genuinely can't self-authenticate (the original PR #446 users), showing a banner and telling them to run copilot login doesn't help — they've already done that. The Keychain entry exists; the server just can't read it. Lazy resolution means:

  1. Server starts without token (no prompt)
  2. CheckAuthStatusAsync detects auth failure
  3. First time only (_resolvedGitHubToken == null): resolves full chain including Keychain → one password dialog → caches token → restarts server
  4. If the restart works → user is authenticated with zero manual intervention after the one dialog
  5. On subsequent failures (token expiry): uses cached token, no new dialog

Why does the polling loop no longer re-resolve?

The polling loop runs every 10 seconds. Re-reading the Keychain on every cycle means a password dialog every 10 seconds if the user doesn't respond. Instead, the polling loop now uses _resolvedGitHubToken (already cached from the lazy resolution or from ReauthenticateAsync). Only the explicit "Re-authenticate" button triggers a fresh Keychain read — that's an intentional user action where a password prompt is expected.

Changes

File Change
CopilotService.Utilities.cs Split ResolveGitHubTokenForServer into env-only (ResolveGitHubTokenFromEnv) and full version
CopilotService.Utilities.cs CheckAuthStatusAsync: lazy full-chain resolve on first auth failure, auto-restart with token
CopilotService.Utilities.cs Polling loop: use cached token only, removed ResolveGitHubTokenForServer() call
CopilotService.cs InitializeAsync: use ResolveGitHubTokenFromEnv() (no Keychain, no prompt)
ServerRecoveryTests.cs Add test for ResolveGitHubTokenFromEnv
.claude/skills/auth-token-safety/SKILL.md New skill: 9 invariants from PR #446 regression analysis

Behavior Matrix

User type Before (PR #446) After (this PR)
Server self-authenticates ❌ Password dialog at every startup ✅ Zero prompts
Keychain ACL issue, first launch ❌ Password dialog at startup ✅ One dialog after auth failure detected
Keychain ACL issue, token expires ❌ Password dialog every hour ✅ No dialog (uses cached token)
Keychain dialog timeout (3s) ❌ Dialog every 10s (tight loop) ✅ No loop (polling uses cache)
User clicks Re-authenticate ✅ Fresh Keychain read ✅ Fresh Keychain read (unchanged)
Has GH_TOKEN env var ✅ No dialog ✅ No dialog (env-only at startup)

Tests

  • 3059/3059 pass ✅
  • New test: ResolveGitHubTokenFromEnv_ReturnsNull_WhenNoEnvVarsSet

Related

PureWeen and others added 2 commits March 30, 2026 09:21
Root cause: ResolveGitHubTokenForServer() reads macOS Keychain via
`security find-generic-password -w`, which triggers a password dialog.
This was called preemptively at startup (InitializeAsync) and on every
auth polling cycle (every 10s when auth fails). Combined with static
token expiration (~1h), this created recurring password prompts.

Changes:
- Split ResolveGitHubTokenForServer into ResolveGitHubTokenFromEnv
  (safe, no prompt) and full version (with Keychain, may prompt)
- InitializeAsync: only check env vars at startup, no Keychain read
- CheckAuthStatusAsync: on first auth failure, lazily resolve full
  token chain (including Keychain) and auto-restart server with it
- Auth polling loop: use cached token only, never re-read Keychain
  (was calling ResolveGitHubTokenForServer on every detection cycle)
- ReauthenticateAsync: unchanged — explicit user action, prompt OK
- Add auth-token-safety skill with 9 invariants from PR #446

This ensures:
- Users whose server self-authenticates: zero password prompts
- Users with Keychain ACL issue: one prompt on first failure, cached
- No hourly re-prompting from polling or token expiration cycles

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- TryRecoverPersistentServerAsync: clear _resolvedGitHubToken before
  restart so CheckAuthStatusAsync treats next auth failure as first-time
  and triggers lazy Keychain resolution (INV-A3)
- CheckAuthStatusAsync: call FetchGitHubUserInfoAsync after successful
  lazy restart (matches polling and ReauthenticateAsync paths)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

🔍 Multi-Model Code Review — PR #456 v2 (2 commits)

Models: Claude Opus 4.6 · Claude Sonnet 4.6 · GPT-5.3-Codex
CI: ⚠️ No checks configured
Prior comments: None


🔴 CRITICAL — Lazy-resolved token is immediately discarded by TryRecoverPersistentServerAsync

Flagged by: Opus 🔴 · Sonnet 🟡 · GPT 🔴 (3/3)

Files: CopilotService.Utilities.cs ~L880 + CopilotService.cs ~L1304

In CheckAuthStatusAsync, the lazy path resolves a token (potentially triggering a Keychain dialog), sets _resolvedGitHubToken = fullToken, then calls TryRecoverPersistentServerAsync(). But commit 2 added _resolvedGitHubToken = null at the top of TryRecoverPersistentServerAsync, which immediately discards the just-resolved token:

// CheckAuthStatusAsync lazy path:
_resolvedGitHubToken = fullToken;              // ← set to freshly-resolved token
var recovered = await TryRecoverPersistentServerAsync();
// Inside TryRecoverPersistentServerAsync:
_resolvedGitHubToken = null;                   // ← discards it!
var started = await _serverManager.StartServerAsync(settings.Port, _resolvedGitHubToken); // always null

Impact: The server always starts with a null token. For users who need token forwarding (Keychain ACL issue), the lazy resolution triggers a password dialog but the token is thrown away. The feature from commit 1 is effectively broken by commit 2.

Additionally, ReauthenticateAsync suffers a double-dialog regression:

  1. User clicks Re-authenticate → resolves token (dialog Polish UI, Rename Sessions, Markdown Output Support, Queued Messages #1)
  2. TryRecoverPersistentServerAsync clears token → server starts without it
  3. CheckAuthStatusAsync sees _resolvedGitHubToken == null → lazy resolve fires → dialog Add mobile sidebar top-bar & flyout UI #2
  4. Token is discarded again → user is still not authenticated after 2 dialogs

Suggested fix: Save the token before clearing and pass it to StartServerAsync:

var tokenToForward = _resolvedGitHubToken;
_resolvedGitHubToken = null;  // clear for future lazy resolution
var started = await _serverManager.StartServerAsync(settings.Port, tokenToForward);

🟡 MODERATE — _resolvedGitHubToken unsynchronized across threads

Flagged by: Opus 🟡 · Sonnet 🟡 · GPT 🟡 (3/3)

File: CopilotService.Utilities.cs ~L877 (lazy-init block)

The check-then-act on _resolvedGitHubToken is not atomic. Two concurrent auth failures can both pass the == null check, both perform the expensive Keychain read (double dialog), and both race to set the field. The _recoveryLock inside TryRecoverPersistentServerAsync serializes recovery but the Keychain read happens before the lock.

String reference assignment is atomic in .NET, but without volatile, CPU caches may serve stale values. Consider Interlocked.CompareExchange or a SemaphoreSlim around the lazy-init block.

Practical risk: Low — CheckAuthStatusAsync is typically called from a single path. But the field is also written from ReauthenticateAsync (UI thread) and read from the polling timer (background thread).


🟢 MINOR — Test is a tautology (always passes)

Flagged by: Opus 🟡 · Sonnet 🟢 (2/3)

File: ServerRecoveryTests.cs

Assert.True(token == null || token.Length > 0);

Since ResolveGitHubTokenFromEnv filters with !string.IsNullOrWhiteSpace, it can never return an empty string. The assertion is always true — it validates nothing. In CI where GH_TOKEN is set, the method returns a real token without testing the null path. Should either clear env vars and assert null, or set a test token and verify it's returned.


✅ Non-Issues (confirmed safe)

  • Infinite recursion: TryRecoverPersistentServerAsync does not call CheckAuthStatusAsync. No cycle possible. (3/3 agree)
  • _client! after recovery: When recovered == true, _client is guaranteed set (line ~1337). The try-catch guards edge cases. (2/3 agree safe; Sonnet noted a theoretical TOCTOU but did not meet consensus threshold)
  • Polling with null token: Polling checks server-side status.IsAuthenticated, not the local token. Correct behavior.

📋 Test Coverage

Path Tested?
ResolveGitHubTokenFromEnv basic ⚠️ Tautological test
Lazy resolution in CheckAuthStatusAsync ❌ Not tested
Token clearing in TryRecoverPersistentServerAsync ❌ Not tested
ReauthenticateAsync → double dialog scenario ❌ Not tested

Verdict: ⚠️ Request Changes

The CRITICAL finding (commit 2 discards lazy-resolved tokens) directly contradicts the PR's goal of "one dialog, then cached." The fix is small (save token to local before clearing), but without it the token forwarding mechanism is broken for Keychain ACL issue users.

The MODERATE threading concern is a lower priority — it's a pre-existing pattern, not a regression from this PR.

PureWeen and others added 3 commits March 30, 2026 10:07
…rAsync

The previous commit cleared _resolvedGitHubToken=null before passing it
to StartServerAsync, which discarded any freshly-resolved token from
CheckAuthStatusAsync (lazy path) or ReauthenticateAsync. This caused:
- Lazy-resolved token thrown away → server always starts with null
- ReauthenticateAsync double-dialog: resolve token → cleared → server
  unauthenticated → CheckAuthStatusAsync resolves again → 2nd prompt

Fix: save to local variable before clearing. Callers that resolved a
fresh token forward it; watchdog callers (where token was already null)
get identical behavior. The field is still cleared for future lazy
resolution on token expiry.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. SemaphoreSlim(1,1) around lazy token resolution in CheckAuthStatusAsync
   with try-enter + double-check pattern. Prevents concurrent auth failures
   from both triggering Keychain dialogs.

2. Watchdog and health-check recovery callers now call CheckAuthStatusAsync
   after TryRecoverPersistentServerAsync succeeds. Prevents silently
   unauthenticated state when server starts without a token and can't
   self-authenticate. Placed in callers (not inside recovery method) to
   avoid re-entrancy with the lazy resolution path.

3. Replace tautological env-var tests with isolated assertions:
   - ResolveGitHubTokenFromEnv_ReturnsNull: saves/clears/restores env vars
   - ResolveGitHubTokenFromEnv_ReturnsToken: sets and verifies value
   - ResolveGitHubTokenFromEnv_RespectsPrecedence: COPILOT_GITHUB_TOKEN
     wins over GH_TOKEN

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…history

Add three new invariants discovered during PR #456:
- INV-A10: SemaphoreSlim thread-safe lazy resolution (double-dialog prevention)
- INV-A11: CheckAuthStatusAsync must be called after recovery in callers
- INV-A12: Save-then-clear token pattern in TryRecoverPersistentServerAsync

Add regression patterns 5-6. Update PR history with PR #456 details.
Update description triggers with new method/field names.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

🔍 Multi-Model Code Review v3 — PR #456 (5 commits)

Models: Claude Opus 4.6 · Claude Sonnet 4.6 · GPT-5.3-Codex
CI: ⚠️ No checks configured
Prior comments: 1 (v2 review)


v2 Findings Status

v2 Finding Status
🔴 Token discarded by TryRecoverPersistentServerAsync Fixed (commit 3: tokenToForward local) — 3/3 confirm
🟡 _resolvedGitHubToken unsynchronized Fixed (commit 4: SemaphoreSlim + double-check locking) — 3/3 confirm
🟢 Tautological test Fixed (commit 4: 3 real tests with env var isolation) — 3/3 confirm

New Findings

🟡 MODERATE — Fire-and-forget _ = CheckAuthStatusAsync() loses exceptions

Flagged by: Opus 🟡 · Sonnet 🟢 · GPT 🟡 (3/3)

Files: CopilotService.cs ~L747 (health check) · CopilotService.Events.cs ~L2504 (watchdog)

var recovered = await TryRecoverPersistentServerAsync();
if (recovered) _ = CheckAuthStatusAsync();  // exceptions silently lost

Since CheckAuthStatusAsync is async, all its exceptions are captured in the returned Task — the outer catch (Exception recoverEx) never sees them. If the semaphore is disposed during shutdown, ObjectDisposedException is silently swallowed.

Suggested fix: await instead of _ = so exceptions flow into the existing catch block. The outer Task.Run is already fire-and-forget, so await doesn't block any caller:

var recovered = await TryRecoverPersistentServerAsync();
if (recovered) await CheckAuthStatusAsync();

Note: CheckAuthStatusAsync does have its own top-level try-catch (confirmed in source), so this is defense-in-depth rather than a crash risk. But the pattern is inconsistent with how the service handles other fire-and-forget paths and would silently swallow unexpected errors (e.g., ObjectDisposedException during shutdown).


🟢 MINOR — Env var tests not parallel-safe

Flagged by: Opus 🟢 · Sonnet 🟡 (2/3)

File: ServerRecoveryTests.cs

The 3 new tests use Environment.SetEnvironmentVariable (process-global). xUnit runs test classes in parallel by default. If any other test class reads COPILOT_GITHUB_TOKEN/GH_TOKEN/GITHUB_TOKEN concurrently, it could see test values.

Suggested fix: Add [Collection("BaseDir")] (already used by other test classes in this project) to serialize.


✅ Confirmed Clean

  • Double-check locking (WaitAsync(0) + inner null check): Correct pattern. Non-blocking skip is intentional; loser falls through to polling.
  • Token clear-and-forward in TryRecoverPersistentServerAsync: tokenToForward local prevents use-after-clear.
  • ReauthenticateAsync double-dialog: Only triggers when the first token was bad (server still not authenticated). This is correct — a fresh Keychain read is the right behavior when the previous token failed.
  • ResolveGitHubTokenFromEnv at startup: Eliminates hourly prompts while preserving lazy resolution for users who need Keychain tokens.
  • _tokenResolutionLock disposal: Correctly added alongside _recoveryLock.Dispose().

📋 Test Coverage

Path Tested?
ResolveGitHubTokenFromEnv null return
ResolveGitHubTokenFromEnv token return
ResolveGitHubTokenFromEnv precedence
Lazy resolution in CheckAuthStatusAsync ❌ (integration-level, hard to unit test)
Token forwarding in TryRecoverPersistentServerAsync ❌ (integration-level)

Verdict: ✅ Approve

All 3 v2 findings are resolved. The remaining MODERATE (fire-and-forget exception swallowing) is low-risk because CheckAuthStatusAsync already has its own try-catch, but would be cleaner as await. The MINOR test parallelism issue is a flake risk, not a correctness issue. Overall this is a well-reasoned fix for a real user-impacting regression.

@PureWeen PureWeen merged commit 44c996a into main Mar 30, 2026
@PureWeen PureWeen deleted the fix/auth-keychain-lazy branch March 30, 2026 15:49
@PureWeen
Copy link
Copy Markdown
Owner Author

🔬 Fix Verification Analysis — PR #456

Question: Does this PR actually fix the hourly macOS Keychain password prompts?

Answer: YES ✅ — with one nuance for Keychain-only users.


Path-by-Path BEFORE vs AFTER

1. InitializeAsync (every app launch)

BEFORE AFTER
Calls ResolveGitHubTokenForServer() ResolveGitHubTokenFromEnv()
Keychain read? YES — every launch 🔴 NO
Dialog? Every launch Never

2. Polling loop (every 10s when not authenticated)

BEFORE AFTER
Calls ResolveGitHubTokenForServer() on auth detection (removed — uses cached token)
Keychain read? YES — every 10s cycle 🔴 NO
Dialog? Every 10s if dialog times out Never

3. CheckAuthStatusAsync lazy resolution (NEW)

BEFORE AFTER
Exists? No Yes
Fires when? N/A First auth failure AND _resolvedGitHubToken == null
Keychain read? N/A YES — once, guarded by SemaphoreSlim
Token forwarded? N/A ✅ Saved to tokenToForwardStartServerAsync

4. ReauthenticateAsync (user clicks button)

BEFORE AFTER
Keychain read? Yes Yes (unchanged — correct, explicit user action) ✅
Token forwarded? Directly via _resolvedGitHubToken Via tokenToForward local (save-then-clear pattern) ✅

Critical Scenario: Keychain-Only User (no env vars)

  1. Startup: ResolveGitHubTokenFromEnv() → null. Server starts without token.
  2. Server native auth fails (Keychain ACL restriction)
  3. CheckAuthStatusAsync → detects auth failure → lazy block fires
  4. One Keychain dialog → user clicks Allow → token cached in _resolvedGitHubToken
  5. TryRecoverPersistentServerAsynctokenToForward = token → clears field → StartServerAsync(port, tokenToForward)server gets the token
  6. Server authenticated → user is in ✅

Token Expiry Scenario (~1 hour later)

  1. Token expires → watchdog fires → TryRecoverPersistentServerAsync (clears field)
  2. CheckAuthStatusAsync → sees _resolvedGitHubToken == null → lazy block fires → one more dialog
  3. Fresh token → restart → authenticated again

This means Keychain-only users get ~1 dialog per token expiry (roughly hourly). However, this is a massive improvement:

Scenario BEFORE AFTER
Startup dialogs 1 per launch 0
Polling dialogs up to 360/hour (every 10s) 0
Expiry dialogs 1/hour + polling cascade ~1/hour (lazy only)
Total per hour ~362 dialogs ~1 dialog

The only way to eliminate the hourly dialog entirely would be to persist the Keychain token to disk across restarts, which is a separate enhancement.


Verdict

YES — this PR fixes the original issue. The primary bugs (preemptive startup read + polling loop re-read) are eliminated. The hourly token expiry path is reduced from a cascade of hundreds of dialogs to a single lazy resolution. Manual re-auth via the button continues to work correctly.

PureWeen added a commit that referenced this pull request Apr 1, 2026
## Problem

Every couple of hours, the user gets prompted to "allow copilot-cli" and
must enter their macOS login password ~5 times. This is a regression
from PR #456 — that fix addressed the 10-second polling loop but missed
a second cache-clearing path.

## Root Cause

`TryRecoverPersistentServerAsync()` clears `_resolvedGitHubToken = null`
on every recovery cycle (line 1339). Recovery is triggered by:
- Watchdog timeouts (token expiry ~1-8h)
- Wake/sleep health check failures
- Auth polling success → server restart

After clearing, the next `CheckAuthStatusAsync()` call sees
`_resolvedGitHubToken == null` → enters the lazy Keychain resolution
path → spawns `security find-generic-password -s copilot-cli -w` →
**macOS password dialog**.

**Re-reading the Keychain is useless** — the stored token is a static
snapshot from `copilot login`. If it expired, re-reading returns the
same expired token. Only running `copilot login` again would write a new
one.

## Fix

Remove the `_resolvedGitHubToken = null` from
`TryRecoverPersistentServerAsync`. The cached token is still forwarded
to the new server via `tokenToForward`. Only two paths now clear the
cache:
- `ReconnectAsync` — explicit user action (settings change)
- `ReauthenticateAsync` — explicit user action (Re-authenticate button)

When the token expires, the auth banner appears and the user clicks
Re-authenticate → fresh Keychain read (correct — user-initiated).

## Changes

| File | Change |
|------|--------|
| `CopilotService.cs` | Remove `_resolvedGitHubToken = null` from
recovery path |
| `auth-token-safety/SKILL.md` | Update INV-A3 to reflect new invariant
|

## Tests
3064/3064 pass ✅

## Related
- Fixes regression from #456 (which fixed regression from #446)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Apr 1, 2026
## Problem

PR #446 added Keychain-reading code (TryReadCopilotKeychainToken,
ResolveGitHubTokenForServer, RunProcessWithTimeout) to help users
whose headless server could not self-authenticate. This caused a
4-PR regression chain (#446#456#462#463):

1. Each /usr/bin/security call triggers a macOS password dialog
   (PolyPilot is not in the copilot-cli Keychain ACL)
2. TryReadCopilotKeychainToken tried 3 service names sequentially,
   each spawning a separate dialog (3× password prompts per call)
3. Clicking Allow/Deny rewrote the Keychain ACL, breaking the
   server own native keytar access
4. The server fell back to its own /usr/bin/security calls, creating
   a spiral of recurring password prompts every 1-2 hours

## Root Cause Analysis

The headless copilot server authenticates on its own at startup via
its native credential store. This has worked reliably across dozens
of worktree switches (different binary paths each time). The original
PR #446 user issue ("Session was not created with authentication
info") was a server auth loss that should have been solved by
restarting the server — which TryRecoverPersistentServerAsync already
does — not by reading the Keychain from the UI process.

## Changes

- Remove ResolveGitHubTokenForServer (Keychain + gh CLI reads)
- Remove TryReadCopilotKeychainToken (3-service-name loop)
- Remove RunProcessWithTimeout (only used by above)
- Remove _tokenResolutionLock SemaphoreSlim (guarded lazy path)
- Remove lazy Keychain resolution path in CheckAuthStatusAsync
- Remove sentinel logic (_resolvedGitHubToken ??= string.Empty)
- Simplify ReauthenticateAsync to just restart the server
- Keep ResolveGitHubTokenFromEnv (env vars are safe, no prompts)
- Keep auth banner + Re-authenticate button (correct UX)
- Rewrite auth-token-safety skill doc with new invariant
- Remove 7 tests for deleted methods, keep 3 env var tests

## For users who cannot self-authenticate

The auth banner says: "run copilot login in a terminal, then click
Re-authenticate." Re-authenticate restarts the server, which picks
up the fresh credentials. This was the original PR #446 design before
Keychain code was added during review rounds.

Tests: 3057/3057 pass ✅

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Apr 1, 2026
## Problem

PR #446 added Keychain-reading code (TryReadCopilotKeychainToken,
ResolveGitHubTokenForServer, RunProcessWithTimeout) to help users
whose headless server could not self-authenticate. This caused a
4-PR regression chain (#446#456#462#463):

1. Each /usr/bin/security call triggers a macOS password dialog
   (PolyPilot is not in the copilot-cli Keychain ACL)
2. TryReadCopilotKeychainToken tried 3 service names sequentially,
   each spawning a separate dialog (3× password prompts per call)
3. Clicking Allow/Deny rewrote the Keychain ACL, breaking the
   server own native keytar access
4. The server fell back to its own /usr/bin/security calls, creating
   a spiral of recurring password prompts every 1-2 hours

## Root Cause Analysis

The headless copilot server authenticates on its own at startup via
its native credential store. This has worked reliably across dozens
of worktree switches (different binary paths each time). The original
PR #446 user issue ("Session was not created with authentication
info") was a server auth loss that should have been solved by
restarting the server — which TryRecoverPersistentServerAsync already
does — not by reading the Keychain from the UI process.

## Changes

- Remove ResolveGitHubTokenForServer (Keychain + gh CLI reads)
- Remove TryReadCopilotKeychainToken (3-service-name loop)
- Remove RunProcessWithTimeout (only used by above)
- Remove _tokenResolutionLock SemaphoreSlim (guarded lazy path)
- Remove lazy Keychain resolution path in CheckAuthStatusAsync
- Remove sentinel logic (_resolvedGitHubToken ??= string.Empty)
- Simplify ReauthenticateAsync to just restart the server
- Keep ResolveGitHubTokenFromEnv (env vars are safe, no prompts)
- Keep auth banner + Re-authenticate button (correct UX)
- Rewrite auth-token-safety skill doc with new invariant
- Remove 7 tests for deleted methods, keep 3 env var tests

## For users who cannot self-authenticate

The auth banner says: "run copilot login in a terminal, then click
Re-authenticate." Re-authenticate restarts the server, which picks
up the fresh credentials. This was the original PR #446 design before
Keychain code was added during review rounds.

Tests: 3057/3057 pass ✅

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Apr 1, 2026
## Problem

PR #446 added Keychain-reading code to help users whose headless server
couldn't self-authenticate. This caused a **4-PR regression chain**
(#446#456#462#463) of recurring macOS password dialogs:

1. `TryReadCopilotKeychainToken` spawned `/usr/bin/security
find-generic-password` for 3 service names sequentially — **3 password
dialogs per call**
2. Clicking Allow/Deny on those dialogs **rewrote the Keychain ACL**,
breaking the server's own native keytar access
3. The server fell back to its own `/usr/bin/security` calls → **more
dialogs every 1-2 hours**
4. PRs #456, #462, #463 each fixed one trigger path but the core
approach was wrong

## Root Cause

The headless copilot server **authenticates on its own** at startup via
its native credential store. This has worked reliably across dozens of
worktree switches (different binary paths each time) without any
Keychain intervention from PolyPilot.

The original PR #446 user issue ("Session was not created with
authentication info") was a server auth loss that should have been
solved by **restarting the server** — which
`TryRecoverPersistentServerAsync` already does — not by reading the
Keychain from the UI process.

## Fix

Remove all Keychain-reading code entirely (**-612 lines, +42 lines**):

| Removed | Why |
|---------|-----|
| `ResolveGitHubTokenForServer()` | Keychain + gh CLI reads — triggers
password dialogs |
| `TryReadCopilotKeychainToken()` | 3-service-name loop — 3× password
dialogs per call |
| `RunProcessWithTimeout()` | Only used by above |
| `_tokenResolutionLock` SemaphoreSlim | Guarded the lazy Keychain path
|
| Lazy resolution path in `CheckAuthStatusAsync` | The whole Keychain
auto-resolution mechanism |
| Sentinel logic (`_resolvedGitHubToken ??= ""`) | No longer needed
without the lazy path |

| Kept | Why |
|------|-----|
| `ResolveGitHubTokenFromEnv()` | Env vars are safe, no prompts |
| `CheckAuthStatusAsync` + auth banner | Correct — shows "run copilot
login" guidance |
| `TryRecoverPersistentServerAsync` | Correct — restarts server which
re-authenticates on its own |
| Re-authenticate button | Correct — restarts server to pick up fresh
`copilot login` credentials |

## For users who cannot self-authenticate

The auth banner says: _"run `copilot login` in a terminal, then click
Re-authenticate."_ This was the **original PR #446 design** before
Keychain code was added during review rounds.

## Timeline of the regression

| PR | What it did | What went wrong |
|----|-------------|-----------------|
| #446 | Added Keychain reads to forward token to server | Triggered
password dialogs; corrupted Keychain ACL |
| #456 | Made Keychain reads lazy (only on first auth failure) | Still
fired on every server recovery cycle |
| #462 | Stopped recovery from clearing the token cache | Token was
never set in the first place (no env var) |
| #463 | Added sentinel on auth success | Server's own internal Keychain
reads still fired |
| **#465** | **Removed all Keychain code** | **The right fix — server
self-authenticates** |

## Tests
3057/3057 pass ✅ (7 tests for deleted methods removed, 3 env var tests
kept)

## Why server self-authentication works

The copilot headless server binary bundled inside PolyPilot.app and the
Homebrew `copilot` binary are both signed with the **same GitHub
Developer ID certificate**. macOS Keychain ACLs use code-signing
identity (not binary path) to control access, so:

- `copilot login` (Homebrew binary) writes the token to Keychain with an
ACL scoped to the GitHub Developer ID
- `copilot --headless` (bundled binary) reads the token via keytar —
**same Developer ID = no password prompt**
- PolyPilot's `/usr/bin/security` calls used a **different signer**
(Apple's built-in tool), which triggered the ACL mismatch and the
password dialogs
- Clicking Allow/Deny on those dialogs **rewrote the ACL**, breaking the
server's native keytar access — creating the regression spiral

This was verified via `codesign -dvvv` on both binaries — they share the
same Identifier, Authority chain, and team certificate. The Keychain
entry's partition list (visible in `securityd` logs) confirms it uses
team-id-based access control, not path-based.

**Conclusion:** The server has always been able to self-authenticate.
The only thing that broke it was PolyPilot calling `/usr/bin/security`,
which used a different code-signing identity and corrupted the ACL.
Removing those calls is the correct fix.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
arisng pushed a commit to arisng/PolyPilot that referenced this pull request Apr 4, 2026
## Problem

After PR PureWeen#446 merged, a user reported: **"PP now restarts every hour or
so, and asks for my login password to authenticate to copilot-cli"**

## Root Cause Analysis

Three interacting bugs create a recurring password prompt cycle:

### Bug 1: Preemptive Keychain read at startup
`InitializeAsync` (line 932) called `ResolveGitHubTokenForServer()`
unconditionally in Persistent mode. This runs `security
find-generic-password -s "copilot-cli" -w`, which triggers a **macOS
Keychain ACL password dialog** because PolyPilot is not in the ACL for
the `copilot-cli` entry (only the terminal `copilot` binary that created
it is). Every user saw this dialog on every app launch — even users
whose server could self-authenticate fine.

### Bug 2: Polling loop re-reads Keychain every 10 seconds
The auth polling loop at `Utilities.cs:916` called
`ResolveGitHubTokenForServer()` whenever it detected auth went from
false → true. Since polling runs every 10s, this means:
- If the Keychain dialog **times out** (3s `RunProcessWithTimeout`),
token comes back `null` → server starts without auth → polling detects
no auth → **another dialog in 10 seconds** (tight loop)
- If the user clicks Allow, the token is static and will expire

### Bug 3: Static token expiration creates hourly cycle
The `gho_*` OAuth token forwarded as `COPILOT_GITHUB_TOKEN` env var is a
**static snapshot**. The copilot CLI normally refreshes tokens via
keytar.node, but an env var bypasses that refresh mechanism. When the
token expires (~1h, aligned with `WatchdogMaxProcessingTimeSeconds =
3600`):
1. All sessions fail → watchdog fires after 2 consecutive timeouts
2. `TryRecoverPersistentServerAsync` restarts server with same stale
cached token
3. Server still unauthenticated → `CheckAuthStatusAsync` starts polling
4. Polling detects auth → calls `ResolveGitHubTokenForServer()` →
**Keychain dialog again**

Additionally, the `ResolveGitHubTokenForServer()` call in the polling
loop was the **one call site not wrapped in `Task.Run`** (the R5/R7
fixes covered `InitializeAsync` and `ReauthenticateAsync` but missed
this third site).

## Design Decisions

### Why not just remove Keychain reads entirely?
**Because it would regress the original PR PureWeen#446 fix.** The original
issue was users getting "Session was not created with authentication
info or custom provider" because the headless server binary cannot
access the Keychain (macOS ACL restriction — different binary path). If
we remove Keychain reads, those users have no way to authenticate —
`copilot login` writes to Keychain under the terminal binary's ACL, so
the headless server still can't read it. The Keychain read is the **only
fix** for users without `gh` CLI or env vars set.

### Why split into `ResolveGitHubTokenFromEnv` vs
`ResolveGitHubTokenForServer`?
To make the safety boundary explicit in the API:
- `ResolveGitHubTokenFromEnv()` — **always safe**, no subprocess, no
prompt, instant. Used at startup.
- `ResolveGitHubTokenForServer()` — **dangerous**, may trigger Keychain
dialog. Only called on explicit user action or after confirmed auth
failure.

The doc comments on `ResolveGitHubTokenForServer` now include a ⚠️
warning and reference the skill invariants (INV-A1, INV-A2) to prevent
future agents from calling it preemptively.

### Why lazy resolution in `CheckAuthStatusAsync` instead of just
showing the banner?
For users whose server genuinely can't self-authenticate (the original
PR PureWeen#446 users), showing a banner and telling them to run `copilot login`
doesn't help — they've already done that. The Keychain entry exists; the
server just can't read it. Lazy resolution means:
1. Server starts without token (no prompt)
2. `CheckAuthStatusAsync` detects auth failure
3. **First time only** (`_resolvedGitHubToken == null`): resolves full
chain including Keychain → one password dialog → caches token → restarts
server
4. If the restart works → user is authenticated with zero manual
intervention after the one dialog
5. On subsequent failures (token expiry): uses cached token, **no new
dialog**

### Why does the polling loop no longer re-resolve?
The polling loop runs every 10 seconds. Re-reading the Keychain on every
cycle means a password dialog every 10 seconds if the user doesn't
respond. Instead, the polling loop now uses `_resolvedGitHubToken`
(already cached from the lazy resolution or from `ReauthenticateAsync`).
Only the explicit "Re-authenticate" button triggers a fresh Keychain
read — that's an intentional user action where a password prompt is
expected.

## Changes

| File | Change |
|------|--------|
| `CopilotService.Utilities.cs` | Split `ResolveGitHubTokenForServer`
into env-only (`ResolveGitHubTokenFromEnv`) and full version |
| `CopilotService.Utilities.cs` | `CheckAuthStatusAsync`: lazy
full-chain resolve on first auth failure, auto-restart with token |
| `CopilotService.Utilities.cs` | Polling loop: use cached token only,
removed `ResolveGitHubTokenForServer()` call |
| `CopilotService.cs` | `InitializeAsync`: use
`ResolveGitHubTokenFromEnv()` (no Keychain, no prompt) |
| `ServerRecoveryTests.cs` | Add test for `ResolveGitHubTokenFromEnv` |
| `.claude/skills/auth-token-safety/SKILL.md` | New skill: 9 invariants
from PR PureWeen#446 regression analysis |

## Behavior Matrix

| User type | Before (PR PureWeen#446) | After (this PR) |
|-----------|------------------|-----------------|
| Server self-authenticates | ❌ Password dialog at every startup | ✅
Zero prompts |
| Keychain ACL issue, first launch | ❌ Password dialog at startup | ✅
One dialog after auth failure detected |
| Keychain ACL issue, token expires | ❌ Password dialog every hour | ✅
No dialog (uses cached token) |
| Keychain dialog timeout (3s) | ❌ Dialog every 10s (tight loop) | ✅ No
loop (polling uses cache) |
| User clicks Re-authenticate | ✅ Fresh Keychain read | ✅ Fresh Keychain
read (unchanged) |
| Has GH_TOKEN env var | ✅ No dialog | ✅ No dialog (env-only at startup)
|

## Tests
- 3059/3059 pass ✅
- New test: `ResolveGitHubTokenFromEnv_ReturnsNull_WhenNoEnvVarsSet`

## Related
- Fixes regression from PureWeen#446
- Adds `.claude/skills/auth-token-safety/SKILL.md` with 9 invariants to
prevent future regressions

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
arisng pushed a commit to arisng/PolyPilot that referenced this pull request Apr 4, 2026
…eWeen#462)

## Problem

Every couple of hours, the user gets prompted to "allow copilot-cli" and
must enter their macOS login password ~5 times. This is a regression
from PR PureWeen#456 — that fix addressed the 10-second polling loop but missed
a second cache-clearing path.

## Root Cause

`TryRecoverPersistentServerAsync()` clears `_resolvedGitHubToken = null`
on every recovery cycle (line 1339). Recovery is triggered by:
- Watchdog timeouts (token expiry ~1-8h)
- Wake/sleep health check failures
- Auth polling success → server restart

After clearing, the next `CheckAuthStatusAsync()` call sees
`_resolvedGitHubToken == null` → enters the lazy Keychain resolution
path → spawns `security find-generic-password -s copilot-cli -w` →
**macOS password dialog**.

**Re-reading the Keychain is useless** — the stored token is a static
snapshot from `copilot login`. If it expired, re-reading returns the
same expired token. Only running `copilot login` again would write a new
one.

## Fix

Remove the `_resolvedGitHubToken = null` from
`TryRecoverPersistentServerAsync`. The cached token is still forwarded
to the new server via `tokenToForward`. Only two paths now clear the
cache:
- `ReconnectAsync` — explicit user action (settings change)
- `ReauthenticateAsync` — explicit user action (Re-authenticate button)

When the token expires, the auth banner appears and the user clicks
Re-authenticate → fresh Keychain read (correct — user-initiated).

## Changes

| File | Change |
|------|--------|
| `CopilotService.cs` | Remove `_resolvedGitHubToken = null` from
recovery path |
| `auth-token-safety/SKILL.md` | Update INV-A3 to reflect new invariant
|

## Tests
3064/3064 pass ✅

## Related
- Fixes regression from PureWeen#456 (which fixed regression from PureWeen#446)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
arisng pushed a commit to arisng/PolyPilot that referenced this pull request Apr 4, 2026
…eWeen#465)

## Problem

PR PureWeen#446 added Keychain-reading code to help users whose headless server
couldn't self-authenticate. This caused a **4-PR regression chain**
(PureWeen#446PureWeen#456PureWeen#462PureWeen#463) of recurring macOS password dialogs:

1. `TryReadCopilotKeychainToken` spawned `/usr/bin/security
find-generic-password` for 3 service names sequentially — **3 password
dialogs per call**
2. Clicking Allow/Deny on those dialogs **rewrote the Keychain ACL**,
breaking the server's own native keytar access
3. The server fell back to its own `/usr/bin/security` calls → **more
dialogs every 1-2 hours**
4. PRs PureWeen#456, PureWeen#462, PureWeen#463 each fixed one trigger path but the core
approach was wrong

## Root Cause

The headless copilot server **authenticates on its own** at startup via
its native credential store. This has worked reliably across dozens of
worktree switches (different binary paths each time) without any
Keychain intervention from PolyPilot.

The original PR PureWeen#446 user issue ("Session was not created with
authentication info") was a server auth loss that should have been
solved by **restarting the server** — which
`TryRecoverPersistentServerAsync` already does — not by reading the
Keychain from the UI process.

## Fix

Remove all Keychain-reading code entirely (**-612 lines, +42 lines**):

| Removed | Why |
|---------|-----|
| `ResolveGitHubTokenForServer()` | Keychain + gh CLI reads — triggers
password dialogs |
| `TryReadCopilotKeychainToken()` | 3-service-name loop — 3× password
dialogs per call |
| `RunProcessWithTimeout()` | Only used by above |
| `_tokenResolutionLock` SemaphoreSlim | Guarded the lazy Keychain path
|
| Lazy resolution path in `CheckAuthStatusAsync` | The whole Keychain
auto-resolution mechanism |
| Sentinel logic (`_resolvedGitHubToken ??= ""`) | No longer needed
without the lazy path |

| Kept | Why |
|------|-----|
| `ResolveGitHubTokenFromEnv()` | Env vars are safe, no prompts |
| `CheckAuthStatusAsync` + auth banner | Correct — shows "run copilot
login" guidance |
| `TryRecoverPersistentServerAsync` | Correct — restarts server which
re-authenticates on its own |
| Re-authenticate button | Correct — restarts server to pick up fresh
`copilot login` credentials |

## For users who cannot self-authenticate

The auth banner says: _"run `copilot login` in a terminal, then click
Re-authenticate."_ This was the **original PR PureWeen#446 design** before
Keychain code was added during review rounds.

## Timeline of the regression

| PR | What it did | What went wrong |
|----|-------------|-----------------|
| PureWeen#446 | Added Keychain reads to forward token to server | Triggered
password dialogs; corrupted Keychain ACL |
| PureWeen#456 | Made Keychain reads lazy (only on first auth failure) | Still
fired on every server recovery cycle |
| PureWeen#462 | Stopped recovery from clearing the token cache | Token was
never set in the first place (no env var) |
| PureWeen#463 | Added sentinel on auth success | Server's own internal Keychain
reads still fired |
| **PureWeen#465** | **Removed all Keychain code** | **The right fix — server
self-authenticates** |

## Tests
3057/3057 pass ✅ (7 tests for deleted methods removed, 3 env var tests
kept)

## Why server self-authentication works

The copilot headless server binary bundled inside PolyPilot.app and the
Homebrew `copilot` binary are both signed with the **same GitHub
Developer ID certificate**. macOS Keychain ACLs use code-signing
identity (not binary path) to control access, so:

- `copilot login` (Homebrew binary) writes the token to Keychain with an
ACL scoped to the GitHub Developer ID
- `copilot --headless` (bundled binary) reads the token via keytar —
**same Developer ID = no password prompt**
- PolyPilot's `/usr/bin/security` calls used a **different signer**
(Apple's built-in tool), which triggered the ACL mismatch and the
password dialogs
- Clicking Allow/Deny on those dialogs **rewrote the ACL**, breaking the
server's native keytar access — creating the regression spiral

This was verified via `codesign -dvvv` on both binaries — they share the
same Identifier, Authority chain, and team certificate. The Keychain
entry's partition list (visible in `securityd` logs) confirms it uses
team-id-based access control, not path-based.

**Conclusion:** The server has always been able to self-authenticate.
The only thing that broke it was PolyPilot calling `/usr/bin/security`,
which used a different code-signing identity and corrupted the ACL.
Removing those calls is the correct fix.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant