Add keepalive ping to prevent server idle timeout (#396)#402
Conversation
The headless Copilot server kills sessions after ~35 minutes of inactivity. This causes multi-agent workers to lose their work mid-execution when tool runs take a long time. Add a background keepalive loop that pings the server every 15 minutes via CopilotClient.PingAsync(). This resets the server's connection-level idle timer and prevents session cleanup. The keepalive: - Starts after InitializeAsync and ReconnectAsync (non-demo/remote) - Stops on ReconnectAsync teardown and DisposeAsync - Skips pings in Demo/Remote mode (no headless server) - Logs [KEEPALIVE] to the diagnostic log for traceability - Catches all exceptions to avoid crashing the app Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR #402 Review — Add keepalive ping to prevent server idle timeoutCI: No CI configured | Tests: ✅ 2716 passed, 0 failed | Existing reviews: None Methodology5-model parallel review (claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex). Consensus filter: only findings flagged by 2+ models included. 🟡 MODERATE — Two server-restart paths missing keepalive lifecycle (5/5 and 4/5 consensus)
Disposes the old client at line 1118 and creates a new one at line 1120, but never calls Fix: // Before line 1095 (_serverManager.StopServer())
StopKeepalivePing();
// After line 1128 (_lastRecoveryCompletedAt = DateTime.UtcNow)
StartKeepalivePing();
Same issue. Disposes the old client at line 1182 and creates a new one at line 1210, but neither Fix: // After line 1164 (ServerHealthNotice = null;)
StopKeepalivePing();
// After line 1233 (ReconcileOrganization())
StartKeepalivePing();🟢 MINOR —
|
| Finding | Severity | Consensus | File:Line |
|---|---|---|---|
TryRecoverPersistentServerAsync missing stop/start |
🟡 MODERATE | 4/5 | CopilotService.cs:1069 |
RestartServerAsync missing stop/start |
🟡 MODERATE | 5/5 | CopilotService.cs:1150 |
StartKeepalivePing non-atomic write |
🟢 MINOR | 3/5 | CopilotService.cs:531 |
| No keepalive tests | 🟢 MINOR | 3/5 | PolyPilot.Tests/ |
The core logic of the PR (loop design, cancellation, error handling, call sites in InitializeAsync/ReconnectAsync/DisposeAsync) is correct. The two MODERATE findings are real functional bugs where the feature is silently disabled after server restart/recovery — the fix is adding 2 lines to each of the two methods.
⚠️ Request changes — add keepalive to RestartServerAsync and TryRecoverPersistentServerAsync
…ePing - TryRecoverPersistentServerAsync: stop before server kill, start after successful recovery - RestartServerAsync: stop at entry, start after sessions restored - StartKeepalivePing: use Interlocked.Exchange instead of plain assignment to prevent race with concurrent DisposeAsync Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR #402 Round 2 Review — Keepalive PingFix commit: "Fix review: keepalive in restart/recovery paths, atomic StartKeepalivePing" Previous Findings Status (5/5 model consensus)
All 10 Stop/Start call sites now form correct pairs across , , , , and . The unguarded in the two new restart paths is safe — both methods have early-returns for non-Persistent mode, and is guaranteed non-null on the success path. also null-checks each iteration as a safety net. ✅ ApproveAll blocking issues resolved. The remaining gap (F4 — no tests) is low-severity and doesn't block merge. |
…eWeen#402) ## Problem The headless Copilot server kills sessions after ~35 minutes of inactivity. Multi-agent workers that run long tool executions (PR reviews, code analysis) get killed mid-work, losing their results. This was observed repeatedly: - `PR Review Squad-worker-1`: Killed after 23 min during PR review (16:15:53 `session.shutdown`) - `PP- IC Things` workers: Multiple sessions killed overnight - `ci-agentic`: Server killed session at 20:13, user sent prompt at 20:14 to dead session ## Root Cause The headless server has an idle timeout (~35 min) that kills sessions when no client messages arrive. During long tool executions, the client doesn't send any RPC messages — the SDK handles tool dispatch internally — so the server thinks the connection is idle. ## Fix Add a background keepalive loop that calls `CopilotClient.PingAsync("keepalive")` every 15 minutes. This sends an RPC message to the server, resetting its idle timer. ``` [KEEPALIVE] Ping sent to headless server ← every 15 min ``` **Lifecycle:** - Starts after `InitializeAsync` and `ReconnectAsync` (non-demo, non-remote modes only) - Stops on `ReconnectAsync` teardown and `DisposeAsync` - Skips pings in Demo/Remote mode (no headless server) - All exceptions caught — will not crash the app ## Safety - Uses existing `CopilotClient.PingAsync()` SDK API — no new APIs - Single background task with `CancellationToken` — clean shutdown - No state mutations — only sends a ping and logs - `[KEEPALIVE]` tag added to diagnostic log filter for traceability - All 2716 tests pass Closes PureWeen#396 --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Problem
The headless Copilot server kills sessions after ~35 minutes of inactivity. Multi-agent workers that run long tool executions (PR reviews, code analysis) get killed mid-work, losing their results. This was observed repeatedly:
PR Review Squad-worker-1: Killed after 23 min during PR review (16:15:53session.shutdown)PP- IC Thingsworkers: Multiple sessions killed overnightci-agentic: Server killed session at 20:13, user sent prompt at 20:14 to dead sessionRoot Cause
The headless server has an idle timeout (~35 min) that kills sessions when no client messages arrive. During long tool executions, the client doesn't send any RPC messages — the SDK handles tool dispatch internally — so the server thinks the connection is idle.
Fix
Add a background keepalive loop that calls
CopilotClient.PingAsync("keepalive")every 15 minutes. This sends an RPC message to the server, resetting its idle timer.Lifecycle:
InitializeAsyncandReconnectAsync(non-demo, non-remote modes only)ReconnectAsyncteardown andDisposeAsyncSafety
CopilotClient.PingAsync()SDK API — no new APIsCancellationToken— clean shutdown[KEEPALIVE]tag added to diagnostic log filter for traceabilityCloses #396