Skip to content

Add keepalive ping to prevent server idle timeout (#396)#402

Merged
PureWeen merged 2 commits intomainfrom
fix/session-keepalive
Mar 17, 2026
Merged

Add keepalive ping to prevent server idle timeout (#396)#402
PureWeen merged 2 commits intomainfrom
fix/session-keepalive

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

Problem

The headless Copilot server kills sessions after ~35 minutes of inactivity. Multi-agent workers that run long tool executions (PR reviews, code analysis) get killed mid-work, losing their results. This was observed repeatedly:

  • PR Review Squad-worker-1: Killed after 23 min during PR review (16:15:53 session.shutdown)
  • PP- IC Things workers: Multiple sessions killed overnight
  • ci-agentic: Server killed session at 20:13, user sent prompt at 20:14 to dead session

Root Cause

The headless server has an idle timeout (~35 min) that kills sessions when no client messages arrive. During long tool executions, the client doesn't send any RPC messages — the SDK handles tool dispatch internally — so the server thinks the connection is idle.

Fix

Add a background keepalive loop that calls CopilotClient.PingAsync("keepalive") every 15 minutes. This sends an RPC message to the server, resetting its idle timer.

[KEEPALIVE] Ping sent to headless server    ← every 15 min

Lifecycle:

  • Starts after InitializeAsync and ReconnectAsync (non-demo, non-remote modes only)
  • Stops on ReconnectAsync teardown and DisposeAsync
  • Skips pings in Demo/Remote mode (no headless server)
  • All exceptions caught — will not crash the app

Safety

  • Uses existing CopilotClient.PingAsync() SDK API — no new APIs
  • Single background task with CancellationToken — clean shutdown
  • No state mutations — only sends a ping and logs
  • [KEEPALIVE] tag added to diagnostic log filter for traceability
  • All 2716 tests pass

Closes #396

The headless Copilot server kills sessions after ~35 minutes of
inactivity. This causes multi-agent workers to lose their work
mid-execution when tool runs take a long time.

Add a background keepalive loop that pings the server every 15
minutes via CopilotClient.PingAsync(). This resets the server's
connection-level idle timer and prevents session cleanup.

The keepalive:
- Starts after InitializeAsync and ReconnectAsync (non-demo/remote)
- Stops on ReconnectAsync teardown and DisposeAsync
- Skips pings in Demo/Remote mode (no headless server)
- Logs [KEEPALIVE] to the diagnostic log for traceability
- Catches all exceptions to avoid crashing the app

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

PR #402 Review — Add keepalive ping to prevent server idle timeout

CI: No CI configured | Tests: ✅ 2716 passed, 0 failed | Existing reviews: None


Methodology

5-model parallel review (claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex). Consensus filter: only findings flagged by 2+ models included.


🟡 MODERATE — Two server-restart paths missing keepalive lifecycle (5/5 and 4/5 consensus)

PolyPilot/Services/CopilotService.cs:1069TryRecoverPersistentServerAsync (4/5 models)

Disposes the old client at line 1118 and creates a new one at line 1120, but never calls StopKeepalivePing() before disposal or StartKeepalivePing() after success. After a server recovery, the new connection has no keepalive pings — it will be killed by the server's idle timeout again in ~35 minutes, defeating the purpose of this PR. The old loop technically continues running but silently skips (because it does continue on _client == null), however it eventually pings the new client with an unsynchronized timer.

Fix:

// Before line 1095 (_serverManager.StopServer())
StopKeepalivePing();

// After line 1128 (_lastRecoveryCompletedAt = DateTime.UtcNow)
    StartKeepalivePing();

PolyPilot/Services/CopilotService.cs:1150RestartServerAsync (5/5 models)

Same issue. Disposes the old client at line 1182 and creates a new one at line 1210, but neither StopKeepalivePing() nor StartKeepalivePing() are called anywhere in this method. After a server restart (triggered by stale native modules), sessions will timeout again.

Fix:

// After line 1164 (ServerHealthNotice = null;)
StopKeepalivePing();

// After line 1233 (ReconcileOrganization())
    StartKeepalivePing();

🟢 MINOR — StartKeepalivePing atomicity inconsistency (3/5 consensus)

PolyPilot/Services/CopilotService.cs:531-535

StopKeepalivePing uses Interlocked.Exchange (correct), but StartKeepalivePing does a plain _keepaliveCts = cts write. The main callers (InitializeAsync, ReconnectAsync) are serialized by _clientReconnectLock, so the race is unlikely in practice. However, DisposeAsync calls StopKeepalivePing without acquiring the lock. Narrow window: if DisposeAsync fires between the StopKeepalivePing() call and the _keepaliveCts = cts assignment in StartKeepalivePing, the new CTS is never cancelled and the loop leaks until the process exits.

Fix: use Interlocked.Exchange in StartKeepalivePing too and cancel/dispose the displaced value.


🟢 MINOR — No tests for keepalive lifecycle (3/5 consensus)

No test coverage exists for:

  • Keepalive loop exits on cancellation
  • StopKeepalivePing in DisposeAsync actually cancels the task
  • RestartServerAsync/TryRecoverPersistentServerAsync restart keepalive after reconnect (this would have caught the MODERATE bugs above)

Suggested test cases in CopilotServiceInitializationTests.cs:

  • StartKeepalivePing_CancelTokenStopsLoop
  • RestartServerAsync_RestartsKeepalive
  • TryRecoverPersistentServerAsync_RestartsKeepalive

Summary

Finding Severity Consensus File:Line
TryRecoverPersistentServerAsync missing stop/start 🟡 MODERATE 4/5 CopilotService.cs:1069
RestartServerAsync missing stop/start 🟡 MODERATE 5/5 CopilotService.cs:1150
StartKeepalivePing non-atomic write 🟢 MINOR 3/5 CopilotService.cs:531
No keepalive tests 🟢 MINOR 3/5 PolyPilot.Tests/

The core logic of the PR (loop design, cancellation, error handling, call sites in InitializeAsync/ReconnectAsync/DisposeAsync) is correct. The two MODERATE findings are real functional bugs where the feature is silently disabled after server restart/recovery — the fix is adding 2 lines to each of the two methods.

⚠️ Request changes — add keepalive to RestartServerAsync and TryRecoverPersistentServerAsync

…ePing

- TryRecoverPersistentServerAsync: stop before server kill, start
  after successful recovery
- RestartServerAsync: stop at entry, start after sessions restored
- StartKeepalivePing: use Interlocked.Exchange instead of plain
  assignment to prevent race with concurrent DisposeAsync

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

PR #402 Round 2 Review — Keepalive Ping

Fix commit: "Fix review: keepalive in restart/recovery paths, atomic StartKeepalivePing"
Tests: ✅ 2716 passed, 0 failed | New findings: None


Previous Findings Status (5/5 model consensus)

Finding Status Details
🟡 F1 — RestartServerAsync missing keepalive FIXED StopKeepalivePing() at :1171, StartKeepalivePing() at :1241. Error/early-return paths correctly skip start (no working server).
🟡 F2 — TryRecoverPersistentServerAsync missing keepalive FIXED StopKeepalivePing() at :1099, StartKeepalivePing() at :1133 (success path only). Failure and catch paths leave keepalive stopped.
🟢 F3 — StartKeepalivePing non-atomic write FIXED Now uses Interlocked.Exchange(ref _keepaliveCts, cts) before launching the loop — consistent with StopKeepalivePing. Concurrent stop after Exchange is handled gracefully via cancellation.
🟢 F4 — No keepalive test coverage ⚠️ STILL PRESENT Zero keepalive tests added. Low severity given the straightforward logic, but start/stop/restart lifecycle remains untested.

All 10 Stop/Start call sites now form correct pairs across , , , , and . The unguarded in the two new restart paths is safe — both methods have early-returns for non-Persistent mode, and is guaranteed non-null on the success path. also null-checks each iteration as a safety net.


✅ Approve

All blocking issues resolved. The remaining gap (F4 — no tests) is low-severity and doesn't block merge.

@PureWeen PureWeen merged commit 5f2b398 into main Mar 17, 2026
@PureWeen PureWeen deleted the fix/session-keepalive branch March 17, 2026 17:54
arisng pushed a commit to arisng/PolyPilot that referenced this pull request Apr 4, 2026
…eWeen#402)

## Problem

The headless Copilot server kills sessions after ~35 minutes of
inactivity. Multi-agent workers that run long tool executions (PR
reviews, code analysis) get killed mid-work, losing their results. This
was observed repeatedly:

- `PR Review Squad-worker-1`: Killed after 23 min during PR review
(16:15:53 `session.shutdown`)
- `PP- IC Things` workers: Multiple sessions killed overnight  
- `ci-agentic`: Server killed session at 20:13, user sent prompt at
20:14 to dead session

## Root Cause

The headless server has an idle timeout (~35 min) that kills sessions
when no client messages arrive. During long tool executions, the client
doesn't send any RPC messages — the SDK handles tool dispatch internally
— so the server thinks the connection is idle.

## Fix

Add a background keepalive loop that calls
`CopilotClient.PingAsync("keepalive")` every 15 minutes. This sends an
RPC message to the server, resetting its idle timer.

```
[KEEPALIVE] Ping sent to headless server    ← every 15 min
```

**Lifecycle:**
- Starts after `InitializeAsync` and `ReconnectAsync` (non-demo,
non-remote modes only)
- Stops on `ReconnectAsync` teardown and `DisposeAsync`
- Skips pings in Demo/Remote mode (no headless server)
- All exceptions caught — will not crash the app

## Safety

- Uses existing `CopilotClient.PingAsync()` SDK API — no new APIs
- Single background task with `CancellationToken` — clean shutdown
- No state mutations — only sends a ping and logs
- `[KEEPALIVE]` tag added to diagnostic log filter for traceability
- All 2716 tests pass

Closes PureWeen#396

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Send keep-alive pings to prevent server idle timeout killing sessions

1 participant