feat(diagnostics): zero-idle session capture bundle for SDK bug investigation#360
Conversation
…tigation When the TurnEnd→Idle fallback fires (SessionIdleEvent missing, SDK bug PureWeen#299), automatically write a diagnostic capture file to ~/.polypilot/zero-idle-captures/ containing: - Session metadata (ID, name, model, history size) - Processing state snapshot (phase, tool counts, event count) - Timing (TurnEnd age, last event age, fallback wait duration) - Concurrency context (total/processing session counts) - Last 50 events from events.jsonl (SDK-side event log) Also adds: - EnableVerboseEventTracing setting: when true, logs ALL SDK event types to event-diagnostics.log (not just 4 lifecycle events) - EventCountThisTurn counter on SessionState - TurnEndReceivedAtTicks timestamp for fallback wait measurement - Startup purge: keeps last 100 captures, deletes older ones - 8 tests for purge logic, settings round-trip, API surface Closes PureWeen#359 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
btessiau
left a comment
There was a problem hiding this comment.
🔍 PR #360 Review — feat(diagnostics): zero-idle session capture bundle for SDK bug investigation
CI Status:
Prior reviews: None
Summary
PR adds diagnostic capture files when the TurnEnd→Idle fallback fires (SDK bug #299). New fields EventCountThisTurn and TurnEndReceivedAtTicks on SessionState, written atomically; CaptureZeroIdleDiagnostics() writes a JSON snapshot to ~/.polypilot/zero-idle-captures/; startup purge keeps last 100. EnableVerboseEventTracing setting enables full per-event logging.
Findings
🟡 MODERATE — CopilotService.Events.cs:2272 — sessionId null dereference fails capture silently
var sessionId = state.Info.SessionId; // string? — can be null
...
var fileName = $"capture_..._{sessionId[..Math.Min(8, sessionId.Length)]}.json";
// ^^^^^^^^^^^^ NullReferenceException if sessionId is nullAgentSessionInfo.SessionId is string?. It's assigned from SessionStartEvent at line 599, but on resumed/restored sessions the SDK may not replay SessionStartEvent — the session can go through an entire turn without it. If SessionId is null when the fallback fires, CaptureZeroIdleDiagnostics throws NullReferenceException, which is caught by the outer catch block and silently drops the capture with only a Console.Error message. No crash, but the diagnostic is lost for exactly the resumed sessions most likely to exhibit the bug.
Fix: var sessionId = state.Info.SessionId ?? "unknown";
🟢 MINOR — CopilotService.Events.cs:2239 — History.Count read on background thread
["history_size"] = state.Info.History.Count,CaptureZeroIdleDiagnostics runs inside Task.Run (the fallback timer task). state.Info.History is a plain List<ChatMessage>. The UI thread may be calling History.Add() (in CompleteResponse or event handlers) concurrently. In practice the outer catch protects against exceptions, and this is diagnostic-only, but the count could be stale or the read could throw InvalidOperationException on an unlucky resize. The rest of the capture fields correctly use Volatile.Read / Interlocked.Read for their cross-thread reads.
Fix: ["history_size"] = Volatile.Read(ref state.Info.MessageCount) — MessageCount is already maintained as a volatile-compatible int alongside History and is the established pattern in this codebase.
What Looks Correct ✅
New SessionState fields: EventCountThisTurn (int) and TurnEndReceivedAtTicks (long) — both accessed via Interlocked.Increment, Interlocked.Exchange, Interlocked.Read, Volatile.Read. Reset in both SendPromptAsync (line 2436+) and CompleteResponse (line 895+). Consistent with the existing pattern.
TurnEndReceivedAtTicks write: Set at AssistantTurnEndEvent (line 476) inside HandleSessionEvent — correct place, since this is the event that arms the fallback timer.
CaptureZeroIdleDiagnostics call placement: Called after all guards pass and immediately before InvokeOnUI(() => CompleteResponse(...)). The capture is synchronous file I/O on the fallback Task.Run thread — it doesn't block the UI thread, and the fallback has already waited TurnEndIdleFallbackMs (+ tools wait if applicable) so the additional file I/O latency is negligible.
PurgeOldCaptures: Sorts by filename descending (ISO timestamp format guarantees lexicographic = chronological), skips keepCount newest, deletes the rest. Correct. Wired in MauiProgram.cs at startup.
Test isolation: TestSetup.cs adds CopilotService.SetCaptureDirForTesting(...) to the module initializer. ZeroIdleCaptureTests uses [Collection("ZeroIdleCapture")] + IDisposable with per-test Guid-named temp dirs. SetBaseDirForTesting also nulls _zeroIdleCaptureDir so the path resolves fresh in each test run.
EnableVerboseEventTracing: Added to ConnectionSettings with = false default. _currentSettings?.EnableVerboseEventTracing is correctly null-safe. No existing tests broken.
[ZERO-IDLE] log tag: Added to the Debug method's allow-list (line 479) so it appears in event-diagnostics.log — consistent with the diagnostic tag system.
ChatExperienceSafetyTests.cs:842: The 5000 → 6000 widening is safe — it just extends the substring scan to catch SendingFlag when CompleteResponse grows with more code.
Test Coverage
ZeroIdleCaptureTests.cs (8 tests):
PurgeOldCaptures_KeepsOnlyMostRecent/NoOpWhenFewFiles/NoOpWhenDirMissing— ✅ purge logicEnableVerboseEventTracing_DefaultsToFalse/RoundTripsViaJson— ✅ settings serialization- 3 reflection guard tests — ✅ verify internal static methods exist
Coverage gap: No test exercises CaptureZeroIdleDiagnostics end-to-end (creates capture file, verifies JSON content). This is harder to test (requires a live SessionState), but the capture format and purge retention are the most important things to guard against regressions. The existing tests are reasonable given the test architecture constraints.
Verdict: ⚠️ Request Changes
One targeted fix before merge:
CopilotService.Events.cs:2204— null-guardsessionId: Changevar sessionId = state.Info.SessionId;tovar sessionId = state.Info.SessionId ?? "unknown";to prevent silent capture loss on resumed sessions.
The History.Count threading issue is low risk (diagnostic-only, fully wrapped in try/catch) — consider state.Info.MessageCount as a cleaner fix but it's not blocking.
1. sessionId ?? "unknown" prevents NullReferenceException on resumed sessions where SessionStartEvent was not replayed 2. state.Info.MessageCount instead of History.Count avoids reading a plain List from the background fallback thread Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
btessiau
left a comment
There was a problem hiding this comment.
Both items addressed in 6c7e56f:
🟡 MODERATE — sessionId null dereference → Fixed. var sessionId = state.Info.SessionId ?? "unknown" — captures on resumed sessions now work correctly instead of silently dropping.
🟢 MINOR — History.Count on background thread → Fixed. Switched to state.Info.MessageCount which is the established thread-safe pattern in the codebase.
Build clean, 8/8 tests pass.
btessiau
left a comment
There was a problem hiding this comment.
🔄 PR #360 Re-Review (Round 2) — Zero-Idle Capture Diagnostics
CI Status:
Fix commit reviewed: 6c7e56f
Previous Findings Status
| # | Severity | Finding | Status |
|---|---|---|---|
| 1 | 🟡 MODERATE | sessionId null dereference — state.Info.SessionId used in Path.Combine and string slicing without null guard |
✅ FIXED — state.Info.SessionId ?? "unknown" |
| 2 | 🟢 MINOR | TurnEndReceivedAtTicks double-read without Interlocked.Read on background thread |
Finding #2 — STILL PRESENT
CopilotService.Events.cs:2206–2207
var turnEndAge = state.TurnEndReceivedAtTicks > 0
? (now - new DateTime(state.TurnEndReceivedAtTicks, DateTimeKind.Utc)).TotalSeconds
: -1;TurnEndReceivedAtTicks is a long field written via Interlocked.Exchange on the event-handler thread and read here on the background TurnEnd→Idle fallback Task. On ARM 32-bit (Android), 64-bit long reads are not guaranteed atomic without a memory barrier — the two bare reads (comparison and value) could observe different values if a write races between them, producing a negative or wildly wrong turnEndAge.
Compare to the correct adjacent pattern for LastEventAtTicks one line below:
var lastEventAge = (now - new DateTime(Interlocked.Read(ref state.LastEventAtTicks), DateTimeKind.Utc)).TotalSeconds;Fix:
var turnEndTicks = Interlocked.Read(ref state.TurnEndReceivedAtTicks);
var turnEndAge = turnEndTicks > 0
? (now - new DateTime(turnEndTicks, DateTimeKind.Utc)).TotalSeconds
: -1;The fix commit addressed History.Count → MessageCount (a separate thread-safety issue on the List<ChatMessage>) but did not address the TurnEndReceivedAtTicks double-read.
New Findings
None. The fix commit is otherwise clean. sessionId ?? "unknown" correctly prevents the silent capture drop on resumed sessions, and MessageCount is the right thread-safe substitute for History.Count.
Recommendation: ⚠️ Request changes (one remaining fix)
Apply Interlocked.Read for TurnEndReceivedAtTicks at CopilotService.Events.cs:2206–2207 using the same pattern as LastEventAtTicks. One-line fix; everything else is ready.
…read Bare long read is non-atomic on 32-bit ARM (Android). Use the same Interlocked.Read pattern as the adjacent LastEventAtTicks read. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
btessiau
left a comment
There was a problem hiding this comment.
🔄 PR #360 Re-Review (Round 3) — Zero-Idle Capture Diagnostics
CI Status:
Commits reviewed: 0ee8111 (original) → 6c7e56f (Round 2 fixes) → 46b3f72 (Round 3 fix)
Previous Findings Status
| # | Finding | Round 3 Status |
|---|---|---|
| 1 | 🟡 sessionId null dereference in capture filename |
✅ FIXED (Round 2) |
| 2 | 🟢 TurnEndReceivedAtTicks double bare read on background thread |
✅ FIXED |
Finding #2 fix confirmed (CopilotService.Events.cs:2206-2210):
var turnEndTicks = Interlocked.Read(ref state.TurnEndReceivedAtTicks); // single atomic read
var turnEndAge = turnEndTicks > 0
? (now - new DateTime(turnEndTicks, DateTimeKind.Utc)).TotalSeconds
: -1;Correctly snapshots to a local before both uses, matching the LastEventAtTicks pattern on the very next line. ✅
One Remaining 🟢 MINOR Item (non-blocking)
CopilotService.Events.cs:2277 — Debug log line still uses bare field reads:
Debug($"[ZERO-IDLE] ... (events={state.EventCountThisTurn}, historySize={state.Info.History.Count}, ...");state.EventCountThisTurn— bareintread (structured capture on line 2249 usesVolatile.Read)state.Info.History.Count— directList<T>.Countread from background thread (structured capture switched tostate.Info.MessageCount)
Both are in a debug-only log line, so no correctness or data-loss risk. The values may be slightly stale or inconsistent relative to the structured capture fields, which could cause minor confusion when correlating log output with capture files during investigation.
Verdict
✅ Approve — all blocking issues resolved. The remaining item is a 🟢 MINOR inconsistency in a debug-only log string with no impact on capture correctness or app behavior. Fine to merge when ready.
…tigation (PureWeen#360) ## Summary Adds automatic diagnostic capture when PolyPilot detects a missing `session.idle` event (SDK bug PureWeen#299). Instead of just firing the fallback and moving on, we now write a "black box recorder" snapshot that can be used to analyze what happened and help file accurate upstream SDK issues. Closes PureWeen#359 ## The Problem Issue PureWeen#299: some Copilot CLI sessions never emit `session.idle` after a turn completes. PolyPilot's mitigations (TurnEnd→Idle fallback, watchdog, content flush) handle the symptom, but **none capture _why_ it happened**. The upstream analysis in [copilot-sdk#794](github/copilot-sdk#794) identified theoretical vulnerabilities but has not reproduced either in the real CLI. ## What This PR Adds ### 1. Zero-Idle Capture Bundle When the IDLE-FALLBACK fires, a JSON capture file is written to `~/.polypilot/zero-idle-captures/`: ```json { "capture_timestamp": "2026-03-11T22:45:00.123Z", "trigger": "IDLE_FALLBACK", "session": { "session_id": "abc-123", "model": "claude-sonnet-4-20250514", "history_size": 42, "is_multi_agent": false }, "processing_state": { "is_processing": true, "processing_phase": 3, "active_tool_call_count": 0, "has_used_tools_this_turn": true, "event_count_this_turn": 87 }, "timing": { "turn_end_age_seconds": 34.12, "last_event_age_seconds": 34.05, "fallback_wait_ms": 34000 }, "concurrency": { "total_sessions": 5, "processing_sessions": 2 }, "events_jsonl_tail": [ {"timestamp": "22:44:26", "type": "tool.execution_complete", "detail": "..."}, {"timestamp": "22:44:26", "type": "assistant.turn_end", "detail": ""} ] } ``` ### 2. Verbose Event Tracing New setting: `EnableVerboseEventTracing` (default: `false`). When enabled, logs **every** SDK event type to `event-diagnostics.log` via `[EVT-TRACE]` tag. Currently only 4 lifecycle events are logged — this reveals the exact last event before silence. ### 3. Per-Turn Event Counter `EventCountThisTurn` increments on every SDK event in `HandleSessionEvent`, resets on `SendPromptAsync` and `CompleteResponse`. Answers: "Did the session receive 3 events or 300 before going silent?" ### 4. Startup Purge Keeps last 100 capture files, auto-deletes older ones on startup. ## How This Helps Investigate PureWeen#299 | Captured Data | What It Reveals | |---|---| | Last events from events.jsonl | Which CLI code path failed to emit idle | | Event count this turn | How much activity before silence | | Tool state at fallback time | Correlation with CLI's `processQueuedItems` post-loop | | History size + concurrent sessions | Environmental triggers | | Timing (TurnEnd age) | Exact fallback wait duration | ## Files Changed | File | Change | |------|--------| | `CopilotService.Events.cs` | `CaptureZeroIdleDiagnostics()`, `PurgeOldCaptures()`, event counter increment, verbose tracing, capture dir management | | `CopilotService.cs` | `EventCountThisTurn` + `TurnEndReceivedAtTicks` fields, reset in `SendPromptAsync`, capture dir in `SetBaseDirForTesting` | | `ConnectionSettings.cs` | `EnableVerboseEventTracing` toggle | | `MauiProgram.cs` | Startup purge call | | `TestSetup.cs` | Capture dir isolation | | `ChatExperienceSafetyTests.cs` | Bumped source scan window (5000→6000 chars) | | `ZeroIdleCaptureTests.cs` | **NEW** — 8 tests for purge, settings, API surface | ## Performance Impact - **Zero cost in normal operation** — capture only fires when the fallback triggers (rare) - **Verbose tracing** — opt-in, adds one `Debug()` call per event (~negligible) - **Capture write** — single JSON file (~5-10 KB), non-blocking --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Adds automatic diagnostic capture when PolyPilot detects a missing
session.idleevent (SDK bug #299). Instead of just firing the fallback and moving on, we now write a "black box recorder" snapshot that can be used to analyze what happened and help file accurate upstream SDK issues.Closes #359
The Problem
Issue #299: some Copilot CLI sessions never emit
session.idleafter a turn completes. PolyPilot's mitigations (TurnEnd→Idle fallback, watchdog, content flush) handle the symptom, but none capture why it happened. The upstream analysis in copilot-sdk#794 identified theoretical vulnerabilities but has not reproduced either in the real CLI.What This PR Adds
1. Zero-Idle Capture Bundle
When the IDLE-FALLBACK fires, a JSON capture file is written to
~/.polypilot/zero-idle-captures/:{ "capture_timestamp": "2026-03-11T22:45:00.123Z", "trigger": "IDLE_FALLBACK", "session": { "session_id": "abc-123", "model": "claude-sonnet-4-20250514", "history_size": 42, "is_multi_agent": false }, "processing_state": { "is_processing": true, "processing_phase": 3, "active_tool_call_count": 0, "has_used_tools_this_turn": true, "event_count_this_turn": 87 }, "timing": { "turn_end_age_seconds": 34.12, "last_event_age_seconds": 34.05, "fallback_wait_ms": 34000 }, "concurrency": { "total_sessions": 5, "processing_sessions": 2 }, "events_jsonl_tail": [ {"timestamp": "22:44:26", "type": "tool.execution_complete", "detail": "..."}, {"timestamp": "22:44:26", "type": "assistant.turn_end", "detail": ""} ] }2. Verbose Event Tracing
New setting:
EnableVerboseEventTracing(default:false). When enabled, logs every SDK event type toevent-diagnostics.logvia[EVT-TRACE]tag. Currently only 4 lifecycle events are logged — this reveals the exact last event before silence.3. Per-Turn Event Counter
EventCountThisTurnincrements on every SDK event inHandleSessionEvent, resets onSendPromptAsyncandCompleteResponse. Answers: "Did the session receive 3 events or 300 before going silent?"4. Startup Purge
Keeps last 100 capture files, auto-deletes older ones on startup.
How This Helps Investigate #299
processQueuedItemspost-loopFiles Changed
CopilotService.Events.csCaptureZeroIdleDiagnostics(),PurgeOldCaptures(), event counter increment, verbose tracing, capture dir managementCopilotService.csEventCountThisTurn+TurnEndReceivedAtTicksfields, reset inSendPromptAsync, capture dir inSetBaseDirForTestingConnectionSettings.csEnableVerboseEventTracingtoggleMauiProgram.csTestSetup.csChatExperienceSafetyTests.csZeroIdleCaptureTests.csPerformance Impact
Debug()call per event (~negligible)