feat(diagnostics): zero-idle session capture bundle for SDK bug investigation #359

@btessiau

Problem

Issue #299 identifies that some Copilot CLI sessions never emit session.idle after a turn completes. PolyPilot has solid mitigations (TurnEnd→Idle fallback timer, watchdog, content flush), but none of them capture why it happened. This makes it difficult to file an accurate upstream SDK issue or confirm whether the root cause is in the CLI, the JSON-RPC transport, or PolyPilot itself.

The upstream analysis in copilot-sdk#794 identified two theoretical vulnerabilities but has not reproduced either in the real CLI. We need better diagnostic data from PolyPilot to narrow the root cause.

Proposed Solution: Zero-Idle Capture Bundle

When PolyPilot detects a missing session.idle event (i.e., the TurnEnd→Idle fallback fires), automatically capture a diagnostic snapshot that can be used to analyze what happened.

1. Capture bundle on fallback trigger

When the IDLE-FALLBACK fires (lines 519-524 in CopilotService.Events.cs), write a JSON capture file to ~/.polypilot/zero-idle-captures/:

~/.polypilot/zero-idle-captures/
  capture_2026-03-11T22-45-00_sess-abc123.json

Each capture contains:

  • Session metadata: session ID, name, model, history size, group
  • Processing state snapshot: IsProcessing, ProcessingPhase, ActiveToolCallCount, HasUsedToolsThisTurn, ProcessingGeneration, LastEventAtTicks age
  • Event sequence: Last 50 events from events.jsonl (parsed via existing ParseEventLogFile())
  • Timing: When TurnEnd was received, how long the fallback waited, whether tools were used
  • Concurrency context: Total active sessions, total sessions with IsProcessing=true
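
To make the field list above concrete, here is a minimal sketch of what the capture payload and file name could look like. The type name `ZeroIdleCapture`, the helper `ZeroIdleCaptureFormat`, and every property type are assumptions made for illustration; only the field names come from the list above, and the real `SessionState` members may be typed differently.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.Json;

// Hypothetical capture shape: names mirror the bullet list above, types are assumed.
public sealed record ZeroIdleCapture(
    string SessionId,
    string SessionName,
    string Model,
    int HistorySize,
    string Group,
    bool IsProcessing,
    string ProcessingPhase,
    int ActiveToolCallCount,
    bool HasUsedToolsThisTurn,
    long ProcessingGeneration,
    double LastEventAgeSeconds,
    IReadOnlyList<string> LastEvents,      // last 50 lines from events.jsonl
    DateTimeOffset TurnEndReceivedAt,
    double FallbackWaitSeconds,
    int ActiveSessionCount,
    int ProcessingSessionCount);

public static class ZeroIdleCaptureFormat
{
    private static readonly JsonSerializerOptions Options = new() { WriteIndented = true };

    // Uses '-' instead of ':' in the time part so the name is valid on every file system.
    public static string FileName(DateTimeOffset capturedAt, string sessionId) =>
        "capture_"
        + capturedAt.ToString("yyyy-MM-dd'T'HH-mm-ss", CultureInfo.InvariantCulture)
        + "_" + sessionId + ".json";

    // Indented output keeps each capture human-readable.
    public static string Serialize(ZeroIdleCapture capture) =>
        JsonSerializer.Serialize(capture, Options);
}
```

With these assumptions, `ZeroIdleCaptureFormat.FileName(new DateTimeOffset(2026, 3, 11, 22, 45, 0, TimeSpan.Zero), "sess-abc123")` yields `capture_2026-03-11T22-45-00_sess-abc123.json`, matching the example name shown earlier.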

2. All-events tracing to diagnostics log

Currently, HandleSessionEvent logs only four event types (TurnStart, TurnEnd, Idle, Error) to event-diagnostics.log. For zero-idle investigation, it is critical to know exactly which event arrived last before the silence.

Add a diagnostic setting EnableVerboseEventTracing (default: false) that, when enabled, logs every SDK event type to the diagnostics log. This reveals whether the last event was ToolExecutionComplete, AssistantMessage, SessionCompactionComplete, etc.
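
The gating described above could be as small as the following sketch. `EventTraceFilter` and `ShouldLog` are hypothetical names, and the event types are passed as plain strings for illustration; the real handler presumably switches on SDK event classes rather than names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: decides whether an event type is written to the diagnostics log.
public static class EventTraceFilter
{
    // The four event types the issue says are logged today.
    private static readonly HashSet<string> AlwaysLogged =
        new(StringComparer.Ordinal) { "TurnStart", "TurnEnd", "Idle", "Error" };

    // verboseTracing stands in for the proposed EnableVerboseEventTracing setting.
    public static bool ShouldLog(string eventType, bool verboseTracing) =>
        verboseTracing || AlwaysLogged.Contains(eventType);
}
```

When the setting is off, the only added work per event is one hash-set lookup, which is in line with the "no performance impact when tracing is disabled" criterion.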

3. Per-session event counter

Track a simple EventCountThisTurn counter on SessionState that increments on every event in HandleSessionEvent. When the fallback fires, include this count in the capture. This answers: "Did the session receive 3 events or 300 before going silent?"
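
A minimal sketch of such a counter follows. It assumes events can arrive on more than one thread (hence `Interlocked`) and that the counter is reset when a new turn starts; neither assumption is confirmed by the issue. `SessionStateSketch` is a stand-in for the real `SessionState`, which has many more members.

```csharp
using System.Threading;

// Hypothetical stand-in for SessionState, showing only the proposed counter.
public sealed class SessionStateSketch
{
    private int _eventCountThisTurn;

    // Current count, read without tearing or stale caching.
    public int EventCountThisTurn => Volatile.Read(ref _eventCountThisTurn);

    // Would be called at the top of HandleSessionEvent for every event.
    public int RecordEvent() => Interlocked.Increment(ref _eventCountThisTurn);

    // Assumed to be called when a new turn starts, so the count is per turn.
    public void ResetEventCount() => Interlocked.Exchange(ref _eventCountThisTurn, 0);
}
```

The capture writer would then read `EventCountThisTurn` once at the moment the fallback fires and store it alongside the other fields.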

Why This Helps

| What we learn | How it helps |
|---|---|
| Last event type before silence | Narrows which CLI code path failed to emit idle |
| events.jsonl final entries | Shows what the CLI wrote vs what PP received (transport drop?) |
| Tool activity at fallback time | Correlates with CLI's processQueuedItems post-loop code |
| History size / concurrent sessions | Identifies environmental triggers (load, memory pressure) |
| Frequency data | Is it 1% of turns? 20%? Specific to certain models? |

Implementation Details

New/Modified Files

  • CopilotService.Events.cs: Capture logic at fallback firing points + all-events tracing
  • CopilotService.cs: EventCountThisTurn field on SessionState, capture writer method
  • ConnectionSettings.cs: EnableVerboseEventTracing toggle
  • PolyPilot.Tests/ZeroIdleCaptureTests.cs: Unit tests for capture format and field population

Storage

  • Location: ~/.polypilot/zero-idle-captures/
  • Format: JSON (one file per capture, human-readable)
  • Retention: Keep last 100 captures, auto-prune on startup
  • Size: ~5-10 KB per capture (50 event lines + metadata)
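
The retention rule above could be implemented roughly as follows. This is a sketch under stated assumptions: `ZeroIdleCaptureRetention` and `Prune` are hypothetical names, and ordering relies on the `capture_<timestamp>_<session>.json` naming scheme so that an ordinal sort by file name is also chronological.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical startup helper: keeps the newest `keep` captures and deletes the rest.
public static class ZeroIdleCaptureRetention
{
    public static int Prune(string directory, int keep = 100)
    {
        if (!Directory.Exists(directory)) return 0;

        // Fixed-width timestamps in the name make ordinal order equal to time order.
        var stale = new DirectoryInfo(directory)
            .EnumerateFiles("capture_*.json")
            .OrderByDescending(f => f.Name, StringComparer.Ordinal)
            .Skip(keep)
            .ToList();

        var deleted = 0;
        foreach (var file in stale)
        {
            try
            {
                file.Delete();
                deleted++;
            }
            catch (IOException) { /* file in use: leave it for the next startup */ }
            catch (UnauthorizedAccessException) { /* no permission: skip */ }
        }
        return deleted;
    }
}
```

Pruning is best-effort by design here: a failed delete is skipped rather than allowed to block startup.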

Performance Impact

  • Zero cost in normal operation: the capture runs only when the fallback triggers (rare)
  • Verbose tracing: opt-in via a settings toggle; adds one Debug() call per event when enabled (negligible)

Acceptance Criteria

  • Capture file written on every IDLE-FALLBACK trigger
  • Capture includes all fields listed above
  • Verbose event tracing gated behind settings toggle
  • No performance impact when tracing is disabled
  • Tests for capture format, field population, retention
  • Captures are never committed to git (add to .gitignore if needed)
