Remove RESUME-ABORT: never abort sessions on resume #452
In persistent mode the headless CLI server keeps running tools even while PolyPilot is down, so tool results will arrive once we reconnect. The abort was killing legitimate long-running tool executions (builds and tests that run 15-30+ min without writing to events.jsonl). The watchdog already handles truly dead sessions via timeout (30-600s depending on state), making the abort redundant and destructive.

Replaced both RESUME-ABORT and RESUME-SKIP-ABORT branches with a single RESUME-ACTIVE path that marks the session as processing and lets events flow naturally.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Diagnostic filter: [RESUME-ABORT] -> [RESUME-ACTIVE] + [RESUME-CHECK] (without this, RESUME-ACTIVE entries wouldn't appear in event-diagnostics.log)
- Utilities.cs: [RESUME-ABORT] -> [RESUME-CHECK] in HasInterruptedToolExecution
- RESUME-ACTIVE InvokeOnUI: add ProcessingGeneration capture/check per INV-3/INV-12 to prevent a stale callback from re-arming IsProcessing after a user-initiated turn has already completed (race window: send -> complete -> stale callback)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
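The generation guard described in the last bullet can be illustrated with a small sketch. This is not the PR's code: it is a language-neutral Python model in which `SessionState`, `begin_turn`, `complete_turn`, and `make_resume_callback` are hypothetical names, and only the idea (capture the generation when the callback is scheduled, re-check it when the callback runs) comes from the commit message.

```python
from dataclasses import dataclass


@dataclass
class SessionState:
    """Hypothetical stand-in for per-session state; field names mirror the commit text."""
    is_processing: bool = False
    processing_generation: int = 0


def begin_turn(state: SessionState) -> None:
    # Assumption: every user-initiated send bumps the generation counter.
    state.processing_generation += 1
    state.is_processing = True


def complete_turn(state: SessionState) -> None:
    state.is_processing = False


def make_resume_callback(state: SessionState):
    """Build the deferred RESUME-ACTIVE callback with a captured generation."""
    captured = state.processing_generation

    def callback() -> bool:
        # If a turn started since capture, this callback is stale: do nothing.
        if state.processing_generation != captured:
            return False
        state.is_processing = True
        return True

    return callback
```

In the race window named above (send -> complete -> stale callback), the callback sees a changed generation and leaves `is_processing` untouched; without an intervening turn it arms processing as intended.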
474db57 to d466c7b
🤖 Multi-Model Code Review: PR #452
Remove RESUME-ABORT: never abort sessions on resume

🟡 MODERATE: Dead sessions now wait up to 600s instead of immediate resolution
File:
The code unconditionally sets
Suggestion: Use

🟡 MODERATE: No tests for the behavioral change
File: PR-wide
High-risk state-machine change with no regression coverage added. Key untested scenarios:

🟢 POSITIVE: Generation guard is a good addition
File:
The

📋 Additional Notes

Opus-only observation (not consensus, for consideration):
The watchdog clears UI state but does NOT call

Recommended action:
Use IsSessionStillProcessing() to choose watchdog flags, not to decide abort:
- CLI active: HasUsedToolsThisTurn=true → 600s tool timeout (events will flow)
- CLI stale: leave HasUsedToolsThisTurn=false → 30s quiescence (fast cleanup)

Never abort in either case. Fixes the stuck-session UX regression where dead CLI sessions appeared to be processing for ~10 minutes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
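The flag selection in this commit can be sketched as follows. This is a simplified Python model, not the actual implementation: `choose_resume_flags` and `watchdog_timeout_s` are hypothetical helpers, the real watchdog likely weighs more state than shown, and only the 600s and 30s tiers and the "never abort" rule are taken from the commit message.

```python
TOOL_TIMEOUT_S = 600        # tier used when tools are assumed to be running
RESUME_QUIESCENCE_S = 30    # tier used when the CLI looks stale after resume


def choose_resume_flags(cli_still_processing: bool) -> dict:
    """Pick watchdog flags on resume; the session is marked processing either way."""
    return {
        "is_processing": True,
        # Only claim tool usage when the CLI is still active, so the
        # watchdog grants the long timeout to live sessions only.
        "has_used_tools_this_turn": cli_still_processing,
    }


def watchdog_timeout_s(flags: dict) -> int:
    """Simplified tier choice: long timeout for active tools, short otherwise."""
    if flags["has_used_tools_this_turn"]:
        return TOOL_TIMEOUT_S
    return RESUME_QUIESCENCE_S
```

The point of the design is that the liveness check only changes how long the watchdog waits; neither branch calls abort on a session that may still be doing work.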
🤖 Multi-Model Code Review: PR #452 (v2)
Remove RESUME-ABORT: never abort sessions on resume (updated diff)

Previous findings status

🟡 MODERATE: SDK pending tool state not cleared on dead-CLI resume path
File:
The old code called
Verified regression path:
Net effect: User's first message after a dead-CLI resume appears to hang for ~60s total (30s quiescence + 30s Case D) before the session recovers. Not permanent, but a meaningful UX regression vs the old immediate abort.
Suggested fix: Call

🟡 MODERATE: No test coverage for new resume branches
File: PR-wide
No tests exist for

✅ Improvements confirmed (all 3 models agree)

📋 Notes

Recommended action:
Address review findings:

1. RESUME-QUIESCE (dead CLI) now calls AbortAsync() to clear SDK-internal pending tool expectations before entering 30s quiescence. Without this, the user's first message after the watchdog clears IsProcessing would be silently dropped by the SDK (~60s total delay before recovery).
2. Add 4 tests for the resume branch selection logic:
   - ResumeActive_CliAlive_SetsHasUsedToolsThisTurn (600s tier)
   - ResumeQuiesce_CliDead_DoesNotSetHasUsedToolsThisTurn (30s tier)
   - ResumeActive_WatchdogTimeout_NotQuiescence (tier verification)
   - GenerationGuard_StaleCallback_IsNoOp (INV-3/INV-12 race)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🤖 Multi-Model Code Review: PR #452 (v3)
Remove RESUME-ABORT: never abort sessions on resume (3rd revision)

Previous findings status

🟡 MODERATE: AbortAsync race can push dead-CLI timeout from 30s to 120s
File:
The RESUME-QUIESCE path calls
When the watchdog first evaluates,
Net effect: Dead-CLI sessions may take ~120s instead of 30s to clear if AbortAsync triggers an event. Still better than the old behavior (immediate abort could kill legitimate work), but worse than the intended 30s.
Suggested fix: Reset

🟢 MINOR: Tests 3 & 4 test local variables, not production code
File:
Tests 1 & 2 (

🟢 MINOR: 30s "Working…" UX change on dead-CLI resume
File:
Old code: CLI dead →
This is an intentional design choice (documented in comments) and a reasonable tradeoff for unified processing state management. Noting for awareness.

✅ Improvements confirmed (all 3 models)

📋 Notes

Recommended action: ✅ Approve with minor suggestion
The critical finding from v2 (SDK tool state) is fixed. The AbortAsync race (MODERATE) has a simple one-line fix (
…sume

AbortAsync may trigger SDK events on a background thread before InvokeOnUI runs, setting HasReceivedEventsSinceResume=true. This defeats the 30s quiescence check, pushing the dead-CLI timeout from 30s to 120s. Reset the flag inside InvokeOnUI after AbortAsync completes so the watchdog correctly uses the 30s resume quiescence path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
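The ordering problem this commit fixes can be modeled in a few lines. The sketch below is an assumption-laden Python illustration: `ResumeState`, `fake_abort`, and `resume_quiesce` are hypothetical, the abort side effect is simulated synchronously rather than on a background thread, and the three-tier `watchdog_timeout_s` is a simplification that only reproduces the 30s, 120s, and 600s figures quoted in this thread.

```python
from dataclasses import dataclass


@dataclass
class ResumeState:
    """Hypothetical flags; names mirror those quoted in the commit message."""
    has_used_tools_this_turn: bool = False
    has_received_events_since_resume: bool = False


def watchdog_timeout_s(state: ResumeState) -> int:
    # Simplified tier selection, assumed for illustration.
    if state.has_used_tools_this_turn:
        return 600
    if state.has_received_events_since_resume:
        return 120
    return 30


def fake_abort(state: ResumeState) -> None:
    # Stands in for the abort call emitting an SDK event as a side effect.
    state.has_received_events_since_resume = True


def resume_quiesce(state: ResumeState, reset_flag_after_abort: bool) -> int:
    """Run the dead-CLI resume path and return the timeout the watchdog would use."""
    fake_abort(state)
    if reset_flag_after_abort:
        # The fix: clear the flag once the abort has completed.
        state.has_received_events_since_resume = False
    return watchdog_timeout_s(state)
```

Under these assumptions, skipping the reset yields the 120s tier and applying it restores the intended 30s quiescence.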
…elaunch

Two bugs prevented worker sessions from surviving app relaunch:

1. IsSessionStillProcessing used a whitelist of 'active' event types, but missed intermediate states like assistant.turn_end (between tool rounds), assistant.message, and tool.execution_end. Sessions caught between tool rounds were incorrectly detected as idle.
   Fix: Use a blacklist of terminal events (session.idle, session.error, session.shutdown) instead. Any non-terminal event means still active.
2. Actively-processing sessions were left as lazy placeholders with no SDK event handler (commit ff9d3d7). Events from the CLI never reached PolyPilot, the watchdog timed out after 600s, and multi-agent orchestrator TCSs were never completed.
   Fix: Add actively-processing sessions to eagerResumeCandidates so EnsureSessionConnectedAsync establishes the SDK connection.

PR #452 already removed RESUME-ABORT, so ResumeSessionAsync no longer disrupts in-flight tool execution.

Verified end-to-end: session processing during relaunch → new app detects active session → eager resume → SDK connected → events flow → session completes normally. Multi-agent orchestration resumes correctly via ResumeOrchestrationIfPendingAsync.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
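The first fix, switching from a whitelist of active events to a blacklist of terminal ones, can be sketched like this. It is a minimal Python illustration rather than the project's code: `is_session_still_processing` is a hypothetical function operating on the type of the last recorded event, and treating "no events at all" as not processing is an assumption made here, not something the commit states.

```python
from typing import Optional

# Terminal event types listed in the commit message.
TERMINAL_EVENTS = frozenset({"session.idle", "session.error", "session.shutdown"})


def is_session_still_processing(last_event_type: Optional[str]) -> bool:
    """Treat any non-terminal last event as evidence the session is still active."""
    if last_event_type is None:
        # Assumption: a session with no recorded events is not processing.
        return False
    return last_event_type not in TERMINAL_EVENTS
```

With this shape, intermediate states such as `assistant.turn_end` or `tool.execution_end` count as active without having to be enumerated, which is the failure mode the whitelist had.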
## Summary

Removes the RESUME-ABORT logic that was killing legitimate long-running sessions on app restart.

## Problem

In persistent mode, the headless CLI keeps running tools while PolyPilot is down. On resume, the old code aborted sessions with unmatched tool starts, destructively killing 15-30+ min tool runs.

## Fix

Single `RESUME-ACTIVE` path: mark as processing, let events flow, watchdog handles dead sessions via timeout.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>