fix: Expire zombie subagents blocking IDLE-DEFER after 20 minutes #511
When a subagent crashes or is orphaned, the CLI never fires SubagentCompleted/SubagentFailed. The IDLE-DEFER guard (which blocks premature session completion while background tasks are active) would then block the session indefinitely. This reproduces the case where one of 8 subagents (started 40+ min prior) prevented the orchestrator from ever finishing.

This change tracks when IDLE-DEFER was first entered for the current turn (SubagentDeferStartedAt). HasActiveBackgroundTasks now accepts the defer start timestamp and returns false once SubagentZombieTimeoutMinutes (20) has elapsed, unblocking the session. Shells are never expired, because their lifecycle is managed at the OS level.

Also adds ZombieSubagentExpiryTests (13 cases) covering fresh, threshold, expired, mixed, and shell-only scenarios.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
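The expiry rule described above can be sketched as a pure function. This is a minimal illustration, not the PR's actual code: the `BackgroundTasks` record, the `IdleDeferGuard` class name, the injected `nowUtc` parameter, and the choice to treat exactly 20 minutes as expired are all assumptions made for the example. Only the names `HasActiveBackgroundTasks` and `SubagentZombieTimeoutMinutes` and the value 20 come from the description.

```csharp
using System;

// Hypothetical stand-in for the session's tracked background work;
// the real types are not shown in this PR.
public sealed record BackgroundTasks(int ActiveSubagents, int ActiveShells);

public static class IdleDeferGuard
{
    // Timeout value taken from the PR description.
    public const int SubagentZombieTimeoutMinutes = 20;

    // Returns true while the session should keep deferring completion.
    // nowUtc is injected (an assumption of this sketch) so the rule is testable.
    public static bool HasActiveBackgroundTasks(
        BackgroundTasks tasks, DateTime? deferStartedAtUtc, DateTime nowUtc)
    {
        // Shells are never expired: any active shell keeps the guard up.
        if (tasks.ActiveShells > 0) return true;

        // Nothing running in the background at all.
        if (tasks.ActiveSubagents == 0) return false;

        // No defer timestamp recorded yet: treat the subagents as live.
        if (deferStartedAtUtc is null) return true;

        // Subagents count as active only until the zombie timeout elapses.
        TimeSpan waited = nowUtc - deferStartedAtUtc.Value;
        return waited < TimeSpan.FromMinutes(SubagentZombieTimeoutMinutes);
    }
}
```

Keeping the function free of clock and state access is what lets the fresh, threshold, expired, mixed, and shell-only cases be checked with plain inputs.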
🔍 Multi-Model Code Review: PR #511
PR: fix: Expire zombie subagents blocking IDLE-DEFER after 20 minutes
🔴 CRITICAL: Stale
| Finding | Flagged by | Reason discarded |
|---|---|---|
| ??= set-before-check ordering (timestamp set even when no BG tasks) | 1/3 | Harmless: immediately cleared by CompleteResponse on fall-through. Other reviewers verified as non-issue. |
Test Coverage Assessment
The new ZombieSubagentExpiryTests.cs (195 lines, 14 tests) provides excellent coverage of the pure HasActiveBackgroundTasks function. However, the critical risk of this PR is state lifecycle management, not the helper logic — and no lifecycle tests exist for the new field.
⚠️ Recommendation: Request Changes
The core logic is sound and well-designed. The zombie expiry concept is correct, the shell exclusion is proper, the ??= first-write semantics are right within a single turn, and the test suite for the helper function is thorough.
However, the stale SubagentDeferStartedAt across turns is a 🔴 CRITICAL regression risk that can silently kill legitimate subagents after any abort/watchdog/error. This follows the exact pattern this codebase has regressed on 13+ times (companion fields not cleared in all paths).
Specific asks:
- Add `state.SubagentDeferStartedAt = null;` alongside every `state.HasDeferredIdle = false;` (16 locations)
- Convert `DateTime?` to `long` ticks with `Interlocked` operations
- Move `Console.WriteLine` to use `Debug()` / diagnostic log pipeline
- Add at least one lifecycle test for cross-turn stale timestamp scenario
…logging

Addresses all four review findings on PR #511:

🔴 CRITICAL: SubagentDeferStartedAt cleared in all 17 HasDeferredIdle paths
SubagentDeferStartedAt and HasDeferredIdle are an inseparable companion pair. The field was only cleared in CompleteResponse; the other 16 paths (SendPromptAsync, AbortSessionAsync, watchdog abort, reconnect, error handlers, sibling reconnect, new-state reset, etc.) left it stale, causing zombie expiry to fire immediately on the next turn's first IDLE-DEFER. Added Interlocked.Exchange(ref ..., 0L) alongside every HasDeferredIdle = false assignment.

🟡 MODERATE: DateTime? → long ticks with Interlocked for thread safety
DateTime? is a 12–16 byte struct; reads/writes are not atomic. Replaced with SubagentDeferStartedAtTicks (long, matching the TurnEndReceivedAtTicks pattern). Set via Interlocked.CompareExchange(0 → now) to preserve the first-write timestamp; cleared via Interlocked.Exchange(0); read via Interlocked.Read.

🟡 MODERATE: Console.WriteLine → Debug() at call site
HasActiveBackgroundTasks is static and cannot call Debug(). Moved the zombie expiry log to the IDLE-DEFER call site with tag [IDLE-DEFER-ZOMBIE] so it routes through the diagnostics pipeline (~/.polypilot/event-diagnostics.log), consistent with all other session state transitions.

🟢 MINOR: Cross-turn stale timestamp test added
StaleDeferTimestamp_FromPriorTurn_NewTurnShouldNotExpireAgents documents why clearing SubagentDeferStartedAtTicks at every turn boundary is required: passing a 25-min-old ticks value for a brand-new IDLE-DEFER causes immediate zombie expiry and kills live subagents. Tests updated to use long ticks throughout.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
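The set/clear/read pattern named in this commit message can be sketched as follows. This is an illustration under assumptions: the class `DeferTimestampSketch` and its method names are hypothetical, and only the three `Interlocked` calls and the "0 means unset" convention are taken from the commit text.

```csharp
using System;
using System.Threading;

// Hypothetical holder for the defer timestamp; the real session state
// class is not shown in this PR.
public sealed class DeferTimestampSketch
{
    // 0 means "IDLE-DEFER has not been entered this turn".
    private long _subagentDeferStartedAtTicks;

    // First write wins: the value is stored only if the field is still 0,
    // so repeated IDLE-DEFER events in one turn keep the original timestamp.
    public void MarkDeferStarted(long nowTicks) =>
        Interlocked.CompareExchange(ref _subagentDeferStartedAtTicks, nowTicks, 0L);

    // Unconditional reset, meant to sit next to every HasDeferredIdle = false.
    public void ClearDeferStarted() =>
        Interlocked.Exchange(ref _subagentDeferStartedAtTicks, 0L);

    // Atomic 64-bit read, safe on 32-bit platforms as well.
    public long ReadDeferStartedTicks() =>
        Interlocked.Read(ref _subagentDeferStartedAtTicks);
}
```

A caller would typically pass `DateTime.UtcNow.Ticks` to `MarkDeferStarted`. Because the compare-exchange only succeeds against 0, a missed clear at a turn boundary means the next turn inherits the old timestamp, which is the stale-timestamp failure the cross-turn test documents.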
Thanks for the thorough review! All four findings addressed in 6d615e7:
- 🔴 CRITICAL: Stale
- 🟡 MODERATE
- 🟡 MODERATE
- 🟢 MINOR: Cross-turn stale timestamp test
PR #511 Review (R3): ✅ SHIP IT
3/3 reviewers verified all 18 companion-pair sites. No issues found. Merging.
…ocessing

ForceCompleteProcessing (Organization.cs:2215) had HasDeferredIdle = false without the companion SubagentDeferStartedAtTicks clear. This violated the stated invariant that the two fields are an inseparable companion pair. A stale timestamp leaking across turns would cause immediate false zombie expiry on the next IDLE-DEFER.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
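One way to make a missed site like this harder to repeat is to route both writes through a single helper. The sketch below is an illustrative alternative, not what the PR does (the PR adds the clear at each assignment site); `DeferStateSketch` and `ClearDeferredIdle` are hypothetical names, and the two field names are borrowed from the commit messages.

```csharp
using System.Threading;

// Hypothetical state holder showing the companion pair side by side.
public sealed class DeferStateSketch
{
    public bool HasDeferredIdle;
    public long SubagentDeferStartedAtTicks; // 0 means unset

    // Clears both fields together so the pair cannot diverge
    // when a new completion or abort path is added later.
    public void ClearDeferredIdle()
    {
        HasDeferredIdle = false;
        Interlocked.Exchange(ref SubagentDeferStartedAtTicks, 0L);
    }
}
```

With a helper like this, a search for direct `HasDeferredIdle = false` assignments outside the helper would flag any path that bypasses the companion clear.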
Split from #508. Separate concern — handles zombie subagent entries that prevent session completion.