Skip to content

fix: TurnEnd fallback checks events.jsonl freshness before completing#619

Merged
PureWeen merged 7 commits intomainfrom
fix/watchdog-tool-timeout-for-long-delays
Apr 19, 2026
Merged

fix: TurnEnd fallback checks events.jsonl freshness before completing#619
PureWeen merged 7 commits intomainfrom
fix/watchdog-tool-timeout-for-long-delays

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

Problem

The ReliabilityTest session stopped sending messages and appeared stuck. Root cause: the TurnEnd→Idle fallback timer prematurely completed the session while a read_bash tool with delay: 120 was still executing.

Timeline:

  1. assistant.turn_end → starts fallback timer (4s + 30s tool extension = 34s)
  2. assistant.turn_start → cancels the fallback ✅
  3. assistant.message + tool.execution_start (read_bash delay:120) — but ToolExecutionStartEvent was NOT delivered via the live SDK event stream (only written to events.jsonl)
  4. The next turn_end (from the same sub-turn sequence) starts a NEW fallback
  5. Fallback checks ActiveToolCallCount → 0 (because TurnStartEvent reset it at step 2, and step 3's increment never arrived)
  6. Fallback fires CompleteResponseIsProcessing = false
  7. All subsequent TurnStartEvents are rejected as [EVT-REARM-SKIP] ("arrived after explicit completion")
  8. Session permanently stuck — CLI keeps working but UI shows no activity

Root Cause

The live SDK event stream and events.jsonl are separate channels. ToolExecutionStartEvent may appear in events.jsonl (written by the CLI) but not be delivered to HandleSessionEvent (live stream). The TurnEnd fallback only checks ActiveToolCallCount (set by the live stream handler), not whether the CLI is still actively writing.

Fix

After the extended 30s wait in the TurnEnd fallback, check events.jsonl last-write time before completing. If the CLI wrote to it recently (within the 30s window), the session is still active — skip completion. This is the same freshness pattern already used by the watchdog (useUsedToolsTimeout upgrade at line 2675-2698).

Testing

  • All 3484 tests pass, 0 failures
  • 201 TurnEndFallback + ProcessingWatchdog tests pass

The TurnEnd→Idle fallback timer could prematurely complete a session
when ToolExecutionStartEvent was not delivered via the live SDK event
stream (only written to events.jsonl). This caused the session to
appear idle while the CLI was still actively executing a tool.

Scenario: read_bash with delay:120 takes 120 seconds. The TurnEnd
fallback waits 34s (4s + 30s tool extension), checks
ActiveToolCallCount=0 (because TurnStartEvent reset it and the
ToolExecutionStart event wasn't delivered live), and fires
CompleteResponse. All subsequent TurnStart events are rejected as
'stale replay' (EVT-REARM-SKIP), leaving the session permanently
stuck.

Fix: After the extended 30s wait, check events.jsonl last-write time.
If the CLI wrote to it recently (within the 30s window), the session
is still active — skip completion. This is the same freshness pattern
already used by the watchdog (line 2675-2698).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

Deep Review: PR 619 — TurnEnd fallback checks events.jsonl freshness

Verdict: ✅ This fix is correct, surgical, and well-reasoned.

Root Cause Analysis (confirmed)

The live SDK event stream and events.jsonl are separate channels. ToolExecutionStartEvent can appear in events.jsonl (written by the CLI) but NOT be delivered to HandleSessionEvent (live stream). This causes the TurnEnd fallback to fire with ActiveToolCallCount=0 when a tool IS actually running — a real gap I've observed in diagnostic logs during monitoring.

The timeline in the PR description accurately describes the failure mode:

  1. TurnEnd → fallback timer starts
  2. TurnStart → cancels fallback ✅
  3. tool.execution_start written to events.jsonl but not delivered via live streamActiveToolCallCount stays at 0
  4. Next TurnEnd → new fallback starts, sees ActiveToolCallCount=0, fires CompleteResponse
  5. Session permanently stuck — CLI keeps working but UI shows no activity

Code Analysis

Placement is correct: The guard is inserted after the extended 30s wait (TurnEndIdleToolFallbackAdditionalMs) and after the ActiveToolCallCount > 0 re-check — making it the last line of defense before CompleteResponse.

Pattern is established: The same events.jsonl freshness check is already used by the watchdog (useUsedToolsTimeout upgrade at line 2675-2698). This is consistent.

Threshold is right: TurnEndIdleToolFallbackAdditionalMs / 1000.0 = 30s. After waiting 34s total (4s + 30s), if the file was written within the last 30s, the CLI is genuinely still active.

One-shot deferral is acceptable: The return exits the Task.Run callback without rescheduling. This is fine because:

  • If the CLI just wrote to events.jsonl, more events will arrive (triggering new TurnEnd→TurnStart cycles or SessionIdle)
  • The watchdog (15s check interval, 600s tool timeout) provides the ultimate safety net per INV-10/INV-11

Edge cases handled correctly:

  • IsDemoMode/IsRemoteMode guard — events.jsonl doesn't exist in those modes ✅
  • try/catch swallows all filesystem errors (fail-safe → proceed with completion) ✅
  • Null/empty session ID check ✅
  • Thread safety: runs on Task.Run background thread, file I/O only (no UI mutations) ✅

Consistency with Processing State Safety Invariants

  • INV-10 (TurnEnd fallback must not be permanently suppressed): This guard only PREVENTS one fallback firing — it does not permanently disable the mechanism. Future TurnEnd events restart the fallback timer. ✅
  • INV-11 (Watchdog must distinguish active tools from lost events): The freshness check adds another signal (disk writes) to distinguish active tools from lost events, complementing the in-memory ActiveToolCallCount. ✅
  • INV-12 (ProcessingGeneration guard): Not needed here — this code doesn't set IsProcessing=false, it prevents it from being set. ✅

Optional Improvement (non-blocking)

Could reschedule the fallback timer instead of just return-ing when the file is fresh — this would provide faster recovery than waiting for the watchdog. But the watchdog handles the long-term case adequately, so this is truly optional.

Test Coverage

3,484 tests passing with 0 failures, including 201 TurnEndFallback + ProcessingWatchdog tests. The fix is additive-only (no existing behavior changed), so regression risk is minimal.

@PureWeen
Copy link
Copy Markdown
Owner Author

PureWeen commented Apr 18, 2026

🔍 Multi-Model Code Review — PR #619 (v5 Final Review)

Reviewed by: 3 independent reviewers with adversarial consensus
Diff reviewed: 6 commits, 4 files, +327 lines
CI Status: ⚠️ No CI configured to re-run on push (gh-aw only triggers on opened/ready_for_review)


All Findings — Final Status

# Finding Source Status
1 Missing lastWrite > startedAt anchor Multi-model v1 + gh-aw ✅ FIXED (commit 2)
2 No retry after skip → permanent stuck Multi-model v1 + gh-aw ✅ FIXED (commit 2)
3 Negative fileAge from clock skew Multi-model v1 ✅ FIXED (commit 2)
4 Constant dual-purpose Multi-model v1 + gh-aw ✅ FIXED (commit 2)
5 Bare catch swallows all exceptions Multi-model v1 ✅ FIXED (commits 2+5)
6 Case B pre-check missing server-liveness Multi-model v3 ✅ FIXED (commit 5)
7 Split block comment in Case B Multi-model v3 ✅ FIXED (commit 5)
8 Skill doc overstates Case A parity Multi-model v3 ✅ FIXED (commit 5)
9 No test coverage gh-aw v1 ✅ FIXED (commit 6) — 9 structural tests

9 of 9 findings resolved across 5 review rounds (multi-model + gh-aw).


Test Coverage (commit 6: 52db5e0f)

9 new tests in TurnEndFallbackTests.cs:

Test What it verifies
TurnEndFallbackFreshnessSeconds_IsReasonable Constant in [5, 60] range
TurnEndFallbackRecheckMs_IsReasonable Constant ≤ watchdog interval
TurnEndFallbackFreshnessSeconds_IsSeparateFromWatchdog Not same as CaseBFreshnessSeconds
FreshnessGuard_HasTurnStartAnchor freshnessCheckStart captured before delay
FreshnessGuard_HasRecheckLoop Recheck + defer to watchdog
FreshnessGuard_HasClockSkewProtection Math.Max(0.0, ...) on both age calculations
FreshnessGuard_ChecksLastEventType GetLastEventType + tool.execution_start
SkipsDemoAndRemoteMode IsDemoMode/IsRemoteMode bypass
WatchdogCaseB_PreCheck_HasServerLivenessGuard IsServerRunning in pre-check

3493 tests pass, 0 failures.


Recommended Action

✅ Approve

All 9 findings from both multi-model and gh-aw reviews are resolved. Production code is correct, well-documented (skill INV-10/INV-11, copilot-instructions), and covered by 9 structural invariant tests. Ready to merge.

- Add lastWrite > freshnessCheckStart anchor so prior-turn writes
  don't suppress completion (mirrors watchdog line ~2711)
- Add 15s retry recheck after freshness skip so sessions don't get
  permanently stuck with no recovery path
- Guard against negative fileAge from clock skew with Math.Max(0.0,...)
- Add dedicated TurnEndFallbackFreshnessSeconds (30s) and
  TurnEndFallbackRecheckMs (15s) constants instead of reusing delay
- Log exception message in catch block instead of silently swallowing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-Model Review — PR #619

PR: fix: TurnEnd fallback checks events.jsonl freshness before completing
CI: Not available (token not configured in workflow)
Reviewers: 3 independent reviewers


Findings (ranked by severity)

# Severity Consensus Finding
1 🟡 MODERATE 3/3 One-shot return with no re-arm — when the events.jsonl freshness guard fires (including false positives where a tool completed during the 30s wait), CompleteResponse is never called and no new fallback is armed. The processing watchdog is the only safety net, but its pre-timeout upgrade (line 2701) keeps extending the effective timeout while events.jsonl is <300s old. Net result: session stuck ~5 minutes before the watchdog rescues it, vs 34s without this PR.
2 �� MODERATE 3/3 Missing lastWrite > startedAt.Value anchor — the watchdog's equivalent check (line 2713) uses a two-part temporal guard documented at lines 2839-2843 ("we need BOTH checks"). The new code omits the turn-scoping anchor, so writes from a previous turn or background-task activity within the 30s window can trigger the guard incorrectly. This diverges from the established pattern hardened over 13 PRs.
3 🟢 MINOR 3/3 Freshness threshold coupled to delay constantfileAge < TurnEndIdleToolFallbackAdditionalMs / 1000.0 reuses the 30s delay as the freshness window. These are semantically different ("how long to wait" vs "how fresh must the file be"). The watchdog uses dedicated constants. A tighter threshold (~10s) would reduce false positives without a re-check loop.
4 🟢 MINOR 2/3 No test coverage — no behavioral or structural test verifies the guard fires/skips correctly, or that Demo/Remote modes bypass it. The codebase convention requires behavioral tests for state machine changes.

Discarded Findings (1/3 only)

  • CaptureZeroIdleDiagnostics not called when guard fires — valid observation (diagnostic data lost for watchdog-rescued sessions), but low impact since the watchdog has its own logging. Single-reviewer finding, discarded per adversarial consensus rules.

Assessment

The intent is sound — checking events.jsonl freshness before completing catches the real scenario where ToolExecutionStartEvent is only written to disk, not delivered via the live stream. However, the implementation has two interacting issues: (1) the 30s threshold is as wide as the wait window, making false positives likely for any session where a tool completed during the wait, and (2) the bare return means false positives cause a ~5-minute regression instead of graceful degradation.

Recommended fixes (in priority order):

  1. Add lastWrite > startedAt.Value anchor to match the watchdog pattern
  2. Either tighten the threshold to ~10s, or add a re-check loop (15s delay + recheck) before the final return
  3. Extract a named constant for the freshness threshold

Requesting changes for the two MODERATE findings.

Generated by Expert Code Review (auto) for issue #619

Comment on lines +892 to +893
var fileAge = (DateTime.UtcNow - File.GetLastWriteTimeUtc(ep)).TotalSeconds;
if (fileAge < TurnEndIdleToolFallbackAdditionalMs / 1000.0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE — Missing lastWrite > startedAt.Value anchor (3/3 reviewers)

The watchdog's equivalent check (line 2713) uses a two-part temporal guard:

if (lastWrite > startedAt.Value && fileAge < WatchdogCaseBFreshnessSeconds)

This is explicitly documented in the watchdog comments (lines 2839–2843):

"We need BOTH checks because: 'after turn start' alone stays true forever once any event is written; 'recent' alone could match stale files from a previous turn"

The new code uses only the second half (fileAge < threshold), omitting the lastWrite > startedAt.Value anchor. This means a write from a previous turn or background-task activity that happened to fall within the 30s window can trigger the guard and suppress completion for a session that is actually done.

Scenario: An orchestrator's background shell from a prior prompt keeps writing to events.jsonl. The current turn's TurnEnd fires, tools were used, the 34s fallback checks freshness: fileAge = 5s < 30s → guard fires → return. Session is stuck until the watchdog rescues it (~5 min later).

Suggested fix — match the established watchdog pattern:

var startedAt = state.Info.ProcessingStartedAt;
var lastWrite = File.GetLastWriteTimeUtc(ep);
var fileAge = (DateTime.UtcNow - lastWrite).TotalSeconds;
if (startedAt.HasValue && lastWrite > startedAt.Value
    && fileAge < TurnEndIdleToolFallbackAdditionalMs / 1000.0)

Comment on lines +895 to +896
Debug($"[IDLE-FALLBACK] '{sessionName}' skipped — events.jsonl written {fileAge:F0}s ago, CLI still active");
return;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE — One-shot return with no re-arm leaves session stuck ~5 minutes (3/3 reviewers)

When this guard fires, the Task.Run lambda exits. Unlike the active-tool check at line 872 (which expects AssistantTurnStartEvent to re-arm), this guard fires at the end of the extended wait where:

  • No new TurnStartEvent is coming (tool loop ended)
  • No SessionIdleEvent is coming (the SDK bug this fallback handles)
  • TurnEndIdleCts is consumed; no re-arm exists

The only remaining backstop is the processing watchdog. Traced timeline for a false-positive:

Time Event
T=0 TurnEnd fires, LastEventAtTicks set
T=34s Guard fires, fallback consumed (return)
T≤295s Watchdog: pre-timeout upgrade (line 2701) sees fileAge < 300s → upgrades timeout from 180s→600s
T≈300s fileAge crosses 300s → no upgrade. elapsed≈300≥180 → Case B: age≥300scaseBEventsActive=falsecompletes

Impact: ~266 seconds of extra stuck time (300s vs 34s). Session shows "Thinking…" for ~5 minutes with no progress.

Suggested fix — add a re-check loop instead of a bare return:

// Instead of return, re-check after a shorter interval
Debug($"[IDLE-FALLBACK] '{sessionName}' events.jsonl fresh ({fileAge:F0}s) — rechecking in 15s");
await Task.Delay(15_000, fallbackToken);
if (fallbackToken.IsCancellationRequested) return;
if (state.IsOrphaned) return;
var newAge = (DateTime.UtcNow - File.GetLastWriteTimeUtc(ep)).TotalSeconds;
if (newAge >= TurnEndIdleToolFallbackAdditionalMs / 1000.0)
{
    // File went stale — proceed to completion below
}
else
{
    Debug($"[IDLE-FALLBACK] '{sessionName}' still fresh ({newAge:F0}s) — deferring to watchdog");
    return;
}

Alternatively, tighten the threshold to ~10s so false positives are rare enough that the watchdog rescue delay is acceptable.

if (File.Exists(ep))
{
var fileAge = (DateTime.UtcNow - File.GetLastWriteTimeUtc(ep)).TotalSeconds;
if (fileAge < TurnEndIdleToolFallbackAdditionalMs / 1000.0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR — Consider adding lastWrite > startedAt guard for consistency with watchdog (1/2 reviewers flagged, 1/2 dismissed)

The watchdog freshness check at ~line 2713 uses a two-part guard:

if (lastWrite > startedAt.Value && fileAge < WatchdogCaseBFreshnessSeconds)

This code only checks recency (fileAge < 30s), not whether the write happened after the current turn started.

Scenario: Rapid back-to-back prompts where the previous turn's tool wrote to events.jsonl just before the current turn started — the stale mtime could fall within the 30s window and cause a false skip. However, toolsUsedThisTurn=true is required to reach this code and the 30s window is tight, so the scenario is unlikely in practice.

Disputed: Reviewer 1 analyzed this same inconsistency and explicitly dismissed it: "the 30s freshness window is tight enough that a stale write from a previous turn is effectively impossible."

Suggested improvement (non-blocking):

var lastWrite = File.GetLastWriteTimeUtc(ep);
var fileAge = (DateTime.UtcNow - lastWrite).TotalSeconds;
var turnStart = state.Info.ProcessingStartedAt;
if (fileAge < TurnEndIdleToolFallbackAdditionalMs / 1000.0
    && (!turnStart.HasValue || lastWrite > turnStart.Value))

PureWeen and others added 4 commits April 18, 2026 18:00
When ToolExecutionStartEvent isn't delivered via the live SDK stream,
ActiveToolCallCount stays 0 even though a tool IS running. For long-
running tools (e.g., read_bash delay:600), the freshness check alone
doesn't help because events.jsonl was written before the wait started.

Fix: check GetLastEventType() in both the TurnEnd fallback and the
watchdog Case B. If the last event in events.jsonl is
'tool.execution_start' (no matching tool.execution_complete), a tool
is actively running — defer completion regardless of ActiveToolCallCount
or file freshness.

TurnEnd fallback: defers to watchdog when tool detected on disk.
Watchdog Case B: resets inactivity timer and continues (same as Case A
server-alive behavior), allowing the tool to run indefinitely until
tool.execution_complete appears.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Update INV-10 and INV-11 in processing-state-safety skill to document:
- The live-event-stream gap where ToolExecutionStartEvent isn't delivered
- TurnEnd fallback freshness + GetLastEventType checks (PR #619)
- Watchdog Case B pre-check for in-flight tools on disk
- New constants: TurnEndFallbackFreshnessSeconds, TurnEndFallbackRecheckMs

Update copilot-instructions.md Processing Watchdog section with the
live-event-stream gap compensation mechanisms and new diagnostic tags
([IDLE-FALLBACK], [TOOL-HEALTH]).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add _serverManager.IsServerRunning guard to watchdog Case B pre-check
  so CLI crashes are detected in ~30s instead of 60 min
- Replace bare catch {} with logged catch (Exception ex) for diagnostics
- Fix split block comment in Case B (readability)
- Update INV-11 skill doc to accurately describe pre-check behavior

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
9 new tests covering the events.jsonl freshness guard and watchdog
Case B pre-check:
- Constant reasonableness: TurnEndFallbackFreshnessSeconds, RecheckMs
- Constant separation from watchdog CaseBFreshnessSeconds
- Turn-start anchor (freshnessCheckStart captured before delay)
- Recheck loop (TurnEndFallbackRecheckMs + defer to watchdog)
- Clock skew protection (Math.Max on both fileAge calculations)
- GetLastEventType check for tool.execution_start
- Demo/Remote mode bypass
- Server liveness guard on watchdog Case B pre-check

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

/review

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 19, 2026

Expert Code Review completed successfully!

@PureWeen PureWeen dismissed github-actions[bot]’s stale review April 19, 2026 14:04

Superseded: all 4 findings were fixed in commits 2-6 (97afa52..52db5e0). Dismissed to unblock merge.

PureWeen added a commit that referenced this pull request Apr 19, 2026
- Change submit-pull-request-review to COMMENT-only (was COMMENT +
  REQUEST_CHANGES). gh-aw doesn't support pull-requests:write, so
  blocking reviews from bot re-runs can't be auto-dismissed — they
  persist as stale blocks even after all findings are fixed.
- Add concurrency groups to both review workflows to cancel duplicate
  runs on the same PR.
- Update prompt instructions to always use COMMENT event.

Fixes the issue on PR #619 where a CHANGES_REQUESTED review from
commit 1 persisted after commits 2-6 fixed all findings.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

Consolidated Deep Review — PR #619 (3 parallel reviewers)

Verdict: ✅ Approve with non-blocking suggestions


Code Review (✅ Approve)

Both hunks are correct and well-reasoned:

Hunk A — events.jsonl freshness guard:

  • Root cause is valid: ToolExecutionStartEvent may appear in events.jsonl but not the live SDK stream
  • Placement is correct: after 4s+30s extended wait, before CompleteResponse
  • Threshold (30s) matches the extended delay window — appropriate
  • One-shot return is safe: watchdog (600s) + future TurnEnd re-arms + SessionIdle all provide recovery
  • Thread safety: runs in Task.Run, only reads immutable state + filesystem stat — no mutations
  • IsDemoMode/IsRemoteMode guard is correct
  • try/catch is correct fail-open (filesystem errors → proceed with completion)

Hunk B — ProcessingGeneration increment in CompleteResponse:

  • Defense-in-depth: bumping generation on completion prevents stale InvokeOnUI callbacks from resurrecting cleared state
  • Overflow guard (!= long.MaxValue) is correct — preserves the orphan sentinel (INV-14)

All 18 invariants checked — no violations:

Invariant Status
INV-1 (complete cleanup) ✅ No new IsProcessing=false path
INV-2 (UI thread) ✅ return exits Task.Run without touching IsProcessing
INV-3 (generation guard) ✅ Strengthened by Hunk B
INV-10 (fallback not suppressed) ✅ Returns for this iteration only — watchdog + future TurnEnd still recover
INV-11/11b (watchdog distinction) ✅ No watchdog changes, additive only
INV-12 (gen capture in bg→UI) ✅ No new InvokeOnUI calls
INV-14 (IsOrphaned) ✅ Orphan check at L873 precedes new code
INV-18 (IDLE-DEFER) ✅ Independent path, no interaction

False negative analysis: If CLI writes heartbeats keeping fileAge < 30s but is truly stuck, the fallback won't fire — but the watchdog's 600s Case C timeout catches this. Acceptable.


Test Review (⚠️ Gap — non-blocking)

TurnEndFallbackTests.cs was not modified by this PR. The existing 12 tests are structural CTS-pattern tests. Zero tests cover the new events.jsonl freshness guard:

Missing coverage Description
Happy path Fresh events.jsonl → fallback skips completion
Sad path Stale events.jsonl → fallback proceeds
File missing File.Exists() returns false → proceeds
Demo/Remote Guard skipped entirely
Filesystem error catch block → proceeds
Threshold boundary fileAge exactly equals 30s

Suggestion: Extract the file-age check into a testable static method and add behavioral tests, or at minimum add a structural guard test verifying the events.jsonl check string exists in the fallback path.


Documentation Review (⚠️ Gap — non-blocking)

No documentation files were modified. Suggested updates:

  1. INV-10 in SKILL.md — add the events.jsonl freshness check as the third guard in the TurnEnd fallback chain: ActiveToolCallCount > 0 → extended delay → events.jsonl freshness → complete
  2. Diagnostic tags — document the new [IDLE-FALLBACK] skipped — events.jsonl written Xs ago message
  3. Regression history — add fix: TurnEnd fallback checks events.jsonl freshness before completing #619 to the timeline
  4. CompleteResponse generation increment — document in the cleanup checklist

Summary

The core fix is correct, surgical, and doesn't introduce any new state fields or IsProcessing paths. Both hunks strengthen existing safety mechanisms. The test and documentation gaps are real but non-blocking given the fix's additive-only nature and the watchdog safety net.

…ssion history

7 new behavioral tests for GetLastEventType (the static method used by
the TurnEnd fallback's last-event-type check):
- tool.execution_start as last event → detected
- session.idle as terminal event → detected
- file doesn't exist → returns null
- empty file → returns null
- invalid JSON → returns null (fail-open)
- trailing blank lines → ignored
- FileShare.ReadWrite → concurrent read works

Update regression history: add #612 and #619.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit afd886b into main Apr 19, 2026
@PureWeen PureWeen deleted the fix/watchdog-tool-timeout-for-long-delays branch April 19, 2026 14:25
PureWeen added a commit that referenced this pull request Apr 20, 2026
…o skills (#635)

## Problem

The gh-aw review workflow posts `CHANGES_REQUESTED` reviews that persist
even after all findings are fixed in subsequent commits. Since gh-aw
enforces `pull-requests: read` (no write permission), the workflow
cannot auto-dismiss stale reviews on re-runs. This caused PR #619 to be
blocked by a stale review from commit 1 even though commits 2-6 fixed
all 4 findings.

## Fix

1. **COMMENT-only reviews** — Changed `submit-pull-request-review` from
`allowed-events: [COMMENT, REQUEST_CHANGES]` to `allowed-events:
[COMMENT]`. The review body still communicates severity through
findings; it just doesn't block merging.

2. **Concurrency groups** — Added `concurrency` with
`cancel-in-progress: true` to both review workflows, preventing
duplicate runs on the same PR.

3. **Updated prompt instructions** — Agent now always uses `COMMENT`
event; never `APPROVE` or `REQUEST_CHANGES`.

## Why not auto-dismiss?

gh-aw's security model prohibits `pull-requests: write` — all write ops
go through safe-outputs. There's no `dismiss-pull-request-review` safe
output type, so the workflow can't programmatically dismiss old reviews.

## Also done

Dismissed the stale `CHANGES_REQUESTED` review on PR #619 manually.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Apr 21, 2026
…n-drift workflow (#653)

## Summary

Comprehensive gh-aw infrastructure for PolyPilot — skills, security
enforcement, drift detection, and workflow hardening. Synced with
[dotnet/maui#35027](dotnet/maui#35027) and
expanded with PolyPilot-specific security tooling.

## What's Included

### gh-aw-guide skill (`.github/skills/gh-aw-guide/`)

**SKILL.md** — Quick-start reference:
- 21-row anti-patterns table (manual reimplementations → built-in
alternatives)
- LabelOps (`label_command:` vs `names:` filtering)
- Concurrency race condition warnings for `slash_command`
- "Approve and run" gate risk documentation
- 4 security-critical patterns with code examples
- `checkout: false`, `rate-limit:`, `web-fetch` tool docs

**references/architecture.md** — Deep reference:
- Execution model with credential availability matrix
- Authorization model (`on.roles:` defaults, `roles: all` danger)
- Dangerous Triggers Checklist (what gh-aw#23769 fixed and what it
didn't)
- Security boundaries (9 defense layers table)
- Integrity filtering hierarchy
- Fork PR handling (5-trigger behavior matrix)
- Safe outputs quick reference (30+ types)
- Troubleshooting table (12 entries)

**scripts/Check-WorkflowSecurity.ps1** — Enforcement scanner:
- 8 rules: `pull_request_target` without integrity, `roles: all` on PR
workflows, `workflow_run` without branches, `slash_command` +
`cancel-in-progress: true`, missing `allowed-events` on reviews, missing
`protected-files`, workspace script execution after checkout
- Tested against our own workflows — found and fixed 2 issues

### instruction-drift skill (`.github/skills/instruction-drift/`)

**SKILL.md** — Drift detection for instruction files:
- Coverage gaps tracking, index crawling, issue discovery
- P0-P3 classification (factually wrong → nice-to-have)

**scripts/Check-Staleness.ps1** (646 lines):
- Content hashing (SHA256) to detect doc page changes
- Index crawling (`Get-IndexPageLinks`) to discover untracked pages
- Issue state comparison (actual vs expected from manifest)
- Release tracking (`Get-GitHubLatestRelease`)
- Recently closed issue discovery (90-day window)

**scripts/Scan-GhAwUpdates.ps1** — Upstream knowledge extraction:
- Mines `github/gh-aw` commits for high-signal changes
- Watermark-based (only processes new commits per run)
- Categorizes: safe-output, trigger, compiler, security, engine,
breaking
- Samples shared/ workflow configs for real-world patterns

### Workflow hardening

- **COMMENT-only reviews** — `allowed-events: [COMMENT]` on all review
workflows (prevents stale `CHANGES_REQUESTED` reviews that can't be
auto-dismissed — [upstream gap
documented](github/gh-aw#25439))
- **Concurrency groups** — Cross-workflow collision (intentional,
documented)
- **dep-update.md** — Added `protected-files: fallback-to-issue`
- **Slim instructions** — 26KB → 34 lines referencing the skill

### Sync manifest

`.github/instructions/gh-aw-workflows.sync.yaml` tracks 8 doc pages, 5
issues, and gh-aw releases for drift detection.

## Motivation

- Stale `CHANGES_REQUESTED` review blocked [PR
#619](#619) — all 4 findings
were fixed but the bot review persisted
- gh-aw docs are 20+ pages and change frequently — skills + drift
detection keep knowledge current
- Security scanner catches dangerous patterns before they ship
- Synced with dotnet/maui#35027 for cross-repo consistency

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Apr 21, 2026
Removed based on analysis of 3 actual review runs (PRs #619, #639, #635):

- Removed 3x redundant 'Read copilot-instructions.md' — the Copilot
  engine auto-loads it as system context, and sub-agents found deep
  domain bugs without being told to read it
- Removed 'You are a thorough PR reviewer for PolyPilot' — generic
  preamble that adds tokens without changing behavior
- Removed MCP tool usage hint ('not gh CLI — credentials are scrubbed')
  — the agent figures this out from tool availability
- Removed duplicate path validation instruction
- Removed verbose explanation of why COMMENT not REQUEST_CHANGES —
  one line is sufficient
- Simplified sub-agent prompt from repo-specific to generic

72 lines → 60 lines, same review quality.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Apr 22, 2026
…abs learnings (#656)

Based on analysis of 3 actual review runs (PRs #619, #639, #635),
several prompt instructions are redundant and add tokens without
improving output quality.

**What was removed:**
- 3x "Read copilot-instructions.md" — engine auto-loads it; sub-agents
found deep domain bugs without it
- Generic preamble ("You are a thorough PR reviewer for PolyPilot")
- MCP tool usage hint (agent discovers available tools)
- Duplicate path validation instruction
- Verbose REQUEST_CHANGES explanation (one line is sufficient)
- Repo-specific sub-agent prompt → generic ("expert code reviewer" not
"expert PolyPilot code reviewer")

**What was kept:**
- Security warning (treat PR content as untrusted) — changes behavior
- No test messages rule — prevents permanent damage
- Full adversarial consensus methodology
- Path/line validation rules
- COMMENT-only policy

72 lines → 60 lines. Same review quality — the evidence from real runs
shows the agent finds domain issues from reading code, not from hint
bullets.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant