Add watchdog diagnostic for MCP integration test hangs #134
Merged
Conversation
Every few PRs lately, a MacOSMCPTests case has timed out at exactly 600s on CI with zero output: the Swift Testing `.timeLimit(.minutes(10))` fires after ten minutes of silence, then the GHA step kill truncates the run three minutes later. The test's own internal timeouts never get the chance to dump the server stderr they were designed to capture, because when the cooperative thread pool starves, the `Task.sleep(for:)` inside the timeout primitive can't resume — so `Issue.record` paths never run and CI failure logs contain nothing actionable (see runs 72166579769, 72170918105, 72234336170, 72323677364, 72328816376 — five failures across four different MCP tests in 36 hours).

This change makes the next occurrence debuggable without attempting a speculative fix for the underlying starvation. Each `MCPTestServer` now schedules a `DispatchSourceTimer` on a libdispatch queue (not the Swift cooperative pool) that fires every 60s and writes the server subprocess's elapsed runtime plus the tail of its captured stderr to the test process's own stderr via a blocking `fputs`. Tests that complete in under 60s are silent on the happy path; a hanging test produces one heartbeat per minute with concrete context for post-mortem.

CI also gets a new post-failure step that dumps every `previewsmcp-test-*.log` under `$TMPDIR` / `/tmp`, mirroring the existing daemon-log dump on the CLI step so the per-server stderr files survive as job logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
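A minimal sketch of the watchdog shape this description implies; identifiers such as `ServerWatchdog` and `stderrLogURL` are illustrative, not the PR's actual names, and note that the follow-up commit below replaces this GCD approach with a detached thread:

```swift
import Dispatch
import Foundation

// Illustrative sketch, not the PR's actual code. A DispatchSourceTimer on a
// plain libdispatch queue, deliberately off the Swift cooperative pool, that
// heartbeats the subprocess's elapsed runtime and stderr tail every 60s.
final class ServerWatchdog {
    private let timer: DispatchSourceTimer

    init(stderrLogURL: URL) {
        let start = Date()
        let queue = DispatchQueue(label: "mcp-test-watchdog", qos: .utility)
        timer = DispatchSource.makeTimerSource(queue: queue)
        timer.schedule(deadline: .now() + .seconds(60), repeating: .seconds(60))
        timer.setEventHandler {
            let elapsed = Int(Date().timeIntervalSince(start))
            let tail = (try? String(contentsOf: stderrLogURL, encoding: .utf8))
                .map { String($0.suffix(2_000)) } ?? "<no stderr captured yet>"
            // Blocking libc write to the test process's own stderr; no Swift
            // concurrency is involved, so cooperative-pool starvation cannot
            // silence it.
            fputs("[watchdog] server alive for \(elapsed)s; stderr tail:\n\(tail)\n", stderr)
            fflush(stderr)
        }
        timer.resume()
    }

    deinit { timer.cancel() }
}
```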
The first attempt used `DispatchSource.makeTimerSource` with QoS `.utility`.
CI run 72345678664 hit the exact hang this watchdog was built for —
MacOSMCPTests.hotReloadStructural timed out silently at 600s — and the
watchdog emitted no heartbeats. libdispatch's utility-QoS worker
threads are the first to starve when a busy-spin pegs the cores; no
scheduling guarantees kept the timer alive.
A detached Thread (`Thread.detachNewThread`) is a real kernel thread
scheduled by the OS rather than libdispatch, and `Thread.sleep` blocks
in the kernel instead of suspending a cooperative task. That sidesteps
QoS starvation entirely. The cancellation flag moves from
`DispatchSourceTimer.cancel()` to an `OSAllocatedUnfairLock<Bool>` the
thread polls between sleeps.
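A sketch of the replacement shape, assuming a simple polled flag; the names and the poll cadence are illustrative, not the commit's actual code:

```swift
import Foundation
import os

// Illustrative sketch of the Thread-based watchdog. A detached thread is a
// real kernel thread: Thread.sleep blocks in the kernel, so neither
// cooperative-pool nor libdispatch QoS starvation can stop the heartbeats.
final class ServerWatchdog {
    private let cancelled = OSAllocatedUnfairLock(initialState: false)

    func start(heartbeat: @escaping @Sendable () -> Void) {
        let cancelled = self.cancelled  // struct copy sharing the same lock
        Thread.detachNewThread {
            while true {
                Thread.sleep(forTimeInterval: 60)  // kernel-level block
                if cancelled.withLock({ $0 }) { return }
                heartbeat()  // fputs-based; no Swift concurrency
            }
        }
    }

    func stop() {
        cancelled.withLock { $0 = true }  // thread notices on its next tick
    }
}
```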
The post-failure stderr dump step added in the same PR worked as
intended — it's what revealed the actual root cause: the daemon
successfully compiled and reloaded ("Compiled:" + "Reloaded!" in
server stderr) but MCPTestServer's subsequent preview_snapshot
callTool never returned. That's a separate bug in the MCP client /
transport layer and will be investigated with the better diagnostic
now in place.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
obj-p added a commit that referenced this pull request on Apr 22, 2026
* Add targeted stderr instrumentation in preview_snapshot + reload path

Phase 4A of issue #135. Phases 1-3 catch the daemon post-reload wedge (CLI stall-detects at 30s, test harness pthread-terminates at 30s), but we still don't know WHERE the handler hangs. The server stderr dump from PR #134 currently goes silent after "Reloaded!", leaving root-cause analysis speculative.

This PR adds ~8 targeted `fputs` markers through `handlePreviewSnapshot` and one after the reload's `MainActor.run` block. Each marker writes to the server subprocess's stderr — captured in the per-instance temp file and dumped by PR #134's post-failure CI step. When the next flake hits, the CI log will show exactly how far the handler got before the wedge:

    [snapshot] enter
    [snapshot] pre session-check          ← MainActor queue blocked here?
    [snapshot] post session-check
    [snapshot] post 300ms sleep           ← Task.sleep problem?
    [snapshot] main-actor capture enter   ← MainActor work starts
    [snapshot] main-actor capture done    ← Snapshot.capture returned
    [snapshot] encoding
    [snapshot] returning

Plus on the reload side, after the `MainActor.run` block containing `loadPreview`:

    [reload] main-actor block returned

No behavior change. `fputs` + `fflush` is synchronous libc, no Swift concurrency involvement, no allocations beyond the literal strings. Verified locally: the full 7-stage sequence appears on the happy path of `hotReloadStructural`.

Phase 4B (the actual fix) is gated on seeing what a real CI failure shows. This PR is the diagnostic prerequisite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address review: add throw-path marker, TODO tag, user-visibility note

Round-1 review of #139 flagged three optional polishes worth applying before merge:

- The `[snapshot] main-actor capture enter` marker fires on entry, but the `captureFailed` throw path (when `host.window(for:)` returns nil) exits the MainActor block without a matching marker. Added a do/catch around the block with a `[snapshot] main-actor capture threw: ...` marker. This way the CI dump shows we exited the block one way or another, even on error paths.
- Added a `TODO(#135-4B)` tag on the instrumentation's doc comment so the cleanup obligation is grep-visible when Phase 4B lands.
- Expanded the doc comment with two clarifications the reviewer verified: (1) markers are not user-visible because `DaemonClient.registerStderrLogForwarder` forwards MCP `LogMessageNotification` payloads, not raw daemon stderr; (2) macOS snapshot handler only — iOS returns early above.

No behavior change. `swift test --filter hotReloadStructural` passes in 8.5s; manually verified the new throw-path marker does not fire on the happy path (only on the failure path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Remove 300ms pre-capture Task.sleep (Phase 4B root-cause fix for #135)

Phase 4A (same PR, prior commit) added markers that caught the post-reload wedge on the first CI run after merge. Run 72439165932's server stderr dump showed:

    [snapshot] enter
    [snapshot] pre session-check
    [snapshot] post session-check   ← last marker before silence

The next marker `[snapshot] post 300ms sleep` never fires. The ONLY code between those two markers is:

    guard isMacOSSession else { ... }               // no await
    try await Task.sleep(for: .milliseconds(300))   // ← wedges here

The hang is `Task.sleep`, which ultimately depends on Swift's cooperative scheduler to resume. On a hot-reload test that polls snapshots rapidly without swiftc breaks (the literal-only path in particular — dozens of rapid snapshot calls per `awaitSnapshotChange` loop), the pool accumulates enough load that the timer's continuation stays unscheduled. Test `hotReloadLiteralOnly` hit this at around the 8th snapshot cycle on CI run 72439165932. This is exactly the failure mode issue #135 described — and the reason PR #136 built a pthread-based timeout for the test harness. Same pattern, different place.

The sleep itself is unnecessary: `Snapshot.capture` calls `NSView.cacheDisplay(in:to:)`, which forces layout synchronously when the view is dirty. The MCP client→server round trip already gives the UI plenty of time to settle between view mount and capture.

Verified: all 7 MacOSMCPTests pass without the sleep — including `hotReloadLiteralOnly` and `hotReloadStructural`, both of which assert image bytes actually change after the reload. The sleep was added in the initial commit (820b0cd) with the generic comment "Wait briefly for SwiftUI to finish layout". No evidence it was ever load-bearing; it appears to have been defensive paranoia that became load-bearing only as an unintended starvation trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
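The marker mechanism itself is tiny; a sketch of the shape, with the helper name assumed rather than taken from the repo:

```swift
import Foundation

// Illustrative helper (the name `mark` is assumed). fputs + fflush is
// synchronous libc with no Swift concurrency involvement, so a marker either
// appears in the captured stderr or the code genuinely never reached it;
// there is no third state.
func mark(_ text: String) {
    fputs(text + "\n", stderr)
    fflush(stderr)
}

// Call sites follow the sequence quoted above, e.g.:
//   mark("[snapshot] enter")
//   ...
//   mark("[reload] main-actor block returned")
```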
Summary
Five `MacOSMCPTests` failures in the past 36 hours across four different tests have the same signature: `Time limit was exceeded: 600.000 seconds`, zero intermediate output, step killed by the 15-minute GHA timeout. The tests' internal timeouts (30s/90s with `Issue.record`-on-throw) never fire because the Swift concurrency cooperative pool is starved — so their `Task.sleep`-based timers can't resume. This PR adds a GCD-based watchdog on every `MCPTestServer` that survives cooperative-pool starvation and makes the next failure actually diagnosable.
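For context, a sketch of the kind of `Task.sleep`-based timeout primitive described here; this is an assumed shape, not the repo's actual helper. The failure mode is that the timeout branch itself needs the starved pool to resume:

```swift
// Assumed shape of a cooperative timeout like the tests' 30s/90s guards;
// not the repo's actual primitive. Both child tasks run on the cooperative
// pool: if that pool is starved, the sleep never resumes, TimeoutError is
// never thrown, and the downstream Issue.record path never executes, which
// matches the silent 600s hangs.
struct TimeoutError: Error {}

func withTimeout<T: Sendable>(
    _ limit: Duration,
    operation: @escaping @Sendable () async throws -> T
) async throws -> T {
    try await withThrowingTaskGroup(of: T.self) { group in
        group.addTask { try await operation() }
        group.addTask {
            try await Task.sleep(for: limit)  // cannot resume under starvation
            throw TimeoutError()
        }
        defer { group.cancelAll() }
        return try await group.next()!
    }
}
```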
Evidence
All five land on the same `MacOSMCPTests.swift` test-function decorator line, never reach the test body's `Issue.record` paths, and get killed by the 15-minute step timeout. Not reproducible locally — all 12 MCP tests pass in ~65s.
What this changes
Watchdog in `MCPTestServer`
On the happy path, zero heartbeats — tests that complete in under 60s never emit. On a hang, one line per minute gives a minute-by-minute trail of what the server subprocess was doing.
CI post-failure dump
`build-and-test` now has a "Dump MCP server stderr on failure" step mirroring the existing CLI-step equivalent. It concatenates every `previewsmcp-test-*.log` (the per-instance stderr files `MCPTestServer` creates) plus the daemon's `serve.log` into the job logs.
AGENTS.md
One-line note under "CI-specific concerns" documenting the watchdog and the "do not replace with `Task.sleep`" invariant.
Deliberately out of scope
Not attempting to fix the starvation itself. The pattern (five failures across four tests, all at exactly 600s) points to a systemic issue — likely the MCP SDK's `AsyncThrowingStream` busy-spin mentioned in `MCPTestServer.stop()`'s existing comment, or an accumulation across serialized tests. But (a) it doesn't reproduce locally, (b) any speculative fix can't be validated without CI cycles, and (c) the diagnostic unlocks targeted investigation once we see what the server was actually doing.
Test plan
Blocks
Intended to merge before PR #133, which is being held back: the hot-reload flake is unrelated to that PR's code but blocks its CI.
🤖 Generated with Claude Code