
Add watchdog diagnostic for MCP integration test hangs #134

Merged
obj-p merged 3 commits into main from fix/mcp-integration-test-flake
Apr 22, 2026

Conversation


obj-p (Owner) commented Apr 21, 2026

Summary

Five MacOSMCPTests failures in the past 36 hours, across four different tests, have the same signature: `Time limit was exceeded: 600.000 seconds`, zero intermediate output, and the step killed by the 15-minute GHA timeout. The tests' internal timeouts (30s/90s with `Issue.record`-on-throw) never fire because the Swift concurrency cooperative pool is starved, so their `Task.sleep`-based timers can't resume.
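
For context, those internal timeouts are the usual race-a-timer-against-the-body pattern. A minimal sketch of that shape, with hypothetical names (`withTimeout`, `TimeoutError`) that are not the project's actual API:

```swift
struct TimeoutError: Error {}

func withTimeout<T: Sendable>(
    seconds: Double,
    _ body: @escaping @Sendable () async throws -> T
) async throws -> T {
    try await withThrowingTaskGroup(of: T.self) { group in
        group.addTask { try await body() }
        group.addTask {
            // The timer lives on the cooperative pool. If the pool is starved,
            // this continuation never resumes and the timeout never fires.
            try await Task.sleep(for: .seconds(seconds))
            throw TimeoutError()
        }
        let result = try await group.next()!
        group.cancelAll()
        return result
    }
}
```

Under starvation the sleeping child task never resumes, so the `Issue.record` path that would dump diagnostics never runs.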

This PR adds a GCD-based watchdog on every `MCPTestServer` that survives cooperative-pool starvation and makes the next failure actually diagnosable.

Evidence

| Run | PR | Test that timed out silently |
| --- | --- | --- |
| 72166579769 | #130 attempt | `preview_variants captures multiple configurations` |
| 72170918105 | #131 attempt | `structuredContent payloads decode for each migrated tool` |
| 72234336170 | #132 attempt | `preview_variants captures multiple configurations` |
| 72323677364 | #133 run 1 | `File edit triggers hot reload (structural recompile path)` |
| 72328816376 | #133 run 2 | `File edit triggers hot reload (structural recompile path)` |

All five land on the same `MacOSMCPTests.swift` test-function decorator line, never reach the test body's `Issue.record` paths, and get killed by the 15-minute step timeout. Not reproducible locally — all 12 MCP tests pass in ~65s.

What this changes

Watchdog in `MCPTestServer`

  • `DispatchSourceTimer` on `DispatchQueue.global(qos: .utility)` fires every 60s.
  • Handler reads the captured per-instance stderr log and writes `[MCPTestServer watchdog t=Xs] alive — server stderr tail: …` to the test process's own stderr via `fputs` + `fflush` (blocking I/O, thread-pool-independent).
  • `stop()` cancels the timer; `[weak server]` handler is a no-op if the server has been released.

On the happy path, zero heartbeats — tests that complete in under 60s never emit. On a hang, one line per minute gives a minute-by-minute trail of what the server subprocess was doing.
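
A minimal sketch of this watchdog shape, assuming a per-instance `stderrLogURL` and a `startWatchdog()`/`stop()` pair (illustrative names and structure, not the actual `MCPTestServer` code):

```swift
import Foundation

final class WatchdogSketch {
    private var timer: DispatchSourceTimer?
    private let startedAt = Date()
    private let stderrLogURL: URL

    init(stderrLogURL: URL) {
        self.stderrLogURL = stderrLogURL
    }

    func startWatchdog() {
        let source = DispatchSource.makeTimerSource(queue: .global(qos: .utility))
        source.schedule(deadline: .now() + 60, repeating: 60)
        source.setEventHandler { [weak self] in
            guard let self else { return }  // no-op once the server is released
            let elapsed = Int(Date().timeIntervalSince(self.startedAt))
            let tail = (try? String(contentsOf: self.stderrLogURL, encoding: .utf8))?
                .split(separator: "\n").suffix(5).joined(separator: "\n") ?? ""
            // Blocking libc I/O: does not depend on the Swift cooperative pool.
            fputs("[MCPTestServer watchdog t=\(elapsed)s] alive — server stderr tail: \(tail)\n", stderr)
            fflush(stderr)
        }
        source.resume()
        timer = source
    }

    func stop() {
        timer?.cancel()
        timer = nil
    }
}
```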

CI post-failure dump

`build-and-test` now has a "Dump MCP server stderr on failure" step mirroring the existing CLI-step equivalent. It concatenates every `previewsmcp-test-*.log` (the per-instance stderr files that `MCPTestServer` creates) plus the daemon's `serve.log` into the job logs.

AGENTS.md

One-line note under "CI-specific concerns" documenting the watchdog and the "do not replace with `Task.sleep`" invariant.

Deliberately out of scope

Not attempting to fix the starvation itself. The pattern (5 different tests hit, all at 600s exactly) points to a systemic issue — likely the MCP SDK's `AsyncThrowingStream` busy-spin mentioned in `MCPTestServer.stop()`'s existing comment, or an accumulation across serialized tests. But (a) it doesn't reproduce locally, (b) any speculative fix can't be validated without CI cycles, and (c) the diagnostic unlocks targeted investigation once we see what the server was actually doing.

Test plan

  • `swift build` — clean
  • `swift test --filter "MCPIntegrationTests" --skip "IOSMCPTests"` — 12/12 pass, 65s total. No heartbeats emitted (all tests < 60s). Expected.
  • CI on this PR — confirms watchdog doesn't break anything and the post-failure dump step works when a test legitimately fails.
  • Next CI flake on any PR — heartbeat output in job logs (`[MCPTestServer watchdog t=60s] ...`) and per-instance stderr dump in post-failure step.

Blocks

Intended to merge before PR #133, which is currently held: the hot-reload flake is unrelated to that PR's code but blocks its CI.

🤖 Generated with Claude Code

obj-p and others added 3 commits April 21, 2026 21:33
Every few PRs lately, a MacOSMCPTests case has timed out at exactly 600s
on CI with zero output: the Swift Testing `.timeLimit(.minutes(10))`
fires after ten minutes of silence, then the GHA step kill truncates
the run three minutes later. The tests' own internal timeouts never get
a chance to dump the server stderr they were designed to capture:
when the cooperative thread pool starves, the `Task.sleep(for:)` inside
the timeout primitive can't resume, so `Issue.record` paths never run and
CI failure logs contain nothing actionable (see runs 72166579769,
72170918105, 72234336170, 72323677364, 72328816376 — five failures
across four different MCP tests in 36 hours).

This change makes the next occurrence debuggable without attempting a
speculative fix for the underlying starvation. Each `MCPTestServer`
now schedules a `DispatchSourceTimer` on a libdispatch queue (not the
Swift cooperative pool) that fires every 60s and writes the server
subprocess's elapsed runtime plus the tail of its captured stderr to
the test process's own stderr via a blocking `fputs`. Tests that
complete in under 60s are silent on the happy path; a hanging test
produces one heartbeat per minute with concrete context for post-mortem.

CI also gets a new post-failure step that dumps every
`previewsmcp-test-*.log` under `$TMPDIR` / `/tmp`, mirroring the
existing daemon-log dump on the CLI step so the per-server stderr
files survive as job logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first attempt used DispatchSource.makeTimerSource with qos .utility.
CI run 72345678664 hit the exact hang this watchdog was built for —
MacOSMCPTests.hotReloadStructural timed out silently at 600s — and the
watchdog emitted no heartbeats. libdispatch's utility-QoS worker
threads are the first to starve when a busy-spin pegs the cores; no
scheduling guarantees kept the timer alive.

A detached Thread (`Thread.detachNewThread`) is a real kernel thread
scheduled by the OS rather than libdispatch, and `Thread.sleep` blocks
in the kernel instead of suspending a cooperative task. That sidesteps
QoS starvation entirely. The cancellation flag moves from
`DispatchSourceTimer.cancel()` to an `OSAllocatedUnfairLock<Bool>` the
thread polls between sleeps.
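
A minimal sketch of that shape, reusing the heartbeat format from the
first attempt (names and structure are illustrative, not the actual
implementation; `OSAllocatedUnfairLock` requires macOS 13+):

```swift
import Foundation
import os

final class ThreadWatchdogSketch {
    private let cancelled = OSAllocatedUnfairLock(initialState: false)
    private let startedAt = Date()

    func start(stderrLogURL: URL) {
        let cancelled = self.cancelled
        let startedAt = self.startedAt
        // A detached Thread is a real kernel thread scheduled by the OS,
        // so it keeps running even when libdispatch QoS tiers are starved.
        Thread.detachNewThread {
            while true {
                Thread.sleep(forTimeInterval: 60)  // kernel-level sleep, no cooperative task
                if cancelled.withLock({ $0 }) { return }
                let elapsed = Int(Date().timeIntervalSince(startedAt))
                let tail = (try? String(contentsOf: stderrLogURL, encoding: .utf8))?
                    .split(separator: "\n").suffix(5).joined(separator: "\n") ?? ""
                fputs("[MCPTestServer watchdog t=\(elapsed)s] alive — server stderr tail: \(tail)\n", stderr)
                fflush(stderr)
            }
        }
    }

    func stop() {
        cancelled.withLock { $0 = true }
    }
}
```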

The post-failure stderr dump step added in the same PR worked as
intended — it's what revealed the actual root cause: the daemon
successfully compiled and reloaded ("Compiled:" + "Reloaded!" in
server stderr) but MCPTestServer's subsequent preview_snapshot
callTool never returned. That's a separate bug in the MCP client /
transport layer and will be investigated with the better diagnostic
now in place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
obj-p force-pushed the fix/mcp-integration-test-flake branch from de1255a to 23adda3 on April 22, 2026 at 01:36
obj-p merged commit f0fbad2 into main on Apr 22, 2026
4 checks passed
obj-p deleted the fix/mcp-integration-test-flake branch on April 22, 2026 at 03:37
obj-p added a commit that referenced this pull request Apr 22, 2026

* Add targeted stderr instrumentation in preview_snapshot + reload path

Phase 4A of issue #135. Phases 1-3 catch the daemon post-reload wedge
(CLI stall-detects at 30s, test harness pthread-terminates at 30s),
but we still don't know WHERE the handler hangs. The server stderr
dump from PR #134 currently goes silent after "Reloaded!", leaving
root-cause analysis speculative.

This PR adds ~8 targeted `fputs` markers through `handlePreviewSnapshot`
and one after the reload's `MainActor.run` block. Each marker writes
to the server subprocess's stderr — captured in the per-instance
temp file and dumped by PR #134's post-failure CI step. When the
next flake hits, the CI log will show exactly how far the handler
got before the wedge:

  [snapshot] enter
  [snapshot] pre session-check          ← MainActor queue blocked here?
  [snapshot] post session-check
  [snapshot] post 300ms sleep           ← Task.sleep problem?
  [snapshot] main-actor capture enter   ← MainActor work starts
  [snapshot] main-actor capture done    ← Snapshot.capture returned
  [snapshot] encoding
  [snapshot] returning

Plus on the reload side, after the MainActor.run block containing
`loadPreview`:

  [reload] main-actor block returned

No behavior change. `fputs` + `fflush` is synchronous libc, no Swift
concurrency involvement, no allocations beyond the literal strings.
Verified locally: the full 7-stage sequence appears on the happy
path of `hotReloadStructural`.
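
The marker mechanism itself is tiny; a sketch of the kind of helper
involved (the helper name is hypothetical, the marker strings mirror
the ones above):

```swift
import Foundation

// Illustrative sketch only, not the project's exact code.
@inline(__always)
func snapshotMark(_ message: String) {
    // Synchronous libc write to the server subprocess's stderr: no Swift
    // concurrency hops, so markers land even when the cooperative pool wedges.
    fputs("[snapshot] \(message)\n", stderr)
    fflush(stderr)
}

// Usage inside the handler (hypothetical structure):
// snapshotMark("enter")
// snapshotMark("pre session-check")
// ...
// snapshotMark("returning")
```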

Phase 4B (the actual fix) is gated on seeing what a real CI failure
shows. This PR is the diagnostic prerequisite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address review: add throw-path marker, TODO tag, user-visibility note

Round-1 review of #139 flagged three optional polishes worth applying
before merge:

- The `[snapshot] main-actor capture enter` marker fires on entry but
  the `captureFailed` throw path (when `host.window(for:)` returns nil)
  exits the MainActor block without a matching marker. Added a
  do/catch around the block with a `[snapshot] main-actor capture
  threw: ...` marker. This way the CI dump shows we exited the block
  one way or another, even on error paths.

- Added a `TODO(#135-4B)` tag on the instrumentation's doc comment so
  the cleanup obligation is grep-visible when Phase 4B lands.

- Expanded the doc comment with two clarifications the reviewer
  verified: (1) markers are not user-visible because
  `DaemonClient.registerStderrLogForwarder` forwards MCP
  `LogMessageNotification` payloads, not raw daemon stderr;
  (2) macOS snapshot handler only — iOS returns early above.

No behavior change. `swift test --filter hotReloadStructural` passes
in 8.5s; manually verified the new throw-path marker does not fire
on the happy path (only on the failure path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Remove 300ms pre-capture Task.sleep (Phase 4B root-cause fix for #135)

Phase 4A (same PR, prior commit) added markers that caught the
post-reload wedge on the first CI run after merge. Run 72439165932's
server stderr dump showed:

  [snapshot] enter
  [snapshot] pre session-check
  [snapshot] post session-check     ← last marker before silence

The next marker `[snapshot] post 300ms sleep` never fires. The ONLY
code between those two markers is:

  guard isMacOSSession else { ... }           // no await
  try await Task.sleep(for: .milliseconds(300))  // ← wedges here

The hang is `Task.sleep`, which ultimately depends on Swift's
cooperative scheduler to resume. On a hot-reload test that polls
snapshots rapidly without swiftc breaks (the literal-only path in
particular — dozens of rapid snapshot calls per `awaitSnapshotChange`
loop), the pool accumulates enough load that the timer's continuation
stays unscheduled. Test `hotReloadLiteralOnly` hit this at around
the 8th snapshot cycle on CI run 72439165932.

This is exactly the failure mode issue #135 described — and the
reason PR #136 built a pthread-based timeout for the test harness.
Same pattern, different place.

The sleep itself is unnecessary: `Snapshot.capture` calls
`NSView.cacheDisplay(in:to:)`, which forces layout synchronously
when the view is dirty. The MCP client→server round trip already
gives the UI plenty of time to settle between view mount and
capture. Verified: all 7 MacOSMCPTests pass without the sleep —
including `hotReloadLiteralOnly` and `hotReloadStructural`, both
of which assert image bytes actually change after the reload.
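
For reference, a synchronous capture along these lines needs no
pre-capture delay because `cacheDisplay(in:to:)` drives layout and
drawing itself; this is a hedged sketch, not the project's
`Snapshot.capture`:

```swift
import AppKit

@MainActor
func captureSnapshot(of view: NSView) -> Data? {
    let bounds = view.bounds
    guard let rep = view.bitmapImageRepForCachingDisplay(in: bounds) else { return nil }
    // Forces a synchronous layout + draw pass into the bitmap rep.
    view.cacheDisplay(in: bounds, to: rep)
    return rep.representation(using: .png, properties: [:])
}
```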

The sleep was added in the initial commit (820b0cd) with the
generic comment "Wait briefly for SwiftUI to finish layout". No
evidence it was ever load-bearing; it appears to have been defensive
paranoia that became load-bearing only as an unintended starvation
trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
