
Make MCPTestServer.withTimeout starvation-immune via pthread timer#136

Merged
obj-p merged 2 commits into main from fix/mcp-test-pthread-timeout
Apr 21, 2026

Conversation


@obj-p obj-p commented Apr 21, 2026

Phase 3 of the issue #135 implementation plan. Unblocks #133's CI by making the test-harness timeout primitive fire even when the Swift cooperative thread pool is pegged in the MCP SDK's busy-spin. First of three PRs; Phases 1 and 2 (daemon-side heartbeat + client-side stall detection for production callers) will follow.

Problem

`MCPTestServer.withTimeout` raced `body` against `Task.sleep(for:)` inside a `withThrowingTaskGroup`. When the MCP SDK's transport loop is busy-spinning on a wedged server — the pattern in issue #135 — the cooperative thread pool is exhausted and `Task.sleep` never resumes to throw. The 30s `callToolWithTimeout` deadline silently passes; only Swift Testing's outer `.timeLimit(.minutes(10))` catches it, by which point 10 minutes of CI have been burned with no diagnostic context.

Three CI runs hit this in the past 36 hours: 72323677364, 72328816376, 72345678664.
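A minimal sketch of that starvation-prone shape, assuming names like `MCPTestError` from the test target (the real harness code may differ):

```swift
import Foundation

enum MCPTestError: Error { case timeout }

// Sketch of the prior, starvation-prone pattern described above
// (names are assumptions, not the repository's real code).
func withTimeout<T: Sendable>(
    _ duration: Duration,
    body: @escaping @Sendable () async throws -> T
) async throws -> T {
    try await withThrowingTaskGroup(of: T.self) { group in
        group.addTask { try await body() }
        group.addTask {
            // If every cooperative thread is pinned by a busy-spin,
            // this sleep never resumes and the deadline silently passes.
            try await Task.sleep(for: duration)
            throw MCPTestError.timeout
        }
        let result = try await group.next()!
        group.cancelAll()
        return result
    }
}
```

Both racers here depend on the cooperative pool, which is exactly the weakness the fix below removes.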

Fix

`withTimeout` is restructured around a `withCheckedThrowingContinuation` with two racers:

  1. Body path: an ordinary `Task` that resumes the continuation on success/failure.
  2. Timeout path: `Thread.detachNewThread` (a real kernel thread, not GCD, not Swift concurrency) that `Thread.sleep`s for the duration, then calls `process.terminate()` on the server subprocess and resumes the continuation with `TestTimeoutSentinel`.

`OSAllocatedUnfairLock<Bool>` serializes exactly-once resume across the race. `CheckedContinuation.resume` is a synchronous call with no cooperative-pool dependency, so the caller unblocks regardless of pool state.
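Under the assumptions above (the `terminating:` parameter and exact signature are guesses; `TestTimeoutSentinel` comes from this PR's description), the restructured primitive can be sketched as:

```swift
import Foundation
import os

struct TestTimeoutSentinel: Error {}

// Sketch of the restructured withTimeout; MCPTestServer's real code
// may differ in shape and naming.
func withTimeout<T: Sendable>(
    _ duration: Duration,
    terminating process: Process,
    body: @escaping @Sendable () async throws -> T
) async throws -> T {
    let resumed = OSAllocatedUnfairLock(initialState: false)
    // Exactly one racer wins; the loser's resume becomes a no-op.
    let claimResume: @Sendable () -> Bool = {
        resumed.withLock { (done: inout Bool) -> Bool in
            if done { return false }
            done = true
            return true
        }
    }
    return try await withCheckedThrowingContinuation { continuation in
        // Racer 1: the body as an ordinary Task.
        Task {
            do {
                let value = try await body()
                if claimResume() { continuation.resume(returning: value) }
            } catch {
                if claimResume() { continuation.resume(throwing: error) }
            }
        }
        // Racer 2: a real kernel thread, immune to cooperative-pool starvation.
        Thread.detachNewThread {
            let seconds = TimeInterval(duration.components.seconds)
                + TimeInterval(duration.components.attoseconds) / 1e18
            Thread.sleep(forTimeInterval: seconds)
            process.terminate() // unwedge the busy-spin before resuming
            if claimResume() { continuation.resume(throwing: TestTimeoutSentinel()) }
        }
    }
}
```

Terminating the subprocess before resuming matters: killing the wedged server ends the SDK's busy-spin, so cooperative threads free up for the resumed caller.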

Tradeoff

The body `Task` may remain suspended forever if its pending MCP continuation is never drained (the SDK's transport EOF does not resume pendingRequests — see `Client.swift:234-276` vs `:304-342`; only explicit `disconnect()` drains them, and that's itself async, so we can't run it from the pthread). This leak is accepted because:

  • The caller has already been resumed via the continuation — the test body completes.
  • The subprocess is terminated by the pthread, so subsequent tests start clean.
  • The process exits shortly after the test run.

What this PR does NOT do

This is strictly test-harness hardening that gets CI green and unblocks #133.

Test plan

  • swift build — clean
  • swift-format lint --strict --recursive Sources/ Tests/ examples/ — clean
  • swift test --filter MacOSMCPTests — 6/6 green locally, 67.241s total
  • New regression test `withTimeoutFiresUnderStarvation`: the body occupies a cooperative thread with POSIX `sleep(60)` (no yield), the pthread fires at a 2s budget, and the test asserts elapsed ∈ [2, 5) and `subprocess.isRunning == false`. Passes in 2.013s.
  • CI on this PR — confirms nothing else regressed and Phase 3 doesn't introduce new flakes.
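The regression test described above might be sketched as follows; `MCPTestServer.start()`, its `process` property, and the `withTimeout`/`TestTimeoutSentinel` names are assumptions from this PR's description, not verified API:

```swift
import Foundation
import Testing

// Sketch of the regression test described above (assumed names).
@Test func withTimeoutFiresUnderStarvation() async throws {
    let server = try await MCPTestServer.start()
    let clock = ContinuousClock()
    let start = clock.now
    do {
        _ = try await withTimeout(.seconds(2), terminating: server.process) {
            sleep(60) // POSIX sleep: pins a cooperative thread without yielding
        }
        Issue.record("pthread timeout should have fired")
    } catch is TestTimeoutSentinel {
        // Expected: the kernel-thread racer won despite the pinned pool.
    }
    let elapsed = clock.now - start
    #expect(elapsed >= .seconds(2) && elapsed < .seconds(5))
    #expect(!server.process.isRunning)
}
```

POSIX `sleep` is the key choice: unlike `Task.sleep`, it never yields, so it models a pegged cooperative thread exactly.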

Related

🤖 Generated with Claude Code

obj-p and others added 2 commits April 21, 2026 16:25
The prior implementation raced `body` against `Task.sleep(for:)` inside a
`withThrowingTaskGroup`. When the MCP SDK's transport loop busy-spins on
a wedged server — the exact failure mode in issue #135 — the cooperative
thread pool is exhausted and `Task.sleep` never resumes to throw, so the
30s `callToolWithTimeout` deadline is silently missed. CI runs 72323677364,
72328816376, and 72345678664 all died that way: 10 minutes of no output,
then Swift Testing's outer `.timeLimit` kill, then the 15-minute GHA step
truncation — no diagnostic context produced.

This is Phase 3 of the implementation plan for issue #135.

Replace the task-group with a `withCheckedThrowingContinuation` + two
racers:

  1. Body path: ordinary Task that resumes the continuation on
     success/failure.
  2. Timeout path: `Thread.detachNewThread` (real kernel thread, not GCD,
     not Swift concurrency) that `Thread.sleep`s for the duration, then
     calls `process.terminate()` on the server subprocess and resumes
     the continuation with `TestTimeoutSentinel`.

`OSAllocatedUnfairLock<Bool>` guards exactly-once resume across the race.
`CheckedContinuation.resume` is a synchronous call with no
cooperative-pool dependency, so the caller is unblocked regardless of
pool state.

The body Task may remain suspended forever if its pending MCP
continuation is never drained — acceptable, because the caller has
already been resumed via the continuation, the subprocess has been
terminated so the next test starts clean, and the process exits
shortly after.

New regression test `withTimeoutFiresUnderStarvation` occupies a
cooperative thread with `sleep(60)` (POSIX, no async availability
annotation unlike `Thread.sleep`) as the body, and asserts the pthread
still fires within `timeout + 3s` and terminates the subprocess.
Passes in 2.013s locally against a 2s budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ol-based SDK refs

Review of #136 flagged three items, all addressed here:

- Important: the regression test's dummy `/bin/sleep` subprocess inherited
  the parent's stdout/stderr. MCPTestServer.start()'s own comment warns
  this pattern previously wedged CI runners on macOS 15. Route dummy
  stdio to `FileHandle.nullDevice` to match the hardened pattern.
- Nit: `Duration.components.seconds + .attoseconds/1e18` was duplicated
  in withTimeout and the regression test. Extract `Duration.asTimeInterval`
  (internal to the test target, colocated with `MCPTestError`).
- Nit: withTimeout's docstring referenced swift-sdk `Client.swift:234-276`
  and `:304-342`. Those line numbers rot on every SDK bump. Replace with
  the stable symbols `Client.handleMessage(_:)`, `Client.disconnect()`,
  and `pendingRequests`.
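The extracted helper might look like this sketch (the name and placement come from the review note above; the exact shape is an assumption):

```swift
import Foundation

extension Duration {
    /// Sketch of the extracted helper described above: seconds plus
    /// attoseconds/1e18, as a TimeInterval.
    var asTimeInterval: TimeInterval {
        TimeInterval(components.seconds)
            + TimeInterval(components.attoseconds) / 1e18
    }
}
```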

Behavior unchanged; all 6 MacOSMCPTests still pass in ~68s locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@obj-p obj-p merged commit 8a25f5f into main Apr 21, 2026
4 checks passed
@obj-p obj-p deleted the fix/mcp-test-pthread-timeout branch April 21, 2026 21:46
obj-p added a commit that referenced this pull request Apr 22, 2026

* Add targeted stderr instrumentation in preview_snapshot + reload path

Phase 4A of issue #135. Phases 1-3 catch the daemon post-reload wedge
(CLI stall-detects at 30s, test harness pthread-terminates at 30s),
but we still don't know WHERE the handler hangs. The server stderr
dump from PR #134 currently goes silent after "Reloaded!", leaving
root-cause analysis speculative.

This PR adds ~8 targeted `fputs` markers through `handlePreviewSnapshot`
and one after the reload's `MainActor.run` block. Each marker writes
to the server subprocess's stderr — captured in the per-instance
temp file and dumped by PR #134's post-failure CI step. When the
next flake hits, the CI log will show exactly how far the handler
got before the wedge:

  [snapshot] enter
  [snapshot] pre session-check          ← MainActor queue blocked here?
  [snapshot] post session-check
  [snapshot] post 300ms sleep           ← Task.sleep problem?
  [snapshot] main-actor capture enter   ← MainActor work starts
  [snapshot] main-actor capture done    ← Snapshot.capture returned
  [snapshot] encoding
  [snapshot] returning

Plus on the reload side, after the MainActor.run block containing
`loadPreview`:

  [reload] main-actor block returned

No behavior change. `fputs` + `fflush` is synchronous libc, no Swift
concurrency involvement, no allocations beyond the literal strings.
Verified locally: the full 7-stage sequence appears on the happy
path of `hotReloadStructural`.
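A helper with those properties might look like this sketch; the name `mark` and the returned line (for testability) are my additions — the commit only specifies `fputs` + `fflush` with no Swift-concurrency involvement:

```swift
import Foundation

/// Sketch of a marker helper consistent with the description above.
@discardableResult
func mark(_ label: String) -> String {
    let line = "[snapshot] \(label)\n"
    fputs(line, stderr)   // synchronous libc write to the captured stderr
    fflush(stderr)        // flush immediately so a later wedge can't eat it
    return line
}
```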

Phase 4B (the actual fix) is gated on seeing what a real CI failure
shows. This PR is the diagnostic prerequisite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address review: add throw-path marker, TODO tag, user-visibility note

Round-1 review of #139 flagged three optional polishes worth applying
before merge:

- The `[snapshot] main-actor capture enter` marker fires on entry but
  the `captureFailed` throw path (when `host.window(for:)` returns nil)
  exits the MainActor block without a matching marker. Added a
  do/catch around the block with a `[snapshot] main-actor capture
  threw: ...` marker. This way the CI dump shows we exited the block
  one way or another, even on error paths.

- Added a `TODO(#135-4B)` tag on the instrumentation's doc comment so
  the cleanup obligation is grep-visible when Phase 4B lands.

- Expanded the doc comment with two clarifications the reviewer
  verified: (1) markers are not user-visible because
  `DaemonClient.registerStderrLogForwarder` forwards MCP
  `LogMessageNotification` payloads, not raw daemon stderr;
  (2) macOS snapshot handler only — iOS returns early above.

No behavior change. `swift test --filter hotReloadStructural` passes
in 8.5s; manually verified the new throw-path marker does not fire
on the happy path (only on the failure path).
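The do/catch marker shape might be sketched like this; the wrapper name and `SnapshotError.captureFailed` are assumptions based on this PR's description:

```swift
import Foundation

enum SnapshotError: Error { case captureFailed } // assumed error type

/// Sketch: wrap the MainActor capture work so the CI stderr dump always
/// records how the block exited, even on the throw path.
func withCaptureMarkers<T: Sendable>(
    _ work: @escaping @MainActor @Sendable () throws -> T
) async throws -> T {
    do {
        let value = try await MainActor.run {
            fputs("[snapshot] main-actor capture enter\n", stderr); fflush(stderr)
            let v = try work()
            fputs("[snapshot] main-actor capture done\n", stderr); fflush(stderr)
            return v
        }
        return value
    } catch {
        // Matching exit marker on the error path.
        fputs("[snapshot] main-actor capture threw: \(error)\n", stderr)
        fflush(stderr)
        throw error
    }
}
```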

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Remove 300ms pre-capture Task.sleep (Phase 4B root-cause fix for #135)

Phase 4A (same PR, prior commit) added markers that caught the
post-reload wedge on the first CI run after merge. Run 72439165932's
server stderr dump showed:

  [snapshot] enter
  [snapshot] pre session-check
  [snapshot] post session-check     ← last marker before silence

The next marker `[snapshot] post 300ms sleep` never fires. The ONLY
code between those two markers is:

  guard isMacOSSession else { ... }           // no await
  try await Task.sleep(for: .milliseconds(300))  // ← wedges here

The hang is `Task.sleep`, which ultimately depends on Swift's
cooperative scheduler to resume. On a hot-reload test that polls
snapshots rapidly without swiftc breaks (the literal-only path in
particular — dozens of rapid snapshot calls per `awaitSnapshotChange`
loop), the pool accumulates enough load that the timer's continuation
stays unscheduled. Test `hotReloadLiteralOnly` hit this at around
the 8th snapshot cycle on CI run 72439165932.

This is exactly the failure mode issue #135 described — and the
reason PR #136 built a pthread-based timeout for the test harness.
Same pattern, different place.

The sleep itself is unnecessary: `Snapshot.capture` calls
`NSView.cacheDisplay(in:to:)`, which forces layout synchronously
when the view is dirty. The MCP client→server round trip already
gives the UI plenty of time to settle between view mount and
capture. Verified: all 7 MacOSMCPTests pass without the sleep —
including `hotReloadLiteralOnly` and `hotReloadStructural`, both
of which assert image bytes actually change after the reload.

The sleep was added in the initial commit (820b0cd) with the
generic comment "Wait briefly for SwiftUI to finish layout". No
evidence it was ever load-bearing; it appears to have been defensive
paranoia whose only real effect was to act as an unintended
starvation trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
