Make MCP hot-reload test diagnosable; split by reload path#126
Merged
Conversation
The hotReload test in MacOSMCPTests hung for 9+ minutes on a recent CI run with zero output: the server subprocess's stderr was hardcoded to /dev/null, the test used a blind 5s sleep instead of a sync signal, and `assertValidImage` only checked header bytes (so a silent no-op reload could pass). The step timeout and per-test .timeLimit were both 10 min, so GHA killed the job before Swift Testing produced a diagnosable error. Changes to MCPTestServer: - Capture subprocess stderr to a per-instance temp file. Writing to a plain file handle avoids both pitfalls documented in the prior /dev/null comment (Pipe+readabilityHandler CFRunLoop retention; shared-stderr NSApplication interference). - Add awaitStderrContains / stderrLog for signal-based sync and diagnostic dumps. - Add withTimeout primitive + callToolWithTimeout wrapper so any single MCP call localizes a hang instead of letting the whole test void out. - Add snapshotBytes / awaitSnapshotChange so reload tests can assert the rendered pixels actually changed. Changes to MacOSMCPTests: - Split hotReload into hotReloadLiteralOnly (fast path, "Literal-only change:" sync) and hotReloadStructural (slow path, "Compiled:" sync). Both assert snapshot bytes differ from a pre-edit baseline. - Replace blind sleeps with the new stderr-signal + snapshot-change polling helpers. Changes to ci.yml: - Bump the "MCP integration tests" step to 15 min so the per-test .timeLimit(.minutes(10)) fires first on any hang, producing a source-located Testing error. Local run: 5 tests in MacOSMCPTests pass in 59s total. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
obj-p
added a commit
that referenced
this pull request
Apr 20, 2026
Issue #127 tracked two CLI integration test flakes that escalated into a 6-test cascade on PR #124's rerun. Root cause: CLIRunner.runProcess used withCheckedThrowingContinuation with no cancellation handler. When .timeLimit(.minutes(20)) fired, the test Task was cancelled and the DaemonTestLock unlock ran — but the subprocess kept running and the daemon stayed busy serving it. The next test acquired the released lock, tried to talk to the still-busy daemon, and hung. This ports PR #126's MCP-test playbook to CLI integration tests: diagnostics first, cancellation safety, source-located failures. CLIRunner (Tests/CLIIntegrationTests/CLIRunner.swift): - Per-subcommand timeout table: 60s default; start/run 240s; kill-daemon 10s. Override per call via the new `timeout:` parameter. - runProcess wrapped in withTaskCancellationHandler. State machine (OSAllocatedUnfairLock<RunState>) guards the continuation against double-resume across the four terminal paths (normal exit, timeout, cancellation, run() throw). On cancellation/timeout: SIGTERM, then SIGKILL after a 2s grace. - Stderr captured to a per-call temp file (mirrors MCPTestServer's pattern from PR #126: avoids both the CFRunLoop-retention bug of Pipe+readabilityHandler and the shared-stderr NSApplication startup hang of inheriting FileHandle.standardError). - On timeout: subprocess stderr + the daemon's serve.log are dumped via Issue.record so the next CI failure produces actionable diagnostics instead of just a step-kill. DaemonTestLock (Tests/CLIIntegrationTests/DaemonTestLock.swift): - Optional `context: String` parameter (defaults to #function) is written into serve.log as a "=== TEST: <ctx> @ <iso8601> ===" marker so CI dumps are greppable to the failing test's window. - serve.log is truncated once per process (didTruncateLog flag) so a failure dump shows just the failing run rather than accumulated history. Truncation runs INSIDE the lock so concurrent suites can't race. SnapshotCommandTests: - Added `cleanSlate()` (kill-daemon --timeout 2) parity with SwitchCommandTests. Without it, a wedged daemon from a prior test silently corrupts subsequent snapshots; with A1's subprocess kill in place, the wedged daemon would still persist between tests until the next cleanSlate. MCPServer.handlePreviewSwitch (Sources/PreviewsCLI/MCPServer.swift): - Added daemon-side bounds check on previewIndex in BOTH iOS and macOS paths before delegating to switchPreview. Returns isError=true with "Preview index N out of range (available: 0..<M)". The compile path already validated, but the observed `switch 99` hang was upstream of switchPreview entirely — this belt guarantees a fast structured failure regardless. SwitchCommandTests.switchOutOfRange now asserts on the exact substring so a regression that bypasses this check fails loud rather than silently passing on any non-zero exit. CI (.github/workflows/ci.yml): - "Dump daemon log on CLI test failure" step added after the CLI integration tests step in the build-and-test job (mirror of the iOS job's existing block). On step failure, dumps $PREVIEWSMCP_SOCKET_DIR /serve.log + lists the socket dir. - CLI step timeout cut from 60m → 15m. Per-test .timeLimit cut from 20m → 5m (10m for the two iOS-simulator tests). Warm-cache local wall- clock is ~3 min for 61 tests with the slowest individual test at 33s and slowest body times around 8s. The new bounds preserve substantial cold-cache headroom while ensuring that on a hang the per-test .timeLimit fires first with a source-located Testing error rather than a bare GHA step kill. Verification: - swift test --filter CLIIntegrationTests --skip snapshotIOS --skip iosCLIWorkflow → 61/61 green in 174s. - swift test --filter PreviewsCoreTests → 204/204 green. - Negative test: `CLIRunner.run("serve", timeout: .seconds(3))` failed at exactly 3.007s with the diagnostic dump and SIGTERM reaching the daemon — confirms the timeout + kill path end-to-end. Stage B (the underlying color-scheme bug) and the upstream switch hang are deliberately unfixed here — they require a diagnostic-rich reproduction to investigate without speculation. The infrastructure landed here makes that reproduction possible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Apr 20, 2026
6 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
PR #124's CI hang (10-min timeout, 9+ min of zero output from
hotReloadinMacOSMCPTests) was a flake, but the test's design made it effectively undebuggable: the server subprocess's stderr was sent to/dev/null, synchronization was a blind 5s sleep, the step and per-test timeouts raced, andassertValidImageonly checked header bytes — a silent no-op reload would pass. This PR addresses all four.Test harness (
MCPTestServer.swift)/dev/nullcomment (dispatch-source retention; shared-fd NSApplication interference).awaitStderrContains+stderrLoglet tests wait on a real event and surface diagnostics on timeout viaIssue.record.withTimeoutprimitive +callToolWithTimeoutso any single MCP call localizes a hang instead of voiding the whole test.snapshotBytes+awaitSnapshotChangeso reload tests can assert rendered pixels actually changed, not just that the snapshot endpoint works.Test split (
MacOSMCPTests.swift)The original test edited a string literal (
"My Items"→"Tasks"), which hits the DesignTimeStore fast path and never exercisesswiftc. Now split into:hotReloadLiteralOnly— edits"Progress"→"Totals"(a literal in a visibly-rendered SummaryCard), waits for"Literal-only change:", asserts bytes differ.hotReloadStructural— appends.padding(32)after.navigationTitle, forcing.structuralfromLiteralDiffer, waits for"Compiled:", asserts bytes differ.CI (
ci.yml)MCP integration testsstep timeout 10 → 15 min so.timeLimit(.minutes(10))fires first on future hangs with a source-located Testing-library error.Test plan
swift test --filter "MacOSMCPTests"locally — 5 tests pass in 59s total./tmp/previewsmcp-test-*.logfrom a run: captures "MCP server starting on stdio…", "Literal-only change: 1 value(s)", "Applied 1 literal change(s)…".Issue.recorddump printed the full server stderr log, not a blanket GHA timeout.Follow-up: #125 (inline snapshots in terminals) is unchanged by this PR.