Skip to content

Make MCP hot-reload test diagnosable; split by reload path#126

Merged
obj-p merged 1 commit intomainfrom
fix/mcp-hotreload-test-diagnostics
Apr 19, 2026
Merged

Make MCP hot-reload test diagnosable; split by reload path#126
obj-p merged 1 commit intomainfrom
fix/mcp-hotreload-test-diagnostics

Conversation

@obj-p
Copy link
Copy Markdown
Owner

@obj-p obj-p commented Apr 19, 2026

Summary

PR #124's CI hang (10-min timeout, 9+ min of zero output from hotReload in MacOSMCPTests) was a flake, but the test's design made it effectively undebuggable: the server subprocess's stderr was sent to /dev/null, synchronization was a blind 5s sleep, the step and per-test timeouts raced, and assertValidImage only checked header bytes — a silent no-op reload would pass. This PR addresses all four.

Test harness (MCPTestServer.swift)

  • Capture subprocess stderr to a per-instance temp file. A plain file handle avoids both pitfalls documented in the prior /dev/null comment (dispatch-source retention; shared-fd NSApplication interference).
  • Signal-based sync helpers: awaitStderrContains + stderrLog let tests wait on a real event and surface diagnostics on timeout via Issue.record.
  • withTimeout primitive + callToolWithTimeout so any single MCP call localizes a hang instead of voiding the whole test.
  • snapshotBytes + awaitSnapshotChange so reload tests can assert rendered pixels actually changed, not just that the snapshot endpoint works.

Test split (MacOSMCPTests.swift)

The original test edited a string literal ("My Items""Tasks"), which hits the DesignTimeStore fast path and never exercises swiftc. Now split into:

  • hotReloadLiteralOnly — edits "Progress""Totals" (a literal in a visibly-rendered SummaryCard), waits for "Literal-only change:", asserts bytes differ.
  • hotReloadStructural — appends .padding(32) after .navigationTitle, forcing .structural from LiteralDiffer, waits for "Compiled:", asserts bytes differ.

CI (ci.yml)

  • Bump MCP integration tests step timeout 10 → 15 min so .timeLimit(.minutes(10)) fires first on future hangs with a source-located Testing-library error.

Test plan

  • swift test --filter "MacOSMCPTests" locally — 5 tests pass in 59s total.
  • Inspected a /tmp/previewsmcp-test-*.log from a run: captures "MCP server starting on stdio…", "Literal-only change: 1 value(s)", "Applied 1 literal change(s)…".
  • Forced a failure (temporarily mis-targeted the byte-diff) and confirmed the Issue.record dump printed the full server stderr log, not a blanket GHA timeout.
  • CI passes.

Follow-up: #125 (inline snapshots in terminals) is unchanged by this PR.

The hotReload test in MacOSMCPTests hung for 9+ minutes on a recent CI
run with zero output: the server subprocess's stderr was hardcoded to
/dev/null, the test used a blind 5s sleep instead of a sync signal, and
`assertValidImage` only checked header bytes (so a silent no-op reload
could pass). The step timeout and per-test .timeLimit were both 10 min,
so GHA killed the job before Swift Testing produced a diagnosable error.

Changes to MCPTestServer:
- Capture subprocess stderr to a per-instance temp file. Writing to a
  plain file handle avoids both pitfalls documented in the prior
  /dev/null comment (Pipe+readabilityHandler CFRunLoop retention;
  shared-stderr NSApplication interference).
- Add awaitStderrContains / stderrLog for signal-based sync and
  diagnostic dumps.
- Add withTimeout primitive + callToolWithTimeout wrapper so any
  single MCP call localizes a hang instead of letting the whole test
  void out.
- Add snapshotBytes / awaitSnapshotChange so reload tests can assert
  the rendered pixels actually changed.

Changes to MacOSMCPTests:
- Split hotReload into hotReloadLiteralOnly (fast path, "Literal-only
  change:" sync) and hotReloadStructural (slow path, "Compiled:" sync).
  Both assert snapshot bytes differ from a pre-edit baseline.
- Replace blind sleeps with the new stderr-signal + snapshot-change
  polling helpers.

Changes to ci.yml:
- Bump the "MCP integration tests" step to 15 min so the per-test
  .timeLimit(.minutes(10)) fires first on any hang, producing a
  source-located Testing error.

Local run: 5 tests in MacOSMCPTests pass in 59s total.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@obj-p obj-p merged commit 3a8f8cf into main Apr 19, 2026
7 of 8 checks passed
@obj-p obj-p deleted the fix/mcp-hotreload-test-diagnostics branch April 19, 2026 21:48
obj-p added a commit that referenced this pull request Apr 20, 2026
Issue #127 tracked two CLI integration test flakes that escalated into
a 6-test cascade on PR #124's rerun. Root cause: CLIRunner.runProcess
used withCheckedThrowingContinuation with no cancellation handler. When
.timeLimit(.minutes(20)) fired, the test Task was cancelled and the
DaemonTestLock unlock ran — but the subprocess kept running and the
daemon stayed busy serving it. The next test acquired the released lock,
tried to talk to the still-busy daemon, and hung.

This ports PR #126's MCP-test playbook to CLI integration tests:
diagnostics first, cancellation safety, source-located failures.

CLIRunner (Tests/CLIIntegrationTests/CLIRunner.swift):
- Per-subcommand timeout table: 60s default; start/run 240s; kill-daemon
  10s. Override per call via the new `timeout:` parameter.
- runProcess wrapped in withTaskCancellationHandler. State machine
  (OSAllocatedUnfairLock<RunState>) guards the continuation against
  double-resume across the four terminal paths (normal exit, timeout,
  cancellation, run() throw). On cancellation/timeout: SIGTERM, then
  SIGKILL after a 2s grace.
- Stderr captured to a per-call temp file (mirrors MCPTestServer's
  pattern from PR #126: avoids both the CFRunLoop-retention bug of
  Pipe+readabilityHandler and the shared-stderr NSApplication startup
  hang of inheriting FileHandle.standardError).
- On timeout: subprocess stderr + the daemon's serve.log are dumped via
  Issue.record so the next CI failure produces actionable diagnostics
  instead of just a step-kill.

DaemonTestLock (Tests/CLIIntegrationTests/DaemonTestLock.swift):
- Optional `context: String` parameter (defaults to #function) is written
  into serve.log as a "=== TEST: <ctx> @ <iso8601> ===" marker so CI
  dumps are greppable to the failing test's window.
- serve.log is truncated once per process (didTruncateLog flag) so a
  failure dump shows just the failing run rather than accumulated
  history. Truncation runs INSIDE the lock so concurrent suites can't
  race.

SnapshotCommandTests:
- Added `cleanSlate()` (kill-daemon --timeout 2) parity with
  SwitchCommandTests. Without it, a wedged daemon from a prior test
  silently corrupts subsequent snapshots; with A1's subprocess kill in
  place, the wedged daemon would still persist between tests until the
  next cleanSlate.

MCPServer.handlePreviewSwitch (Sources/PreviewsCLI/MCPServer.swift):
- Added daemon-side bounds check on previewIndex in BOTH iOS and macOS
  paths before delegating to switchPreview. Returns isError=true with
  "Preview index N out of range (available: 0..<M)". The compile path
  already validated, but the observed `switch 99` hang was upstream of
  switchPreview entirely — this belt guarantees a fast structured
  failure regardless. SwitchCommandTests.switchOutOfRange now asserts
  on the exact substring so a regression that bypasses this check fails
  loud rather than silently passing on any non-zero exit.

CI (.github/workflows/ci.yml):
- "Dump daemon log on CLI test failure" step added after the CLI
  integration tests step in the build-and-test job (mirror of the iOS
  job's existing block). On step failure, dumps $PREVIEWSMCP_SOCKET_DIR
  /serve.log + lists the socket dir.
- CLI step timeout cut from 60m → 15m. Per-test .timeLimit cut from 20m
  → 5m (10m for the two iOS-simulator tests). Warm-cache local wall-
  clock is ~3 min for 61 tests with the slowest individual test at 33s
  and slowest body times around 8s. The new bounds preserve substantial
  cold-cache headroom while ensuring that on a hang the per-test
  .timeLimit fires first with a source-located Testing error rather
  than a bare GHA step kill.

Verification:
- swift test --filter CLIIntegrationTests --skip snapshotIOS --skip
  iosCLIWorkflow → 61/61 green in 174s.
- swift test --filter PreviewsCoreTests → 204/204 green.
- Negative test: `CLIRunner.run("serve", timeout: .seconds(3))` failed
  at exactly 3.007s with the diagnostic dump and SIGTERM reaching the
  daemon — confirms the timeout + kill path end-to-end.

Stage B (the underlying color-scheme bug) and the upstream switch hang
are deliberately unfixed here — they require a diagnostic-rich
reproduction to investigate without speculation. The infrastructure
landed here makes that reproduction possible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant