Detect daemon version mismatch and transparently restart (#142)#144
Merged
Detect daemon version mismatch and transparently restart (#142)#144
Conversation
…142) The daemon persists across CLI invocations (ADB-style), so after `brew upgrade` the new CLI binary talks to the old daemon until manually killed. v0.12.0 introduced a heartbeat protocol that old daemons don't emit, so upgraded users hit 30s stall-disconnects on every command. DaemonClient.connect now captures the Initialize.Result (previously discarded), compares serverInfo.version to the CLI's compile-time version, and on mismatch: 1. Acquires an advisory flock on ~/.previewsmcp/restart.lock so concurrent CLIs serialize instead of stampeding the daemon. 2. Always re-probes under the lock — a sibling CLI may have already fixed the mismatch while we waited, in which case we use its fresh daemon instead of killing it. 3. SIGTERMs the stale daemon, polls until gone, spawns fresh, reconnects. Guarded by a single-shot retry: if the new daemon still mismatches, errors naming _PREVIEWSMCP_TEST_DAEMON_VERSION as the suspect env var. Version compare uses versionsMatch(): strict equality when both sides carry the git-describe suffix (so dev-iteration builds honestly restart on protocol-affecting changes), base-only match when exactly one side is a dev build atop a release (so release-to-dev doesn't thrash). spawnDaemon unconditionally strips _PREVIEWSMCP_TEST_DAEMON_VERSION from the child env so a stray shell-rc export can't cause respawn loops. Version-mismatch respawns propagate _PREVIEWSMCP_DAEMON_ RESTART_REASON so the new daemon logs a one-line breadcrumb to serve.log for diagnostics. Three integration tests exercise happy-path restart, no-op when versions match, and concurrent-CLI serialization (asserting exactly one respawn breadcrumb lands in serve.log regardless of scheduling). Deferred to follow-up: SIGTERM-resistance + SIGKILL escalation, status --version surfacing, Homebrew post-install hook. Closes #142. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…long connect Two issues surfaced on CI that didn't repro locally: 1. Spawned daemon inherits the restart-lock fd acquireRestartLock() opened restart.lock with plain O_CREAT|O_RDWR and held the fd through the kill+respawn sequence. Process.run() in spawnDaemon() uses posix_spawn, which duplicates open fds into the child by default. The daemon grandchild ended up holding a dup of our flock fd. Per flock(2), a kernel-level flock is only released when *every* dup of the fd is closed — so our defer's close() didn't actually drop the lock, and the next CLI's acquireRestartLock blocked on flock(LOCK_EX) until the daemon died. CI saw a ~50s stall here on the concurrent-CLIs test. Fix: open with O_CLOEXEC so the spawned daemon doesn't inherit the fd. 2. StallTimer stale before watcher starts withDaemonClient creates the timer at function entry, then calls connect() (which on the restart path can take 25+s on CI while Compiler() initializes in the respawned daemon), then starts the stall watcher. By the time the watcher begins waiting, elapsed time against the 30s threshold was already past on CI — and the first heartbeat (T+2s post-connect) could land right on the boundary, leaving the next Task.sleep trivially fireable. Fix: bump the timer right before starting the watcher so the window begins from post-connect, not from withDaemonClient entry. Both fixes are defensive and apply regardless of #142's specific flow. Tests unchanged; they pass locally in ~2.8s and should now pass on CI's slower runners. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Defense-in-depth for the Homebrew install path specifically. The in-binary version handshake already handles every install path (brew, SPM, source), but when the user is on brew we know *before* the new CLI runs that an upgrade just happened, so we can skip straight to killing the stale daemon and let the next command come up cleanly against a fresh one. Failure is swallowed (\`rescue nil\`): a wedged daemon that won't respond to SIGTERM within 5s must not fail \`brew upgrade\`. If the hook fails, the version handshake still surfaces a clear error on the next CLI invocation and runs through its own kill+respawn. No effect until the next release tag — formula ships on push of \`v*\` via .github/workflows/release.yml. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8d6fb88 to
c6a629b
Compare
… updates Three small post-rebase cleanups in DaemonClient.swift, no behavior change: - `weJustSpawned` no longer needs the if/else ladder to set a Bool. `let weJustSpawned = !DaemonProbe.canConnect()` reads as what it is — "did we just spawn?" — and the spawn body conditional uses the same name without restating the predicate. - `versionsMatch` and `baseVersion` were `static` (file-internal visibility). They're only called from inside `DaemonClient.swift` and have no callers in tests. `private` matches the rest of the file's helpers (`gitDescribeRange`, `openClient`, etc.) and prevents accidental cross-module use. - `spawnDaemon`'s env update for `_PREVIEWSMCP_DAEMON_RESTART_REASON` used `if let reason ... else ... removeValue`. Swift's `Dictionary[key] = nil` removes the key, so a single `env[key] = restartReason` (an `Optional<String>`) handles both branches. Same for the `_PREVIEWSMCP_TEST_DAEMON_VERSION` strip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
brew upgradeleft a v0.11 daemon running while the CLI is v0.12), kills it, and respawns — transparently, without the user runningkill-daemon. Closes CLI talks to out-of-date daemon after upgrade (no version check) #142.DaemonClient.connectcapturesInitialize.Result, comparesserverInfo.versionto compile-time version, and on mismatch acquires an advisoryflockon~/.previewsmcp/restart.lock→ always re-probes under the lock (so concurrent upgraded CLIs don't stampede) → SIGTERMs + respawns + reconnects. Single-shot retry; a second-handshake mismatch errors out naming the suspect env var.versionsMatch(): strict equality when both sides carry a git-describe-<N>-g<SHA>suffix (dev builds honestly restart on protocol-affecting changes), base-only match when exactly one side is a dev build atop a release.spawnDaemonunconditionally strips_PREVIEWSMCP_TEST_DAEMON_VERSIONfrom the child env — closes a stray-shell-rc respawn-loop footgun regardless of whether tests are involved. Version-mismatch respawns propagate_PREVIEWSMCP_DAEMON_RESTART_REASONso the new daemon logs one breadcrumb toserve.logfor bug-report diagnostics.The v0.12 regression this fixes
v0.12.0 introduced a 30s client-side stall timer that expects heartbeat notifications. v0.11 daemons don't emit heartbeats, so any user who
brew upgrades from 0.11→0.12 over a running daemon hits stall-disconnect errors on every command until they manuallykill-daemon.Collaboration trail
Design was agreed in-thread with a Plan subagent (two critique rounds; consensus summary lives at #142 (comment)). Implementation was independently reviewed by the code-reviewer subagent; findings that held up ("dev-to-dev builds should strict-compare", "pid aliasing hazard in
isAliveassertion", "concurrent test's banner-count claim is weaker than it looks") were all addressed. Two "Critical" findings about kill-and-respawn races were pushed back on after verifying the kernel's FD-close-before-zombie-reap ordering.Test plan
swift build— cleanswift-format lint --strict --recursive Sources/ Tests/— cleanswift test --filter VersionHandshakeTests— 3/3 pass in ~2.8s:happyPathRestart: stale daemon spawned via_PREVIEWSMCP_TEST_DAEMON_VERSION, CLI restarts it, stale pid-file-entry replaced with live new PID.noRestartWhenVersionsMatch: two back-to-back CLIs against a daemon at the real version — no banner, same PID.concurrentClientsRestartOnce: two parallel CLIs against a stale daemon → exactly onestarted after version-mismatch restartbreadcrumb inserve.log.swift test --filter CLIIntegrationTests— 69/69 pass.swift test --filter 'PreviewsCLITests|MCPIntegrationTests.DaemonLifecycleTests'— 14/14 pass.build-and-test— critical validation.Deferred to follow-ups
killDaemonAndWait(stands on its own as akill-daemonhardening).status --versionsurfacing the advertised daemon version.kill-daemonon upgrade (defense-in-depth for the Homebrew path specifically; already listed as follow-up in the issue body).🤖 Generated with Claude Code