Skip to content

Detect daemon version mismatch and transparently restart (#142)#144

Merged
obj-p merged 4 commits intomainfrom
worktree-issue-142-version-handshake
Apr 30, 2026
Merged

Detect daemon version mismatch and transparently restart (#142)#144
obj-p merged 4 commits intomainfrom
worktree-issue-142-version-handshake

Conversation

@obj-p
Copy link
Copy Markdown
Owner

@obj-p obj-p commented Apr 22, 2026

Summary

  • CLI detects when it's talking to a daemon booted from a prior binary (e.g. brew upgrade left a v0.11 daemon running while the CLI is v0.12), kills it, and respawns — transparently, without the user running kill-daemon. Closes CLI talks to out-of-date daemon after upgrade (no version check) #142.
  • Handshake: DaemonClient.connect captures Initialize.Result, compares serverInfo.version to compile-time version, and on mismatch acquires an advisory flock on ~/.previewsmcp/restart.lock → always re-probes under the lock (so concurrent upgraded CLIs don't stampede) → SIGTERMs + respawns + reconnects. Single-shot retry; a second-handshake mismatch errors out naming the suspect env var.
  • versionsMatch(): strict equality when both sides carry a git-describe -<N>-g<SHA> suffix (dev builds honestly restart on protocol-affecting changes), base-only match when exactly one side is a dev build atop a release.
  • spawnDaemon unconditionally strips _PREVIEWSMCP_TEST_DAEMON_VERSION from the child env — closes a stray-shell-rc respawn-loop footgun regardless of whether tests are involved. Version-mismatch respawns propagate _PREVIEWSMCP_DAEMON_RESTART_REASON so the new daemon logs one breadcrumb to serve.log for bug-report diagnostics.

The v0.12 regression this fixes

v0.12.0 introduced a 30s client-side stall timer that expects heartbeat notifications. v0.11 daemons don't emit heartbeats, so any user who brew upgrades from 0.11→0.12 over a running daemon hits stall-disconnect errors on every command until they manually kill-daemon.

Collaboration trail

Design was agreed in-thread with a Plan subagent (two critique rounds; consensus summary lives at #142 (comment)). Implementation was independently reviewed by the code-reviewer subagent; findings that held up ("dev-to-dev builds should strict-compare", "pid aliasing hazard in isAlive assertion", "concurrent test's banner-count claim is weaker than it looks") were all addressed. Two "Critical" findings about kill-and-respawn races were pushed back on after verifying the kernel's FD-close-before-zombie-reap ordering.

Test plan

  • swift build — clean
  • swift-format lint --strict --recursive Sources/ Tests/ — clean
  • swift test --filter VersionHandshakeTests — 3/3 pass in ~2.8s:
    • happyPathRestart: stale daemon spawned via _PREVIEWSMCP_TEST_DAEMON_VERSION, CLI restarts it, stale pid-file-entry replaced with live new PID.
    • noRestartWhenVersionsMatch: two back-to-back CLIs against a daemon at the real version — no banner, same PID.
    • concurrentClientsRestartOnce: two parallel CLIs against a stale daemon → exactly one started after version-mismatch restart breadcrumb in serve.log.
  • swift test --filter CLIIntegrationTests — 69/69 pass.
  • swift test --filter 'PreviewsCLITests|MCPIntegrationTests.DaemonLifecycleTests' — 14/14 pass.
  • CI build-and-test — critical validation.

Deferred to follow-ups

  • SIGTERM-resistance + SIGKILL escalation in killDaemonAndWait (stands on its own as a kill-daemon hardening).
  • status --version surfacing the advertised daemon version.
  • Homebrew post-install hook running kill-daemon on upgrade (defense-in-depth for the Homebrew path specifically; already listed as follow-up in the issue body).

🤖 Generated with Claude Code

obj-p and others added 3 commits April 29, 2026 18:22
…142)

The daemon persists across CLI invocations (ADB-style), so after
`brew upgrade` the new CLI binary talks to the old daemon until
manually killed. v0.12.0 introduced a heartbeat protocol that old
daemons don't emit, so upgraded users hit 30s stall-disconnects on
every command.

DaemonClient.connect now captures the Initialize.Result
(previously discarded), compares serverInfo.version to the CLI's
compile-time version, and on mismatch:

  1. Acquires an advisory flock on ~/.previewsmcp/restart.lock so
     concurrent CLIs serialize instead of stampeding the daemon.
  2. Always re-probes under the lock — a sibling CLI may have
     already fixed the mismatch while we waited, in which case we
     use its fresh daemon instead of killing it.
  3. SIGTERMs the stale daemon, polls until gone, spawns fresh,
     reconnects. Guarded by a single-shot retry: if the new daemon
     still mismatches, errors naming _PREVIEWSMCP_TEST_DAEMON_VERSION
     as the suspect env var.

Version compare uses versionsMatch(): strict equality when both sides
carry the git-describe suffix (so dev-iteration builds honestly
restart on protocol-affecting changes), base-only match when exactly
one side is a dev build atop a release (so release-to-dev doesn't
thrash).

spawnDaemon unconditionally strips _PREVIEWSMCP_TEST_DAEMON_VERSION
from the child env so a stray shell-rc export can't cause respawn
loops. Version-mismatch respawns propagate _PREVIEWSMCP_DAEMON_
RESTART_REASON so the new daemon logs a one-line breadcrumb to
serve.log for diagnostics.

Three integration tests exercise happy-path restart, no-op when
versions match, and concurrent-CLI serialization (asserting exactly
one respawn breadcrumb lands in serve.log regardless of scheduling).

Deferred to follow-up: SIGTERM-resistance + SIGKILL escalation,
status --version surfacing, Homebrew post-install hook.

Closes #142.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…long connect

Two issues surfaced on CI that didn't repro locally:

1. Spawned daemon inherits the restart-lock fd
   acquireRestartLock() opened restart.lock with plain O_CREAT|O_RDWR
   and held the fd through the kill+respawn sequence. Process.run()
   in spawnDaemon() uses posix_spawn, which duplicates open fds into
   the child by default. The daemon grandchild ended up holding a
   dup of our flock fd. Per flock(2), a kernel-level flock is only
   released when *every* dup of the fd is closed — so our defer's
   close() didn't actually drop the lock, and the next CLI's
   acquireRestartLock blocked on flock(LOCK_EX) until the daemon
   died. CI saw a ~50s stall here on the concurrent-CLIs test.

   Fix: open with O_CLOEXEC so the spawned daemon doesn't inherit
   the fd.

2. StallTimer stale before watcher starts
   withDaemonClient creates the timer at function entry, then calls
   connect() (which on the restart path can take 25+s on CI while
   Compiler() initializes in the respawned daemon), then starts the
   stall watcher. By the time the watcher begins waiting, elapsed
   time against the 30s threshold was already past on CI — and the
   first heartbeat (T+2s post-connect) could land right on the
   boundary, leaving the next Task.sleep trivially fireable.

   Fix: bump the timer right before starting the watcher so the
   window begins from post-connect, not from withDaemonClient entry.

Both fixes are defensive and apply regardless of #142's specific
flow. Tests unchanged; they pass locally in ~2.8s and should now
pass on CI's slower runners.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Defense-in-depth for the Homebrew install path specifically. The
in-binary version handshake already handles every install path
(brew, SPM, source), but when the user is on brew we know
*before* the new CLI runs that an upgrade just happened, so we
can skip straight to killing the stale daemon and let the next
command come up cleanly against a fresh one.

Failure is swallowed (\`rescue nil\`): a wedged daemon that won't
respond to SIGTERM within 5s must not fail \`brew upgrade\`. If the
hook fails, the version handshake still surfaces a clear error on
the next CLI invocation and runs through its own kill+respawn.

No effect until the next release tag — formula ships on push of
\`v*\` via .github/workflows/release.yml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@obj-p obj-p force-pushed the worktree-issue-142-version-handshake branch from 8d6fb88 to c6a629b Compare April 29, 2026 22:23
… updates

Three small post-rebase cleanups in DaemonClient.swift, no behavior
change:

- `weJustSpawned` no longer needs the if/else ladder to set a Bool.
  `let weJustSpawned = !DaemonProbe.canConnect()` reads as what it
  is — "did we just spawn?" — and the spawn body conditional uses
  the same name without restating the predicate.

- `versionsMatch` and `baseVersion` were `static` (file-internal
  visibility). They're only called from inside `DaemonClient.swift`
  and have no callers in tests. `private` matches the rest of the
  file's helpers (`gitDescribeRange`, `openClient`, etc.) and
  prevents accidental cross-module use.

- `spawnDaemon`'s env update for `_PREVIEWSMCP_DAEMON_RESTART_REASON`
  used `if let reason ... else ... removeValue`. Swift's
  `Dictionary[key] = nil` removes the key, so a single
  `env[key] = restartReason` (an `Optional<String>`) handles both
  branches. Same for the `_PREVIEWSMCP_TEST_DAEMON_VERSION` strip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@obj-p obj-p merged commit 1b27286 into main Apr 30, 2026
4 checks passed
@obj-p obj-p deleted the worktree-issue-142-version-handshake branch April 30, 2026 00:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CLI talks to out-of-date daemon after upgrade (no version check)

1 participant