Skip to content

fix(relay): ignore stale websocket callbacks#696

Closed
tlongwell-block wants to merge 3 commits into
mainfrom
max/relay-reconnect-generation-guard
Closed

fix(relay): ignore stale websocket callbacks#696
tlongwell-block wants to merge 3 commits into
mainfrom
max/relay-reconnect-generation-guard

Conversation

@tlongwell-block
Copy link
Copy Markdown
Collaborator

Summary

  • add a connection generation guard to the desktop relay client
  • ignore websocket messages delivered by stale Tauri Channels after reset/reconnect/disconnect
  • re-check generation after async AUTH event creation before sending AUTH on the current socket

Why

After a watchdog-triggered reset, the old websocket callback can still deliver Close/Error/AUTH later. Without fencing, that stale callback can call resetConnection() against the newly connected wsId and tear down the fresh socket, leaving the UI permanently degraded/stale after Warp reconnects.

Verification

  • pnpm --dir desktop typecheck
  • pnpm --dir desktop check
  • node --test desktop/src/shared/api/relayReconnectPolicy.test.mjs desktop/src/shared/api/relayStallWatchdog.test.mjs desktop/src/shared/api/relayConnectionStateEmitter.test.mjs desktop/src/shared/api/relayClientShared.test.mjs
  • pre-commit hook: desktop-check, desktop-tauri-fmt, mobile-check, rust-fmt, web-check
  • pre-push hook: rust-fmt, web-check, desktop-tauri-fmt, desktop-check, rust-clippy, rust-tests, web-build, mobile-check, desktop-build, mobile-test, desktop-tauri-check

tlongwell-block and others added 3 commits May 19, 2026 22:47
Adds an observable `ConnectionState` to the relay singleton and a status
indicator next to the workspace name so users can tell when the relay is
unreachable. Addresses Alec's bug report about Warp's half-connected
state silently breaking Sprout.

The half-open case (Warp orange icon, asleep VPN, etc.) is the
interesting one: tungstenite auto-pongs and the OS keeps the TCP socket
open, so the WS layer reports nothing wrong. We add an app-level
liveness probe — a periodic NIP-01 REQ with a filter that matches
nothing real — and treat a missing EOSE within ~30s as a stalled
socket. When stalled the watchdog tears down the WS so the existing
reconnect path runs.

UI: when state is `reconnecting`, `stalled`, or `disconnected` (after a
2s debounce to avoid flashing on brief blips), the 🌱 emoji is replaced
with a pulsing red WifiOff icon, the workspace name turns red, and a
tooltip explains what's happening.

Split out of relayClientSession.ts:
  - relayConnectionStateEmitter.ts — pure observable
  - relayStallWatchdog.ts          — probe + onStall callback

Tests: 16 new unit tests covering the emitter, watchdog probe shape,
EOSE handling, timeout/send-failure stalls, stop() cancellation, and
idempotent start().

Refs Sprout bug: relay connection loss warning

Signed-off-by: Tyler Longwell <109685178+tlongwell-block@users.noreply.github.com>
Co-authored-by: Dawn (sprout agent) <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
Max's review of #623 caught a correctness bug. When the relay rejects
AUTH (kind:22242 OK=false), handleOk calls
resetConnection(err, { reconnect: false }) which sets state to
'disconnected' and clears the reconnect timer. BUT — the reconnect
timer's catch handler (`void this.ensureConnected().catch(() =>
this.scheduleReconnect())`) and the retry wrappers in publishEvent /
sendRawWithReconnectRetry would then immediately race the disconnected
state back to 'reconnecting'. Same hazard from any future call that
goes through ensureConnected().

Fix: add a sticky 'terminal' latch.
- Set in resetConnection when reconnect:false.
- Guards scheduleReconnect() (no new timers) and ensureConnected()
  (throws a terminal error).
- Cleared on explicit re-engagement: disconnect() (workspace switch)
  and preconnect() (caller is asking us to come back up).

While here:
- Extract the reconnect/refuse predicates to a pure relayReconnectPolicy
  module so the state-machine rules live in one legible place — also
  addresses Max's elegance note about distributing edge cases across
  connect/reset/scheduleReconnect.
- Fix the stale 'two consecutive misses' doc comment — implementation
  stalls on a single missed probe (or send failure) within the
  STALL_PROBE_TIMEOUT_MS window.

8 new unit tests cover the policy: baseline schedules, terminal wins
over every other reason, pending timer suppresses, live socket
suppresses, no-subs+no-keepalive idles, keep-alive alone is enough,
and shouldRefuseConnect mirrors terminal.

Signed-off-by: Tyler Longwell <109685178+tlongwell-block@users.noreply.github.com>
Co-authored-by: Dawn (sprout agent) <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
@tlongwell-block tlongwell-block requested a review from a team as a code owner May 21, 2026 14:29
@tlongwell-block
Copy link
Copy Markdown
Collaborator Author

Dawn found a deeper root cause in tauri-plugin-websocket: its send command holds the global ConnectionManager mutex across write.send(...).await. The watchdog probe introduced in #623 can block there during a half-open WARP connection, preventing new connects from registering their writer/read loop. This PR's generation guard is still a valid reconnect hardening fix, but by itself it likely won't fix the WARP repro. I'm leaving it open as a small follow-up, but it should probably land after/alongside the watchdog/plugin fix rather than be treated as the primary fix.

@tlongwell-block
Copy link
Copy Markdown
Collaborator Author

Closing in favor of the combined/ideal fix in #697. The generation guard from this PR is included there, alongside the passive watchdog fix that addresses the WARP reconnect wedge.

@tlongwell-block tlongwell-block deleted the max/relay-reconnect-generation-guard branch May 22, 2026 19:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant