fix(relay): ignore stale websocket callbacks#696
Conversation
Adds an observable `ConnectionState` to the relay singleton and a status indicator next to the workspace name so users can tell when the relay is unreachable. Addresses Alec's bug report about Warp's half-connected state silently breaking Sprout. The half-open case (Warp orange icon, asleep VPN, etc.) is the interesting one: tungstenite auto-pongs and the OS keeps the TCP socket open, so the WS layer reports nothing wrong. We add an app-level liveness probe — a periodic NIP-01 REQ with a filter that matches nothing real — and treat a missing EOSE within ~30s as a stalled socket. When stalled the watchdog tears down the WS so the existing reconnect path runs. UI: when state is `reconnecting`, `stalled`, or `disconnected` (after a 2s debounce to avoid flashing on brief blips), the 🌱 emoji is replaced with a pulsing red WifiOff icon, the workspace name turns red, and a tooltip explains what's happening. Split out of relayClientSession.ts: - relayConnectionStateEmitter.ts — pure observable - relayStallWatchdog.ts — probe + onStall callback Tests: 16 new unit tests covering the emitter, watchdog probe shape, EOSE handling, timeout/send-failure stalls, stop() cancellation, and idempotent start(). Refs Sprout bug: relay connection loss warning Signed-off-by: Tyler Longwell <109685178+tlongwell-block@users.noreply.github.com> Co-authored-by: Dawn (sprout agent) <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
Max's review of #623 caught a correctness bug. When the relay rejects AUTH (kind:22242 OK=false), handleOk calls resetConnection(err, { reconnect: false }) which sets state to 'disconnected' and clears the reconnect timer. BUT — the reconnect timer's catch handler (`void this.ensureConnected().catch(() => this.scheduleReconnect())`) and the retry wrappers in publishEvent / sendRawWithReconnectRetry would then immediately race the disconnected state back to 'reconnecting'. Same hazard from any future call that goes through ensureConnected(). Fix: add a sticky 'terminal' latch. - Set in resetConnection when reconnect:false. - Guards scheduleReconnect() (no new timers) and ensureConnected() (throws a terminal error). - Cleared on explicit re-engagement: disconnect() (workspace switch) and preconnect() (caller is asking us to come back up). While here: - Extract the reconnect/refuse predicates to a pure relayReconnectPolicy module so the state-machine rules live in one legible place — also addresses Max's elegance note about distributing edge cases across connect/reset/scheduleReconnect. - Fix the stale 'two consecutive misses' doc comment — implementation stalls on a single missed probe (or send failure) within the STALL_PROBE_TIMEOUT_MS window. 8 new unit tests cover the policy: baseline schedules, terminal wins over every other reason, pending timer suppresses, live socket suppresses, no-subs+no-keepalive idles, keep-alive alone is enough, and shouldRefuseConnect mirrors terminal. Signed-off-by: Tyler Longwell <109685178+tlongwell-block@users.noreply.github.com> Co-authored-by: Dawn (sprout agent) <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
|
Dawn found a deeper root cause in tauri-plugin-websocket: its send command holds the global ConnectionManager mutex across write.send(...).await. The watchdog probe introduced in #623 can block there during a half-open WARP connection, preventing new connects from registering their writer/read loop. This PR's generation guard is still a valid reconnect hardening fix, but by itself it likely won't fix the WARP repro. I'm leaving it open as a small follow-up, but it should probably land after/alongside the watchdog/plugin fix rather than be treated as the primary fix. |
|
Closing in favor of the combined/ideal fix in #697. The generation guard from this PR is included there, alongside the passive watchdog fix that addresses the WARP reconnect wedge. |
Summary
Why
After a watchdog-triggered reset, the old websocket callback can still deliver Close/Error/AUTH later. Without fencing, that stale callback can call resetConnection() against the newly connected wsId and tear down the fresh socket, leaving the UI permanently degraded/stale after Warp reconnects.
Verification