fix(socket): refresh session token on Invalid token rejection (TAURI-RUST-9C)#2896
fix(socket): refresh session token on Invalid token rejection (TAURI-RUST-9C)#2896graycyrus wants to merge 4 commits into
Conversation
…RUST-9C)
After a disconnect, the ws_loop reconnected using the same token captured at
`SocketManager::connect` time. When the token had been rotated/expired, the
server returned `44{"message":"Invalid token"}` — but the loop treated this
as a regular `Failed` outcome and retried with the same stale credential 5
times before declaring "sustained outage" and firing a Sentry event.
Fix:
- Add `ConnectionOutcome::InvalidToken` to distinguish auth rejection from
transient transport failures.
- In `run_connection`, detect "Invalid token" in the SIO connect-error message
via the new `is_invalid_token_error` helper (case-insensitive) and return
`InvalidToken` instead of `Failed`.
- In `ws_loop`, on `InvalidToken`: call `refresh_session_token()` which loads
config + reads the current stored JWT (same path as
`handle_connect_with_session`). On success, update the in-loop token and
`continue` immediately (no backoff sleep). On failure (no stored token, user
logged out), fall back to `Failed` accounting so the Sentry escalation still
fires if the auth problem persists.
- Add unit tests for `is_invalid_token_error` and an end-to-end
`run_connection` test against a mock EIO server that sends the exact
production wire payload `44{"message":"Invalid token"}`.
Closes tinyhumansai#2892
📝 WalkthroughWalkthroughDetects Socket.IO "Invalid token" connect ACKs as ConnectionOutcome::InvalidToken, refreshes the session token from the credential store, updates the in-loop token, and immediately retries the connection on a successful refresh; otherwise falls back to the existing failure/backoff path. ChangesSocket Token Refresh & Recovery
Sequence Diagram(s)sequenceDiagram
participant WsLoop as ws_loop
participant RunConn as run_connection
participant SocketIO as Socket.IO Server
participant Detector as is_invalid_token_error()
participant TokenRefresh as refresh_session_token()
participant CredStore as Credential Store
WsLoop->>RunConn: attempt connection with current token
RunConn->>SocketIO: CONNECT with auth token
SocketIO-->>RunConn: CONNECT ACK error: "Invalid token"
RunConn->>Detector: classify error message
Detector-->>RunConn: true (case-insensitive match)
RunConn-->>WsLoop: ConnectionOutcome::InvalidToken
WsLoop->>WsLoop: reset failure/backoff state
WsLoop->>TokenRefresh: request fresh token
TokenRefresh->>CredStore: load config and session token
alt token refresh succeeds
CredStore-->>TokenRefresh: non-empty token returned
TokenRefresh-->>WsLoop: Some(new_token)
WsLoop->>WsLoop: update token variable
WsLoop->>RunConn: retry immediately with refreshed token
else token refresh fails
CredStore-->>TokenRefresh: None (missing/blank/error)
TokenRefresh-->>WsLoop: None
WsLoop->>WsLoop: apply normal failure backoff
WsLoop->>RunConn: retry on next backoff interval
end
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly Related PRs
Suggested Reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/socket/ws_loop.rs`:
- Around line 130-180: The code resets consecutive_failures and backoff before
actually obtaining a fresh token, which can suppress escalation or create tight
retry loops; instead, remove the early resets and only reset
consecutive_failures = 0 and backoff = Duration::from_millis(1000) after
refresh_session_token().await returns Some(fresh) and fresh != token (i.e., a
non-empty, different token), then assign token = fresh and continue immediately;
if refresh_session_token() returns None or returns the same token, treat it as a
regular failure path: increment consecutive_failures, call
log_connection_failure(...) as currently done, leave backoff unchanged, and do
not short-circuit the normal backoff sleep. Ensure you reference the
variables/functions consecutive_failures, backoff, token,
refresh_session_token(), and log_connection_failure when making this change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 487b3d2f-7ba6-4145-859e-f2cc3c550ab1
📒 Files selected for processing (3)
src/openhuman/socket/types.rssrc/openhuman/socket/ws_loop.rssrc/openhuman/socket/ws_loop_tests.rs
…ained Move consecutive_failures/backoff resets inside the `Some(fresh) if fresh != token` arm so they only fire when refresh_session_token() returns a *different* token. Previously the counters were cleared before the refresh completed, which meant: - A failed refresh (None) would restart the failure streak from 1 on every InvalidToken loop, suppressing the sustained-outage Sentry escalation. - A refresh that returned the same stale token would spin with no backoff delay, tight-looping against the server. Now `Some(_) | None` (same token or no token) falls through the normal Failed accounting so FAIL_ESCALATE_THRESHOLD still fires correctly.
…token-refresh-tauri-rust-9c
|
Thanks for this, @graycyrus — and credit where due: this PR was filed first (this one at 04:24, #2905 at 05:47). Both target the same crash, TAURI-RUST-9C / #2892 (the stale-token "Invalid token" reconnect storm), with the same set of files. We're going to consolidate on #2905 and close this one. To be transparent about why — it's not a comment on effort, the two diffs are genuinely close — but #2905 edges ahead on a few correctness points:
Your |
Summary
Fixes Sentry issue TAURI-RUST-9C — 237 events across 0.54.0 → 0.56.0.
Root cause:
ws_loopcaptured the session token atSocketManager::connecttime. On reconnect after a disconnect/token rotation, the server returned44{"message":"Invalid token"}. The loop treated this as a regularConnectionOutcome::Failedand retried with the same stale credential 5 times before declaring "sustained outage" and firing a Sentry event.Fix:
ConnectionOutcome::InvalidTokentotypes.rs— distinct from transport failures.run_connection, detect"Invalid token"in the SIO connect-error message viais_invalid_token_error()(case-insensitive) and returnInvalidTokeninstead ofFailed.ws_loop, onInvalidToken: callrefresh_session_token()which loads config + reads the current stored JWT (same path ashandle_connect_with_session). On success (a different token), resetconsecutive_failures/backoffandcontinueimmediately with no backoff sleep. On failure (user logged out / store unavailable / same stale token returned), fall back toFailedaccounting so the Sentry escalation still fires.consecutive_failurescounter andbackoffafter a genuinely different token is returned — prevents counter suppression if the store keeps returning the same stale JWT.Tests added:
is_invalid_token_error_matches_backend_message— exact/case-variant wire shapesis_invalid_token_error_does_not_match_other_errors— no false positivesrun_connection_returns_invalid_token_on_auth_reject— end-to-end through a mock EIO server sending44{"message":"Invalid token"}connection_outcome_variants_can_be_constructed— updated to coverInvalidTokenTest plan
pnpm debug rust(orcargo test -p openhuman socket)run_connection_returns_invalid_token_on_auth_rejecte2e mock test; live rotation requires a production sessionconsecutive_failuresresets to 0 onInvalidToken(no false outage Sentry event after a single auth rejection)Closes #2892