Fix stuck Concierge thinking indicator when client misses Onyx clear update #85620

marcochavezf wants to merge 29 commits into main
Conversation
…dates When a client misses real-time Pusher events and catches up via GetMissingOnyxMessages, Onyx batches the SET and CLEAR merges into a single notification with the final (empty) value. The hook never sees the intermediate non-empty server label, so optimisticStartTime is never cleared — leaving the indicator stuck permanently. Fix: track NVP write count via a direct Onyx.connect subscription. When the indicator NVP is written (even if the final rendered value is empty), increment a version counter. On kickoff, snapshot the counter. The effect compares versions to detect server activity that React batching would otherwise hide. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abdulrahuman5196 Please copy/paste the Reviewer Checklist from here into a new comment on this PR and complete it. If you have the K2 extension, you can simply click: [this button]
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1dffb249ec
PR doesn’t have any new product considerations as a code clean-up PR. Unassigning and unsubscribing myself.
Move NVP version tracking from the hook into a dedicated NVPIndicatorVersionTracker lib to satisfy the ESLint rule that restricts Onyx.connect() to /src/libs/**. This also fixes the TypeScript callback type mismatch. Replace "backgrounded" with "in the background" for spellcheck. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…early return - Change connectionMap type from Map<string, number> to Map<string, Connection> - Import Connection type from react-native-onyx - Use early return pattern in NVP version listener callback Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
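For illustration, the tracker lib might have looked roughly like this — a minimal sketch where the Map shapes and the Connection type follow the commit messages above, while the function names, key handling, and callback contract are assumptions:

```ts
// src/libs/NVPIndicatorVersionTracker.ts (sketch)
import Onyx from 'react-native-onyx';
import type {Connection} from 'react-native-onyx';

const versions = new Map<string, number>();
const connectionMap = new Map<string, Connection>();

function startTracking(nvpKey: string) {
    // Early return: keep at most one Onyx.connect subscription per key.
    if (connectionMap.has(nvpKey)) {
        return;
    }
    const connection = Onyx.connect({
        key: nvpKey,
        callback: () => {
            // Count every write to the NVP — even when the batched final value
            // is empty, the write itself proves the server round-tripped.
            versions.set(nvpKey, (versions.get(nvpKey) ?? 0) + 1);
        },
    });
    connectionMap.set(nvpKey, connection);
}

function getVersion(nvpKey: string): number {
    return versions.get(nvpKey) ?? 0;
}

export default {startTracking, getVersion};
```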
Codecov Report

❌ Looks like you've decreased code coverage for some files. Please write tests to increase, or at least maintain, the existing level of code coverage. See our documentation here for how to interpret this table.
…sends The version-snapshot approach failed when two messages were sent rapidly: msg1's server response bumped the version past msg2's snapshot, clearing the thinking indicator prematurely. Replaced with a pending-request counter that increments on each kickoff and decrements when a full server roundtrip (SET+CLEAR = 2 version bumps) completes, so the indicator persists until all outstanding requests are processed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
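A rough sketch of that counter, assuming each NVP write surfaces as one "version bump" (the map and function names here are illustrative, not the actual implementation):

```ts
// Module-level state: outstanding requests and accumulated bumps per report.
const pendingByReport = new Map<string, number>();
const bumpsByReport = new Map<string, number>();

// Called when the user sends a message and we optimistically show "thinking".
function onKickoff(reportID: string) {
    pendingByReport.set(reportID, (pendingByReport.get(reportID) ?? 0) + 1);
}

// Called on each NVP version bump; a full server roundtrip is SET + CLEAR = 2 bumps.
function onVersionBump(reportID: string) {
    const bumps = (bumpsByReport.get(reportID) ?? 0) + 1;
    if (bumps < 2) {
        bumpsByReport.set(reportID, bumps);
        return;
    }
    bumpsByReport.set(reportID, 0);
    pendingByReport.set(reportID, Math.max((pendingByReport.get(reportID) ?? 1) - 1, 0));
}

// The indicator persists while any request is still outstanding.
function hasPendingRequests(reportID: string): boolean {
    return (pendingByReport.get(reportID) ?? 0) > 0;
}
```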
…stuck indicator fix

The thinking indicator is ephemeral data stored in a durable system (Onyx NVPs). When the client misses the server CLEAR update (Onyx batching coalesces SET+CLEAR, Pusher reconnect delivers stale state, or the CLEAR is dropped), the indicator gets stuck permanently. This implements the "lease pattern" from distributed systems: every state assertion is time-bounded. Without renewal or an explicit clear, the indicator auto-expires.

Changes:
- Add a 60s safety timeout that auto-clears the indicator when no CLEAR arrives
- Reset the timer on each new server label (lease renewal)
- Clear the indicator on network reconnect (like typing indicators)
- Remove NVPIndicatorVersionTracker entirely (the TTL handles all its failure modes)
- Add a clearAgentZeroProcessingIndicator action to Report actions
- Add 6 new test cases covering TTL, reconnect, and version tracker removal

The 60s timeout is appropriate because AI processing takes 30-45s on dev, and XMPP uses 30s for composing→paused transitions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
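A condensed sketch of the lease — the 60s constant comes from the commit message; the hook name and the clearIndicator parameter (standing in for clearAgentZeroProcessingIndicator) are placeholders:

```ts
import {useEffect, useRef} from 'react';

const INDICATOR_TTL_MS = 60_000;

function useIndicatorLease(serverLabel: string | undefined, clearIndicator: () => void) {
    const timerRef = useRef<ReturnType<typeof setTimeout> | null>(null);

    useEffect(() => {
        if (!serverLabel) {
            return;
        }
        // Each new server label renews the lease; without renewal or an
        // explicit CLEAR, the indicator auto-expires after the TTL.
        if (timerRef.current) {
            clearTimeout(timerRef.current);
        }
        timerRef.current = setTimeout(clearIndicator, INDICATOR_TTL_MS);
        return () => {
            if (timerRef.current) {
                clearTimeout(timerRef.current);
            }
        };
    }, [serverLabel, clearIndicator]);
}
```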
The spellcheck CI job flags XMPP (a messaging protocol referenced in code comments) as an unknown word. Add it to the allowed list. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nects When the 60s TTL safety timer expires or Pusher reconnects, the indicator is cleared but the Concierge response that triggered the CLEAR may also have been dropped. Call getNewerActions to pull missed report actions via HTTP so the user sees the response without a manual refresh. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ator Resolve merge conflicts after main refactored useAgentZeroStatusIndicator hook into AgentZeroStatusContext. Keep both: the context (from main) for standard status display, and the hook with safety timeout/TTL (from this branch) for the stuck indicator fix. Restore ConciergeReasoningStore.ts and add mocks for the combined test file. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Long Concierge responses can take up to 2 minutes. A hard 60s TTL incorrectly clears a legitimate in-progress response. Instead, use a progressive retry schedule:
- 60s: Call getNewerActions, keep the indicator showing
- 90s: Retry getNewerActions, keep the indicator showing
- 120s: Final retry — if still no new actions AND online, clear the indicator

If the Concierge reply arrives at any point (via Pusher or a getNewerActions response), the normal Onyx update clears the indicator automatically. If offline at the final retry, skip the clear — the response may arrive when reconnected (the onReconnect handler covers that case).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The hook now uses progressive retry intervals (60s → 90s → 120s) instead of a single 60s hard-clear timeout. Update the two failing tests to verify that intermediate retries (60s, 90s) only poll via getNewerActions while the indicator stays visible, and only the final retry (120s) clears the indicator. Also add getNewerActions to the Report mock. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of three separate setTimeout calls at 60s/90s/120s, poll getNewerActions every 30s via setInterval while the thinking indicator is visible. This is simpler and more robust against WebSocket drops. The 120s safety timeout is kept as a hard clear fallback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
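Roughly, that polling loop could look like this — getNewerActions(reportID, newestReportActionID) is the call named throughout this PR, while the hook name, import path, and wiring are assumptions:

```ts
import {useEffect} from 'react';
import {getNewerActions} from '@userActions/Report';

const POLL_INTERVAL_MS = 30_000;
const HARD_CLEAR_MS = 120_000;

function usePollingFallback(isProcessing: boolean, reportID: string, newestReportActionID: string | undefined, clearIndicator: () => void) {
    useEffect(() => {
        if (!isProcessing || !newestReportActionID) {
            return;
        }
        const startedAt = Date.now();
        const intervalID = setInterval(() => {
            // Actively fetch anything sent while the WebSocket was down.
            getNewerActions(reportID, newestReportActionID);
            // Safety fallback: hard-clear after 120s (4 empty polls).
            if (Date.now() - startedAt >= HARD_CLEAR_MS) {
                clearIndicator();
            }
        }, POLL_INTERVAL_MS);
        return () => clearInterval(intervalID);
    }, [isProcessing, reportID, newestReportActionID, clearIndicator]);
}
```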
The onReconnect handler was clearing the indicator immediately, which caused the thinking state to disappear while offline. Now it fetches newer actions and restarts polling without clearing — the indicator stays visible until the actual server CLEAR arrives via Onyx. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ear on reconnect The original design intentionally hides 'Concierge is thinking...' while offline (!isOffline in isProcessing). This is correct UX — showing a thinking indicator when the user can't receive a response is misleading. On reconnect, the indicator reappears naturally if the server NVP still has processing state. getNewerActions fetches any missed responses, and polling restarts as safety net. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
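A sketch of that reconnect behavior — useNetwork({onReconnect}) is the app's existing hook; restartPolling is a placeholder for whatever re-arms the polling interval:

```ts
import useNetwork from '@hooks/useNetwork';
import {getNewerActions} from '@userActions/Report';

function useReconnectRecovery(reportID: string, newestReportActionID: string | undefined, restartPolling: () => void) {
    useNetwork({
        onReconnect: () => {
            // Do NOT clear here: the server NVP is the source of truth, so the
            // indicator reappears naturally if processing is still underway, and
            // the real CLEAR arrives via Onyx once the response lands.
            if (newestReportActionID) {
                getNewerActions(reportID, newestReportActionID);
            }
            restartPolling();
        },
    });
}
```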
…Option C) Replace duplicated Pusher/reasoning/polling logic in the Context with a thin wrapper that delegates to the hook. Update tests to use ConciergeReasoningStore and subscribeToReportReasoningEvents instead of direct Pusher mocks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of waiting for the next 30s poll cycle to detect the Concierge actorAccountID, watch the newestReportAction Onyx subscription directly. When getNewerActions fetches the response and Onyx merges it, the useEffect fires immediately and clears the indicator. Reduces clear delay from ~55s to <5s after reconnect. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
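The fast-clear path might look roughly like this — CONST.ACCOUNT_ID.CONCIERGE and the report-actions collection key exist in the app, but the newest-action selection below is simplified (ordering of Onyx collection values is not guaranteed):

```ts
import {useEffect} from 'react';
import {useOnyx} from 'react-native-onyx';
import CONST from '@src/CONST';
import ONYXKEYS from '@src/ONYXKEYS';

function useClearOnConciergeReply(reportID: string, clearIndicator: () => void) {
    const [reportActions] = useOnyx(`${ONYXKEYS.COLLECTION.REPORT_ACTIONS}${reportID}`);
    // Depend on the primitive actorAccountID rather than the whole action
    // object, so the effect fires only when the newest actor changes.
    const newestActorAccountID = Object.values(reportActions ?? {}).at(-1)?.actorAccountID;

    useEffect(() => {
        if (newestActorAccountID !== CONST.ACCOUNT_ID.CONCIERGE) {
            return;
        }
        // Onyx merged the fetched Concierge reply — clear immediately instead
        // of waiting up to 30s for the next poll tick.
        clearIndicator();
    }, [newestActorAccountID, clearIndicator]);
}
```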
The useEffect referencing optimisticStartTime and clearPolling was placed before they were declared, causing a ReferenceError (temporal dead zone). Moved to after all dependencies are defined. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
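A minimal illustration of the hazard: a `const` binding read in a dependency array before its declaration is evaluated during render and throws.

```ts
import {useCallback, useEffect, useState} from 'react';

function useIndicatorExample() {
    // Declarations first: these `const` bindings sit in the temporal dead
    // zone until this point. An effect placed ABOVE these lines that listed
    // `clearPolling` in its dependency array would read it during render and
    // throw a ReferenceError.
    const [optimisticStartTime, setOptimisticStartTime] = useState<number | null>(null);
    const clearPolling = useCallback(() => setOptimisticStartTime(null), []);

    // Safe: both bindings are initialized before the effect references them.
    useEffect(() => {
        if (optimisticStartTime !== null) {
            return;
        }
        clearPolling();
    }, [optimisticStartTime, clearPolling]);
}
```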
Use early returns instead of nested if, and newestActorAccountID instead of the full newestReportAction object in the dependency array. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0790eec47c
- P1: Reset optimistic state (pendingOptimisticRequests) when clearing on Concierge reply
- P2: Scope counter to specific requests (increment/reset pattern)
- P2: Preserve Pusher subscription handle for proper per-callback cleanup
- CLEAN-REACT-PATTERNS-0: Remove manual useCallback/useMemo (React Compiler handles it)
- PERF-14: Replace dual useEffect with useSyncExternalStore for ConciergeReasoningStore

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The useSyncExternalStore hook requires stable references for its subscribe and getSnapshot callbacks. Without React Compiler active in the test environment, these functions were recreated every render, causing infinite re-subscribe loops. Three fixes:
- Wrap subscribe/getSnapshot in useCallback for useSyncExternalStore
- Return a stable EMPTY_ENTRIES reference from ConciergeReasoningStore instead of creating a new [] on each getSnapshot call
- Memoize kickoffWaitingIndicator, the return value, and context values

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
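The stable-snapshot fix, sketched — ReasoningEntry and the store's internal map are assumptions about ConciergeReasoningStore's shape:

```ts
type ReasoningEntry = {text: string; timestamp: number};

const historyByReport = new Map<string, ReasoningEntry[]>();
const EMPTY_ENTRIES: ReasoningEntry[] = [];

function getReasoningHistory(reportID: string): ReasoningEntry[] {
    // Returning a fresh [] here would hand useSyncExternalStore a new
    // snapshot reference on every call, which it treats as a store change,
    // producing an endless re-render loop.
    return historyByReport.get(reportID) ?? EMPTY_ENTRIES;
}
```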
… memoization) React Compiler is enabled and this file compiles successfully, making manual useCallback and useMemo wrappers redundant. The compiler automatically memoizes closures and derived values based on their captured variables. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
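As a toy before/after (not the PR's actual code) of what the compiler makes redundant:

```ts
import {useCallback, useMemo, useState} from 'react';

// Before: manual memoization kept identities stable across renders.
function useStatusValueManual() {
    const [isProcessing, setIsProcessing] = useState(false);
    const kickoff = useCallback(() => setIsProcessing(true), []);
    return useMemo(() => ({kickoff, isProcessing}), [kickoff, isProcessing]);
}

// After: with React Compiler active on the file, plain closures and object
// literals are memoized automatically based on their captured variables.
function useStatusValueCompiled() {
    const [isProcessing, setIsProcessing] = useState(false);
    const kickoff = () => setIsProcessing(true);
    return {kickoff, isProcessing};
}
```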
…ix-stuck-thinking-indicator

# Conflicts:
#	src/pages/inbox/AgentZeroStatusContext.tsx
Hi @abdulrahuman5196, the last backend PR has been deployed to staging, so the PR is ready for review.
PR Review

Overall this is a solid architectural improvement — extracting the indicator logic into a dedicated hook with polling recovery is the right approach for handling missed Pusher events. A few issues to address:

Issues

1. The subscribe callback passed to useSyncExternalStore should filter by reportID, since the store notifies listeners for every report:

```ts
// Module-level subscribe that passes reportID to listeners
const subscribeToReasoningStore = useCallback((onStoreChange: () => void) => {
    return ConciergeReasoningStore.subscribe((updatedReportID) => {
        if (updatedReportID !== reportID) {
            return;
        }
        onStoreChange();
    });
}, [reportID]);

const getReasoningSnapshot = useCallback(
    () => ConciergeReasoningStore.getReasoningHistory(reportID),
    [reportID]
);
```

Or better: make …

2. The main …

3. Unnecessary …

4. These functions are defined inline and capture refs + state. The eslint-disable comment says "stable via React Compiler," but this is fragile — if React Compiler isn't applied or the function identity changes, the effect will re-run and restart polling unexpectedly. At minimum, document which React Compiler version/config guarantees this, or wrap in useCallback.

Minor / Non-blocking

Next Steps: Reply with …
- Resolve merge conflict in AgentZeroStatusContext.tsx: adopt main chatType selector instead of prop, keep hook-based architecture
- Remove unnecessary XMPP entry from cspell.json (situchan review)
- Remove useMemo wrapper in AgentZeroStatusGate (React Compiler handles it)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reviewer Checklist
Screenshots/Videos
Android: HybridApp — android.mov
Android: mWeb Chrome — mchrome.mov
iOS: HybridApp — ios.mov
iOS: mWeb Safari — msafari.mov
MacOS: Chrome / Safari — web.mov
TypeScript failing
PR #87077 refactored AddNewCardPage to useConfirmModal but left a dangling isModalVisible reference (removed the useState but not the usage). This breaks typecheck and ESLint on any branch that merges main after that commit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Updated!
Concierge thinking indicator is still stuck for this scenario:
Screen.Recording.2026-04-09.at.3.32.54.AM.mov
```diff
     break;
 default:
-    CurrentStep = <SelectCountryStep disableAutoFocus={isModalVisible} />;
+    CurrentStep = <SelectCountryStep />;
```
Undo this change and merge main. They fixed another way in #87431
I compared the original implementation with the new hook:

1. Missing: Clear optimistic state on network reconnect

Original (main):

```ts
if (!isOffline && optimisticStartTime) {
    setOptimisticStartTime(null);
}
```

New (…): no equivalent reset on reconnect.

This may be intentional (keep showing "thinking" until we confirm the response arrived rather than briefly hiding it), but it's a behavioral change from the original.

2. …
Concierge thinking indicator not showing at all on mobile on this branch:

Screen.Recording.2026-04-09.at.12.57.14.PM.mov
Screen.Recording.2026-04-09.at.12.54.46.PM.mov
On hold for
https://github.com/Expensify/Web-Expensify/pull/51695 and https://github.com/Expensify/Auth/pull/20741

Explanation of Change
When a client misses real-time Pusher events (e.g., tab backgrounded, brief disconnect), the agentZeroProcessingRequestIndicator NVP can get permanently stuck in the "thinking" state. This is a fundamental architectural mismatch: the thinking indicator is ephemeral/transient data stored in a durable system (Onyx NVPs).

Every major real-time platform (WhatsApp, Discord, Slack, XMPP, PubNub) treats processing/typing indicators as ephemeral signals with client-side TTL timeouts — none persist them durably. Expensify's own typing indicator (Report/index.ts:510-523) already solves this exact problem with a client-side timeout (10s for Concierge).

Fix: Apply a polling + safety timeout pattern to recover from missed WebSocket events:
- When isProcessing becomes true, start a setInterval that calls getNewerActions(reportID, newestReportActionID) every 30s. This actively fetches any Concierge response that was sent while the WebSocket was disconnected. Polling only fires when online (checks isOfflineRef).
- Use useNetwork() to detect reconnection and clear the indicator + restart polling.

Why polling over a hard TTL: Concierge responses can take up to 2 minutes for complex queries. A hard 60s TTL would incorrectly clear a legitimate in-progress response. The 30s poll keeps the indicator visible while actively checking for the response — if it arrives, the normal Onyx update clears the indicator. Only after 120s (4 failed polls) do we hard-clear.
Backed by research: 17 industry sources converge on client-side recovery for transient indicators. Pusher does NOT replay missed messages on reconnection — polling getNewerActions is the only way to recover lost events.

New files:
- src/hooks/useAgentZeroStatusIndicator.ts — Core hook with polling, TTL, reconnect logic, and Concierge response detection
- src/libs/ConciergeReasoningStore.ts — Ephemeral in-memory store for reasoning summaries (not persisted to Onyx)

Modified files:
- src/pages/inbox/AgentZeroStatusContext.tsx — Simplified to delegate to the new hook
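A sketch of that delegation shape — the file and hook names match the PR's new/modified files, while the context value and props are assumptions:

```tsx
import React, {createContext, useContext} from 'react';
import useAgentZeroStatusIndicator from '@hooks/useAgentZeroStatusIndicator';

const AgentZeroStatusContext = createContext({isProcessing: false});

function AgentZeroStatusProvider({reportID, children}: {reportID: string; children: React.ReactNode}) {
    // All Pusher/reasoning/polling logic now lives in the hook; the Context
    // is a thin wrapper that only distributes the result.
    const {isProcessing} = useAgentZeroStatusIndicator(reportID);
    return <AgentZeroStatusContext.Provider value={{isProcessing}}>{children}</AgentZeroStatusContext.Provider>;
}

const useAgentZeroStatus = () => useContext(AgentZeroStatusContext);

export {AgentZeroStatusProvider, useAgentZeroStatus};
```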
$ https://github.com/Expensify/Expensify/issues/612534
Tests
Unit tests (24 total):
Test evidence: All 24 tests pass, zero regressions. All CI checks green (ESLint, TypeScript, Jest, Prettier, perf-tests, spellcheck, build).
Offline tests
The indicator is intentionally hidden while offline (!isOffline in isProcessing — original design).
Same as tests above. Key scenarios:
Normal flow: indicator appears, Concierge responds, indicator clears
Offline recovery: indicator appears -> go offline -> come back online -> response arrives via poll or reconnect -> indicator clears
Safety timeout: if 4 polls (120s) return nothing while online, indicator hard-clears
Pusher drop (exact production failure mode): use window.getPusherInstance().disconnect() to drop only the WebSocket while keeping HTTP alive -> indicator persists -> reconnect -> response recovered via polling

Verify that no errors appear in the JS console
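For reference, the console sequence for that last scenario might look like this — the QA steps name only disconnect(), so the connect() call to restore the WebSocket is an assumption based on standard pusher-js:

```ts
// Browser console (web): drop only the WebSocket; HTTP stays alive.
(window as any).getPusherInstance().disconnect();
// ...wait past at least one 30s poll cycle, then restore the connection:
(window as any).getPusherInstance().connect();
```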
PR Author Checklist
Screenshots/Videos
Android: Native
N/A — This change is hook/logic only (no UI component changes). The thinking indicator renders identically across all platforms via existing components.
Android: mWeb Chrome
N/A — Hook/logic change only, tested via web screenshots below.
iOS: Native
N/A — Hook/logic change only, tested via web screenshots below.
iOS: mWeb Safari
N/A — Hook/logic change only, tested via web screenshots below.
MacOS: Chrome / Safari
Pusher disconnect recovery test (exact production failure mode):
Uses window.getPusherInstance().disconnect() to drop only the WebSocket while keeping HTTP alive — this is exactly what happens in production when Pusher has a brief hiccup.

Video (full Pusher-drop cycle, 60s disconnected):
https://github.com/Expensify/Expensify/releases/download/untagged-612534-polling-1774719056/pusher-drop-recovery.webm
Key result: The thinking indicator persisted for the full 60s while Pusher was disconnected. After Pusher reconnected:
- The missed Concierge response was recovered (getNewerActions polling — 5 calls detected)
- The indicator cleared once the response arrived (actorAccountID === CONCIERGE)
- Network logs (AuthenticatePusher + 5 GetNewerActions calls) confirmed the recovery mechanism

Note on offline behavior: When fully offline (setOffline(true)), the indicator is intentionally hidden (!isOffline in isProcessing) — this is the original design. The bug only manifests during WebSocket drops where the user appears online but misses Pusher events.
https://github.com/Expensify/Expensify/releases/download/untagged-612534-polling-1774719056/pusher-drop-final.webm
Test Evidence (Automated Unit Tests)
Investigation Summary
Root cause: The agentZeroProcessingRequestIndicator NVP is set by the server when Concierge starts processing and cleared when it finishes. If the client misses the clear event (WebSocket drop, tab backgrounded, Pusher hiccup), the local Onyx cache retains the stale non-empty value permanently — there was no recovery mechanism.

Solution approach (informed by 17 industry sources on real-time indicator patterns): the getNewerActions API was already available and returns NVPs when the Auth-side companion PR is included.

Companion PR: Auth #20717 — Includes report NVPs in the GetNewerActions response so the HTTP fallback can recover the current indicator state.