
Fix stuck Concierge thinking indicator when client misses Onyx clear update #85620

Open
marcochavezf wants to merge 29 commits into main from
marcochavezf/612534-fix-stuck-thinking-indicator

Conversation

@marcochavezf
Contributor

@marcochavezf marcochavezf commented Mar 18, 2026

On hold for https://github.com/Expensify/Web-Expensify/pull/51695 and https://github.com/Expensify/Auth/pull/20741

Explanation of Change

When a client misses real-time Pusher events (e.g., tab backgrounded, brief disconnect), the agentZeroProcessingRequestIndicator NVP can get permanently stuck in the "thinking" state. This is a fundamental architectural mismatch: the thinking indicator is ephemeral/transient data stored in a durable system (Onyx NVPs).

Every major real-time platform (WhatsApp, Discord, Slack, XMPP, PubNub) treats processing/typing indicators as ephemeral signals with client-side TTL timeouts; none persist them durably. Expensify's own typing indicator (Report/index.ts:510-523) already solves this exact problem with a client-side timeout (10s for Concierge).

Fix: Apply a polling + safety timeout pattern to recover from missed WebSocket events:

  1. 30-second polling: When isProcessing becomes true, start a setInterval that calls getNewerActions(reportID, newestReportActionID) every 30s. This actively fetches any Concierge response that was sent while the WebSocket was disconnected. Polling only fires when online (checks isOfflineRef).
  2. 120-second safety timeout: If 4 polls return nothing AND we are online, hard-clear the indicator (the request was likely lost). This prevents indefinite stuck states.
  3. Network reconnect reset: Use useNetwork() to detect reconnection and clear the indicator + restart polling.
  4. Concierge response detection via Onyx: When a new report action arrives from Concierge (actorAccountID === CONCIERGE), immediately clear the indicator without waiting for the next poll cycle.
  5. Remove NVPIndicatorVersionTracker: The previous approach tracked Onyx write counts to detect batched SET+CLEAR coalescing. This was a workaround that masked the root cause.
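
The first two steps can be sketched without React as a small tick handler. This is a minimal illustration under stated assumptions, not the hook's actual code: `createIndicatorPoller`, `PollerDeps`, and `startPolling` are hypothetical names, and the choices to fetch on the fourth tick before hard-clearing and to not count offline ticks are assumptions. Only the 30-second interval, the four-poll limit, and the online check come from the list above.

```typescript
// Hypothetical sketch of the polling + safety-timeout pattern (not the PR's hook code).
const POLL_INTERVAL_MS = 30_000;
const MAX_POLLS = 4; // 4 polls x 30s = 120s safety timeout

type PollerDeps = {
    fetchNewerActions: () => void; // stands in for getNewerActions(reportID, newestReportActionID)
    clearIndicator: () => void; // hard-clears the stuck indicator
    isOffline: () => boolean; // stands in for reading isOfflineRef
};

function createIndicatorPoller(deps: PollerDeps) {
    let polls = 0;
    let stopped = false;

    return {
        // Intended to be called once per POLL_INTERVAL_MS.
        onTick(): void {
            if (stopped || deps.isOffline()) {
                return; // assumption: offline ticks neither poll nor count toward the timeout
            }
            polls += 1;
            deps.fetchNewerActions();
            if (polls >= MAX_POLLS) {
                stopped = true;
                deps.clearIndicator();
            }
        },
        // Called when the indicator clears normally, e.g. the Concierge reply arrived.
        stop(): void {
            stopped = true;
        },
    };
}

// Wires the tick handler to a real timer and returns a cleanup function.
function startPolling(poller: {onTick: () => void}): () => void {
    const id = setInterval(poller.onTick, POLL_INTERVAL_MS);
    return () => clearInterval(id);
}
```

Keeping the tick logic separate from `setInterval` means the 120s path can be exercised in tests by calling `onTick()` four times instead of advancing fake timers.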

Why polling over a hard TTL: Concierge responses can take up to 2 minutes for complex queries. A hard 60s TTL would incorrectly clear a legitimate in-progress response. The 30s poll keeps the indicator visible while actively checking for the response — if it arrives, the normal Onyx update clears the indicator. Only after 120s (4 failed polls) do we hard-clear.

Backed by research: 17 industry sources converge on client-side recovery for transient indicators. Pusher does NOT replay missed messages on reconnection — polling getNewerActions is the only way to recover lost events.

New files:

  • src/hooks/useAgentZeroStatusIndicator.ts — Core hook with polling, TTL, reconnect logic, and Concierge response detection
  • src/libs/ConciergeReasoningStore.ts — Ephemeral in-memory store for reasoning summaries (not persisted to Onyx)

Modified files:

  • src/pages/inbox/AgentZeroStatusContext.tsx — Simplified to delegate to the new hook

Fixed Issues

$ https://github.com/Expensify/Expensify/issues/612534

Tests

  1. Open a Concierge DM or #admins room
  2. Send a message to trigger the "Concierge is thinking..." indicator
  3. Verify the indicator appears and clears normally when Concierge responds
  4. Go offline while "Concierge is thinking..." is shown (simulate WebSocket drop)
  5. Wait 35+ seconds — verify the indicator persists (polling is blocked while offline)
  6. Go back online — verify the Concierge response arrives (via Pusher reconnect or 30s poll)
  7. Verify the thinking indicator clears after the response arrives
  8. Verify that no errors appear in the JS console

Unit tests (24 total):

  • should start 30s polling when indicator appears
  • should auto-clear indicator after 120s safety timeout
  • should cancel polling when indicator clears normally
  • should reset polling when a new server label arrives
  • should reset indicator on network reconnect
  • should clear indicator immediately when Concierge response detected via Onyx

Test evidence: All 24 tests pass, zero regressions. All CI checks green (ESLint, TypeScript, Jest, Prettier, perf-tests, spellcheck, build).


Offline tests

  1. Send a message to Concierge while online
  2. Go offline while "Concierge is thinking..." is shown
  3. Verify indicator is hidden while offline (!isOffline in isProcessing — original design)
  4. Come back online
  5. Verify the Concierge response arrives within ~30s (via Pusher reconnect or poll)
  6. Verify the indicator clears

QA Steps

Same as tests above. Key scenarios:

  • Normal flow: indicator appears, Concierge responds, indicator clears

  • Offline recovery: indicator appears -> go offline -> come back online -> response arrives via poll or reconnect -> indicator clears

  • Safety timeout: if 4 polls (120s) return nothing while online, indicator hard-clears

  • Pusher drop (exact production failure mode): use window.getPusherInstance().disconnect() to drop only the WebSocket while keeping HTTP alive -> indicator persists -> reconnect -> response recovered via polling

  • Verify that no errors appear in the JS console

PR Author Checklist

  • I linked the correct issue in the ### Fixed Issues section above
  • I wrote clear testing steps that cover the changes made in this PR
    • I added steps for local testing in the Tests section
    • I added steps for the expected offline behavior in the Offline steps section
    • I added steps for Staging and/or Production testing in the QA steps section
    • I added steps to cover failure scenarios (i.e. verify an input displays the correct error message if the entered data is not correct)
    • I turned off my network connection and tested it while offline to ensure it matches the expected behavior (i.e. verify the default avatar icon is displayed if app is offline)
    • I tested this PR with a High Traffic account against the staging or production API to ensure there are no regressions (e.g. long loading states that impact usability).
  • I included screenshots or videos for tests on all platforms
  • I ran the tests on all platforms & verified they passed on:
    • Android: Native
    • Android: mWeb Chrome
    • iOS: Native
    • iOS: mWeb Safari
    • MacOS: Chrome / Safari
  • I verified there are no console errors (if there's a console error not related to the PR, report it or open an issue for it to be fixed)
  • I followed proper code patterns (see Reviewing the code)
    • I verified that any callback methods that were added or modified are named for what the method does and never what callback they handle (i.e. toggleReport and not onIconClick)
    • I verified that comments were added to code that is not self explanatory
    • I verified that any new or modified comments were clear, correct English, and explained "why" the code was doing something instead of only explaining "what" the code was doing.
    • I verified any copy / text shown in the product is localized by adding it to src/languages/* files and using the translation method
      • If any non-english text was added/modified, I used JaimeGPT to get English > Spanish translation. I then posted it in #expensify-open-source and it was approved by an internal Expensify engineer. Link to Slack message:
    • I verified all numbers, amounts, dates and phone numbers shown in the product are using the localization methods
    • I verified any copy / text that was added to the app is grammatically correct in English. It adheres to proper capitalization guidelines (note: only the first word of header/labels should be capitalized), and is either coming verbatim from figma or has been approved by marketing (in order to get marketing approval, ask the Bug Zero team member to add the Waiting for copy label to the issue)
    • I verified proper file naming conventions were followed for any new files or renamed files. All non-platform specific files are named after what they export and are not named "index.js". All platform-specific files are named for the platform the code supports as outlined in the README.
    • I verified the JSDocs style guidelines (in STYLE.md) were followed
  • If a new code pattern is added I verified it was agreed to be used by multiple Expensify engineers
  • I followed the guidelines as stated in the Review Guidelines
  • I tested other components that can be impacted by my changes (i.e. if the PR modifies a shared library or component like Avatar, I verified the components using Avatar are working as expected)
  • I verified all code is DRY (the PR doesn't include any logic written more than once, with the exception of tests)
  • I verified any variables that can be defined as constants (ie. in CONST.ts or at the top of the file that uses the constant) are defined as such
  • I verified that if a function's arguments changed that all usages have also been updated correctly
  • If any new file was added I verified that:
    • The file has a description of what it does and/or why is needed at the top of the file if the code is not self explanatory
  • If a new CSS style is added I verified that:
    • A similar style doesn't already exist
    • The style can't be created with an existing StyleUtils function (i.e. StyleUtils.getBackgroundAndBorderStyle(theme.componentBG))
  • If new assets were added or existing ones were modified, I verified that:
    • The assets are optimized and compressed (for SVG files, run npm run compress-svg)
    • The assets load correctly across all supported platforms.
  • If the PR modifies code that runs when editing or sending messages, I tested and verified there is no unexpected behavior for all supported markdown - URLs, single line code, code blocks, quotes, headings, bold, strikethrough, and italic.
  • If the PR modifies a generic component, I tested and verified that those changes do not break usages of that component in the rest of the App (i.e. if a shared library or component like Avatar is modified, I verified that Avatar is working as expected in all cases)
  • If the PR modifies a component related to any of the existing Storybook stories, I tested and verified all stories for that component are still working as expected.
  • If the PR modifies a component or page that can be accessed by a direct deeplink, I verified that the code functions as expected when the deeplink is used - from a logged in and logged out account.
  • If the PR modifies the UI (e.g. new buttons, new UI components, changing the padding/spacing/sizing, moving components, etc) or modifies the form input styles:
    • I verified that all the inputs inside a form are aligned with each other.
    • I added Design label and/or tagged @Expensify/design so the design team can review the changes.
  • If a new page is added, I verified it's using the ScrollView component to make it scrollable when more elements are added to the page.
  • I added unit tests for any new feature or bug fix in this PR to help automatically prevent regressions in this user flow.
  • If the main branch was merged into this PR after a review, I tested again and verified the outcome was still expected according to the Test steps.

Screenshots/Videos

Android: Native

N/A — This change is hook/logic only (no UI component changes). The thinking indicator renders identically across all platforms via existing components.

Android: mWeb Chrome

N/A — Hook/logic change only, tested via web screenshots below.

iOS: Native

N/A — Hook/logic change only, tested via web screenshots below.

iOS: mWeb Safari

N/A — Hook/logic change only, tested via web screenshots below.

MacOS: Chrome / Safari

Pusher disconnect recovery test (exact production failure mode):

Uses window.getPusherInstance().disconnect() to drop only the WebSocket while keeping HTTP alive — this is exactly what happens in production when Pusher has a brief hiccup.

Step Screenshot
1. Message sent, "Concierge is thinking..." visible
2. Pusher disconnected — indicator persists (app still "online", only WebSocket dead)
3. 60s later — indicator STILL showing (Concierge responded but Pusher cannot deliver)
4. Pusher reconnected — response recovered in 5s, indicator cleared

Video (full Pusher-drop cycle, 60s disconnected):
https://github.com/Expensify/Expensify/releases/download/untagged-612534-polling-1774719056/pusher-drop-recovery.webm

Key result: The thinking indicator persisted for the full 60s while Pusher was disconnected. After Pusher reconnected:

  • Response arrived at 5s (via getNewerActions polling — 5 calls detected)
  • Indicator cleared at 55s (2nd poll cycle detected actorAccountID === CONCIERGE)
  • 5 AuthenticatePusher + 5 GetNewerActions calls confirmed recovery mechanism

Note on offline behavior: When fully offline (setOffline(true)), the indicator is intentionally hidden (!isOffline in isProcessing) — this is the original design. The bug only manifests during WebSocket drops where the user appears online but misses Pusher events.

Video (full Pusher-drop cycle with passing result):
https://github.com/Expensify/Expensify/releases/download/untagged-612534-polling-1774719056/pusher-drop-final.webm

Test Evidence (Automated Unit Tests)

| Test Suite | Count | Result |
|---|---|---|
| Basic functionality | 3 | PASS |
| Kickoff waiting indicator | 4 | PASS |
| Reasoning via Pusher | 4 | PASS |
| Pusher lifecycle | 3 | PASS |
| Batched Onyx updates (stuck indicator fix) | 1 | PASS |
| Server label transitions | 1 | PASS |
| Final response handling | 2 | PASS |
| Safety timeout (polling pattern) | 4 | PASS |
| Reconnect reset | 1 | PASS |
| NVPIndicatorVersionTracker removal | 1 | PASS |
| Total | 24 | ALL PASS |

Investigation Summary

Root cause: The agentZeroProcessingRequestIndicator NVP is set by the server when Concierge starts processing and cleared when it finishes. If the client misses the clear event (WebSocket drop, tab backgrounded, Pusher hiccup), the local Onyx cache retains the stale non-empty value permanently — there was no recovery mechanism.

Solution approach (informed by 17 industry sources on real-time indicator patterns):

  • Every major platform (WhatsApp, Discord, Slack, XMPP, PubNub, Firebase) treats processing indicators as ephemeral with client-side TTL
  • Pusher does NOT replay missed messages on reconnection
  • The getNewerActions API was already available and returns NVPs when the Auth-side companion PR is included
  • A polling pattern with safety timeout is the standard recovery mechanism

Companion PR: Auth #20717 — Includes report NVPs in GetNewerActions response so the HTTP fallback can recover the current indicator state.

…dates

When a client misses real-time Pusher events and catches up via
GetMissingOnyxMessages, Onyx batches the SET and CLEAR merges into a
single notification with the final (empty) value. The hook never sees
the intermediate non-empty server label, so optimisticStartTime is
never cleared — leaving the indicator stuck permanently.

Fix: track NVP write count via a direct Onyx.connect subscription.
When the indicator NVP is written (even if the final rendered value is
empty), increment a version counter. On kickoff, snapshot the counter.
The effect compares versions to detect server activity that React
batching would otherwise hide.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
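
A rough sketch of the write-count idea in this commit message. It assumes a hypothetical `recordIndicatorWrite` that the Onyx subscription callback would call on every write to the indicator NVP; none of the names below are taken from the PR's code.

```typescript
// Hypothetical sketch: count writes per report so a coalesced SET+CLEAR is still observable.
const writeVersions = new Map<string, number>();

// Would be called from the Onyx.connect callback on every write, even when the final value is empty.
function recordIndicatorWrite(reportID: string): void {
    writeVersions.set(reportID, (writeVersions.get(reportID) ?? 0) + 1);
}

function getWriteVersion(reportID: string): number {
    return writeVersions.get(reportID) ?? 0;
}

// On kickoff the caller snapshots the version; any later bump means the server wrote to the NVP,
// even if the rendered value never changed from empty.
function hasServerActivitySince(reportID: string, snapshot: number): boolean {
    return getWriteVersion(reportID) > snapshot;
}
```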
@marcochavezf marcochavezf requested review from a team as code owners March 18, 2026 07:45
@melvin-bot melvin-bot bot requested a review from abdulrahuman5196 March 18, 2026 07:45
@melvin-bot

melvin-bot bot commented Mar 18, 2026

@abdulrahuman5196 Please copy/paste the Reviewer Checklist from here into a new comment on this PR and complete it. If you have the K2 extension, you can simply click: [this button]

@melvin-bot melvin-bot bot requested review from trjExpensify and removed request for a team March 18, 2026 07:45

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1dffb249ec

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@trjExpensify
Contributor

This PR doesn't have any new product considerations since it is a code clean-up PR. Unassigning and unsubscribing myself.

@trjExpensify trjExpensify removed their request for review March 18, 2026 16:26
marcochavezf and others added 2 commits March 19, 2026 22:36
Move NVP version tracking from the hook into a dedicated
NVPIndicatorVersionTracker lib to satisfy the ESLint rule
that restricts Onyx.connect() to /src/libs/**. This also
fixes the TypeScript callback type mismatch. Replace
"backgrounded" with "in the background" for spellcheck.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…early return

- Change connectionMap type from Map<string, number> to Map<string, Connection>
- Import Connection type from react-native-onyx
- Use early return pattern in NVP version listener callback

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov

codecov bot commented Mar 20, 2026

Codecov Report

❌ Looks like you've decreased code coverage for some files. Please write tests to increase, or at least maintain, the existing level of code coverage. See our documentation here for how to interpret this table.

| Files with missing lines | Coverage Δ |
|---|---|
| src/libs/ConciergeReasoningStore.ts | 100.00% <100.00%> (ø) |
| src/pages/inbox/AgentZeroStatusContext.tsx | 100.00% <100.00%> (+2.43%) ⬆️ |
| ...s/settings/Wallet/PersonalCards/AddNewCardPage.tsx | 0.00% <0.00%> (ø) |
| src/hooks/useAgentZeroStatusIndicator.ts | 85.71% <85.71%> (ø) |
| src/libs/actions/Report/index.ts | 67.61% <4.54%> (-0.81%) ⬇️ |
... and 8 files with indirect coverage changes

…sends

The version-snapshot approach failed when two messages were sent rapidly:
msg1's server response bumped the version past msg2's snapshot, clearing
the thinking indicator prematurely. Replaced with a pending-request counter
that increments on each kickoff and decrements when a full server roundtrip
(SET+CLEAR = 2 version bumps) completes, so the indicator persists until
all outstanding requests are processed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
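
The pending-request counter described in this commit message might look roughly like the following. `createPendingTracker` and its methods are hypothetical names, and ignoring version bumps while nothing is pending is an assumption rather than confirmed behavior.

```typescript
// Hypothetical sketch of a pending-request counter (illustrative names only).
const BUMPS_PER_ROUNDTRIP = 2; // one SET plus one CLEAR

function createPendingTracker() {
    let pending = 0;
    let bumps = 0;

    return {
        // Called each time the user sends a message that starts a Concierge request.
        kickoff(): void {
            pending += 1;
        },
        // Called on every write to the indicator NVP.
        onVersionBump(): void {
            if (pending === 0) {
                return; // assumption: server writes with no outstanding request are ignored
            }
            bumps += 1;
            if (bumps === BUMPS_PER_ROUNDTRIP) {
                bumps = 0;
                pending -= 1; // one full SET+CLEAR roundtrip completed
            }
        },
        isThinking(): boolean {
            return pending > 0;
        },
    };
}
```

With two rapid sends, the first roundtrip only brings the counter from 2 to 1, so the indicator stays visible until the second roundtrip completes.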
@marcochavezf marcochavezf changed the title Fix stuck Concierge thinking indicator on batched Onyx updates [HOLD] Fix stuck Concierge thinking indicator on batched Onyx updates Mar 25, 2026
marcochavezf and others added 3 commits March 24, 2026 22:02
…stuck indicator fix

The thinking indicator is ephemeral data stored in a durable system (Onyx NVPs).
When the client misses the server CLEAR update (Onyx batching coalesces SET+CLEAR,
Pusher reconnect delivers stale state, or the CLEAR is dropped), the indicator gets
stuck permanently.

This implements the "lease pattern" from distributed systems: every state assertion
is time-bounded. Without renewal or explicit clear, the indicator auto-expires.

Changes:
- Add 60s safety timeout that auto-clears the indicator when no CLEAR arrives
- Reset timer on each new server label (lease renewal)
- Clear indicator on network reconnect (like typing indicators)
- Remove NVPIndicatorVersionTracker entirely (TTL handles all its failure modes)
- Add clearAgentZeroProcessingIndicator action to Report actions
- Add 6 new test cases covering TTL, reconnect, and version tracker removal

The 60s timeout is appropriate because AI processing takes 30-45s on dev,
and XMPP uses 30s for composing→paused transitions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
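
As an illustration of the lease idea in this commit message, the indicator state can be modeled as a label with an expiry time. `Lease`, `grantLease`, and `currentLabel` are hypothetical names, and the label strings in the usage note are made up; only the 60s duration and the renew-on-new-label rule come from the commit text.

```typescript
// Hypothetical sketch of a lease: a state assertion that expires unless renewed.
const LEASE_MS = 60_000; // the 60s safety timeout from this commit

type Lease = {label: string; expiresAt: number};

// Granting a lease for a new server label also acts as renewal: the expiry moves forward.
function grantLease(label: string, now: number): Lease {
    return {label, expiresAt: now + LEASE_MS};
}

// Returns the label to display, or '' once the lease lapsed or was explicitly cleared (null).
function currentLabel(lease: Lease | null, now: number): string {
    if (lease === null || now >= lease.expiresAt) {
        return '';
    }
    return lease.label;
}
```

For example, a lease granted at `t = 0` for `'Thinking'` still shows at `t = 59_999` and is empty at `t = 60_000`; a new label arriving at `t = 50_000` extends visibility to `t = 110_000`.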
The spellcheck CI job flags XMPP (a messaging protocol referenced in
code comments) as an unknown word. Add it to the allowed list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@marcochavezf marcochavezf changed the title [HOLD] Fix stuck Concierge thinking indicator on batched Onyx updates [WIP] Fix stuck Concierge thinking indicator on batched Onyx updates Mar 25, 2026
marcochavezf and others added 2 commits March 26, 2026 23:21
…nects

When the 60s TTL safety timer expires or Pusher reconnects, the indicator
is cleared but the Concierge response that triggered the CLEAR may also
have been dropped. Call getNewerActions to pull missed report actions via
HTTP so the user sees the response without a manual refresh.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ator

Resolve merge conflicts after main refactored useAgentZeroStatusIndicator
hook into AgentZeroStatusContext. Keep both: the context (from main) for
standard status display, and the hook with safety timeout/TTL (from this
branch) for the stuck indicator fix. Restore ConciergeReasoningStore.ts
and add mocks for the combined test file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@marcochavezf marcochavezf marked this pull request as draft March 27, 2026 13:43
marcochavezf and others added 6 commits March 27, 2026 08:21
Long Concierge responses can take up to 2 minutes. A hard 60s TTL
incorrectly clears a legitimate in-progress response. Instead, use a
progressive retry schedule:

- 60s: Call getNewerActions, keep indicator showing
- 90s: Retry getNewerActions, keep indicator showing
- 120s: Final retry — if still no new actions AND online, clear indicator

If the Concierge reply arrives at any point (via Pusher or getNewerActions
response), the normal Onyx update clears the indicator automatically.
If offline at final retry, skip the clear — the response may arrive
when reconnected (onReconnect handler covers that case).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
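
The clear condition for the progressive schedule in this commit message can be written as one predicate. This is a sketch only: `RETRY_AT_MS` and `shouldHardClear` are hypothetical names, and `hasNewActions` stands in for however the hook detects that a retry returned something.

```typescript
// Hypothetical sketch of the 60s / 90s / 120s schedule described in this commit.
const RETRY_AT_MS = [60_000, 90_000, 120_000];

// `attempt` is the zero-based index into RETRY_AT_MS of the retry that just fired.
// Only the final retry may clear, and only when online with nothing new fetched.
function shouldHardClear(attempt: number, isOffline: boolean, hasNewActions: boolean): boolean {
    const isFinalRetry = attempt === RETRY_AT_MS.length - 1;
    return isFinalRetry && !isOffline && !hasNewActions;
}
```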
The hook now uses progressive retry intervals (60s → 90s → 120s) instead
of a single 60s hard-clear timeout. Update the two failing tests to verify
that intermediate retries (60s, 90s) only poll via getNewerActions while
the indicator stays visible, and only the final retry (120s) clears the
indicator. Also add getNewerActions to the Report mock.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of three separate setTimeout calls at 60s/90s/120s, poll
getNewerActions every 30s via setInterval while the thinking indicator
is visible. This is simpler and more robust against WebSocket drops.
The 120s safety timeout is kept as a hard clear fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The onReconnect handler was clearing the indicator immediately, which
caused the thinking state to disappear while offline. Now it fetches
newer actions and restarts polling without clearing — the indicator
stays visible until the actual server CLEAR arrives via Onyx.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ear on reconnect

The original design intentionally hides 'Concierge is thinking...' while
offline (!isOffline in isProcessing). This is correct UX — showing a
thinking indicator when the user can't receive a response is misleading.

On reconnect, the indicator reappears naturally if the server NVP still
has processing state. getNewerActions fetches any missed responses, and
polling restarts as safety net.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
marcochavezf and others added 4 commits March 28, 2026 23:49
…Option C)

Replace duplicated Pusher/reasoning/polling logic in the Context with a thin
wrapper that delegates to the hook. Update tests to use ConciergeReasoningStore
and subscribeToReportReasoningEvents instead of direct Pusher mocks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of waiting for the next 30s poll cycle to detect the Concierge
actorAccountID, watch the newestReportAction Onyx subscription directly.
When getNewerActions fetches the response and Onyx merges it, the useEffect
fires immediately and clears the indicator. Reduces clear delay from ~55s
to <5s after reconnect.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
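
The check behind this commit can be sketched as a pure function. `shouldClearOnNewestAction` is a hypothetical name, and the Concierge account ID is passed in as a parameter rather than assuming the app's constant. In the hook, a check like this would presumably run inside a `useEffect` that re-fires when the newest action's actor changes.

```typescript
// Hypothetical sketch: decide whether a change in the newest report action should clear the indicator.
function shouldClearOnNewestAction(
    newestActorAccountID: number | undefined,
    conciergeAccountID: number,
    isProcessing: boolean,
): boolean {
    if (!isProcessing) {
        return false; // nothing to clear
    }
    // The newest action being authored by Concierge means the reply has landed in Onyx.
    return newestActorAccountID === conciergeAccountID;
}
```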
The useEffect referencing optimisticStartTime and clearPolling was
placed before they were declared, causing a ReferenceError (temporal
dead zone). Moved to after all dependencies are defined.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use early returns instead of nested if, and newestActorAccountID
instead of the full newestReportAction object in the dependency array.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@marcochavezf marcochavezf changed the title [WIP] Fix stuck Concierge thinking indicator on batched Onyx updates Fix stuck Concierge thinking indicator when client misses Onyx clear update Mar 30, 2026
@marcochavezf marcochavezf marked this pull request as ready for review March 30, 2026 18:27

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0790eec47c

marcochavezf and others added 2 commits March 30, 2026 16:50
- P1: Reset optimistic state (pendingOptimisticRequests) when clearing on Concierge reply
- P2: Scope counter to specific requests (increment/reset pattern)
- P2: Preserve Pusher subscription handle for proper per-callback cleanup
- CLEAN-REACT-PATTERNS-0: Remove manual useCallback/useMemo (React Compiler handles it)
- PERF-14: Replace dual useEffect with useSyncExternalStore for ConciergeReasoningStore

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The useSyncExternalStore hook requires stable references for its
subscribe and getSnapshot callbacks. Without React Compiler active
in the test environment, these functions were recreated every render,
causing infinite re-subscribe loops.

Three fixes:
- Wrap subscribe/getSnapshot in useCallback for useSyncExternalStore
- Return stable EMPTY_ENTRIES reference from ConciergeReasoningStore
  instead of creating a new [] on each getSnapshot call
- Memoize kickoffWaitingIndicator, return value, and context values

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
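
To illustrate why the stable `EMPTY_ENTRIES` reference matters, here is a minimal in-memory store whose snapshots keep the same identity until the data changes. It is a sketch under assumptions, not the contents of ConciergeReasoningStore.ts: the `ReasoningEntry` shape and `addReasoning` are hypothetical, while `subscribe`, `getReasoningHistory`, and `EMPTY_ENTRIES` reuse names mentioned in this thread.

```typescript
// Hypothetical sketch of an in-memory store whose snapshots are referentially stable.
type ReasoningEntry = {reasoning: string};
type Listener = (reportID: string) => void;

// One shared empty array, so repeated reads for an unknown report compare equal with ===.
const EMPTY_ENTRIES: readonly ReasoningEntry[] = Object.freeze([]);
const entriesByReport = new Map<string, readonly ReasoningEntry[]>();
const listeners = new Set<Listener>();

function subscribe(listener: Listener): () => void {
    listeners.add(listener);
    return () => {
        listeners.delete(listener);
    };
}

// Returns the same array reference until the data for that report changes.
function getReasoningHistory(reportID: string): readonly ReasoningEntry[] {
    return entriesByReport.get(reportID) ?? EMPTY_ENTRIES;
}

function addReasoning(reportID: string, entry: ReasoningEntry): void {
    // Replace rather than mutate, so a changed snapshot gets a new reference.
    entriesByReport.set(reportID, [...getReasoningHistory(reportID), entry]);
    listeners.forEach((listener) => listener(reportID));
}
```

If `getReasoningHistory` returned a fresh `[]` on every call, `useSyncExternalStore` would see a different snapshot on each render and keep re-rendering.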
@marcochavezf marcochavezf changed the title Fix stuck Concierge thinking indicator when client misses Onyx clear update [HOLD] Fix stuck Concierge thinking indicator when client misses Onyx clear update Mar 30, 2026
marcochavezf and others added 2 commits April 3, 2026 14:38
… memoization)

React Compiler is enabled and this file compiles successfully, making manual
useCallback and useMemo wrappers redundant. The compiler automatically memoizes
closures and derived values based on their captured variables.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ix-stuck-thinking-indicator

# Conflicts:
#	src/pages/inbox/AgentZeroStatusContext.tsx
@marcochavezf marcochavezf changed the title [HOLD] Fix stuck Concierge thinking indicator when client misses Onyx clear update Fix stuck Concierge thinking indicator when client misses Onyx clear update Apr 6, 2026
@marcochavezf
Contributor Author

Hi @abdulrahuman5196, the last backend PR has been deployed to staging, so this PR is ready for review.

@marcochavezf marcochavezf requested review from situchan and removed request for abdulrahuman5196 April 7, 2026 17:40
@MelvinBot
Contributor

PR Review

Overall this is a solid architectural improvement — extracting the indicator logic into a dedicated hook with polling recovery is the right approach for handling missed Pusher events. A few issues to address:

Issues

1. useSyncExternalStore subscribe/snapshot instability (bug)

subscribeToReasoningStore and getReasoningSnapshot are recreated on every render inside the hook body (useAgentZeroStatusIndicator.ts:201-211). useSyncExternalStore requires a stable subscribe reference — if it changes between renders, React unsubscribes and resubscribes on every render cycle, which can cause infinite re-render loops or excessive work.

Every other useSyncExternalStore usage in this codebase (Accessibility/index.ts, CardPINStore.ts, useViewportOffsetTop) uses module-level stable functions.

Since subscribeToReasoningStore closes over reportID, you could use useCallback or restructure to hoist the filtering logic. For example:

// Stable subscribe callback inside the hook; it is recreated only when reportID changes
const subscribeToReasoningStore = useCallback((onStoreChange: () => void) => {
    return ConciergeReasoningStore.subscribe((updatedReportID) => {
        if (updatedReportID !== reportID) {
            return;
        }
        onStoreChange();
    });
}, [reportID]);

const getReasoningSnapshot = useCallback(
    () => ConciergeReasoningStore.getReasoningHistory(reportID),
    [reportID]
);

Or better: make ConciergeReasoningStore.subscribe not take a reportID filter, subscribe at module level, and use the snapshot to derive per-report data.

2. displayedLabel in its own effect's dependency array (potential loop)

The main useEffect (useAgentZeroStatusIndicator.ts:227-299) has displayedLabel in its dependency array but also calls setDisplayedLabel. When setDisplayedLabel fires, displayedLabel changes, re-triggering the effect. The guard if (displayedLabel !== newLabel) prevents an infinite loop in most cases, but the effect still runs an extra time unnecessarily on every label change. Consider using a ref (displayedLabelRef) to check the current value inside the effect (like the original code did) and remove displayedLabel from the deps array.

3. Unnecessary XMPP cspell entry

As situchan noted — XMPP doesn't appear anywhere in the codebase. It's only referenced in the PR description. This entry should be removed.

4. startPolling / clearPolling closure stability

These functions are defined inline and capture refs + state. The eslint-disable comment says "stable via React Compiler," but this is fragile — if React Compiler isn't applied or the function identity changes, the effect will re-run and restart polling unexpectedly. At minimum, document which React Compiler version/config guarantees this, or wrap in useCallback.

Minor / Non-blocking

  • The pendingOptimisticRequests counter approach is a good improvement over the single-timestamp model for handling rapid successive messages.
  • ConciergeReasoningStore.ts is clean and well-structured. The EMPTY_ENTRIES stable reference for useSyncExternalStore compatibility is correct.
  • The Pusher subscription handle tracking in Report/index.ts (reasoningSubscriptions Map) properly prevents duplicate subscriptions and uses per-callback cleanup — good.
  • Tests are comprehensive (24 tests covering polling, safety timeout, reconnect, etc.).

Next Steps: Reply with @MelvinBot <your feedback> to discuss any of these points.

- Resolve merge conflict in AgentZeroStatusContext.tsx: adopt main chatType selector instead of prop, keep hook-based architecture
- Remove unnecessary XMPP entry from cspell.json (situchan review)
- Remove useMemo wrapper in AgentZeroStatusGate (React Compiler handles it)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@situchan
Contributor

situchan commented Apr 8, 2026

Reviewer Checklist

  • I have verified the author checklist is complete (all boxes are checked off).
  • I verified the correct issue is linked in the ### Fixed Issues section above
  • I verified testing steps are clear and they cover the changes made in this PR
    • I verified the steps for local testing are in the Tests section
    • I verified the steps for Staging and/or Production testing are in the QA steps section
    • I verified the steps cover any possible failure scenarios (i.e. verify an input displays the correct error message if the entered data is not correct)
    • I turned off my network connection and tested it while offline to ensure it matches the expected behavior (i.e. verify the default avatar icon is displayed if app is offline)
  • I checked that screenshots or videos are included for tests on all platforms
  • I included screenshots or videos for tests on all platforms
  • I verified that the composer does not automatically focus or open the keyboard on mobile unless explicitly intended. This includes checking that returning the app from the background does not unexpectedly open the keyboard.
  • I verified tests pass on all platforms & I tested again on:
    • Android: HybridApp
    • Android: mWeb Chrome
    • iOS: HybridApp
    • iOS: mWeb Safari
    • MacOS: Chrome / Safari
    • MacOS: Desktop
  • If there are any errors in the console that are unrelated to this PR, I either fixed them (preferred) or linked to where I reported them in Slack
  • I verified there are no new alerts related to the canBeMissing param for useOnyx
  • I verified proper code patterns were followed (see Reviewing the code)
    • I verified that any callback methods that were added or modified are named for what the method does and never what callback they handle (i.e. toggleReport and not onIconClick).
    • I verified that comments were added to code that is not self explanatory
    • I verified that any new or modified comments were clear, correct English, and explained "why" the code was doing something instead of only explaining "what" the code was doing.
    • I verified any copy / text shown in the product is localized by adding it to src/languages/* files and using the translation method
    • I verified all numbers, amounts, dates and phone numbers shown in the product are using the localization methods
    • I verified any copy / text that was added to the app is grammatically correct in English. It adheres to proper capitalization guidelines (note: only the first word of header/labels should be capitalized), and is either coming verbatim from figma or has been approved by marketing (in order to get marketing approval, ask the Bug Zero team member to add the Waiting for copy label to the issue)
    • I verified proper file naming conventions were followed for any new files or renamed files. All non-platform specific files are named after what they export and are not named "index.js". All platform-specific files are named for the platform the code supports as outlined in the README.
    • I verified the JSDocs style guidelines (in STYLE.md) were followed
  • If a new code pattern is added I verified it was agreed to be used by multiple Expensify engineers
  • I verified that this PR follows the guidelines as stated in the Review Guidelines
  • I verified other components that can be impacted by these changes have been tested, and I retested again (i.e. if the PR modifies a shared library or component like Avatar, I verified the components using Avatar have been tested & I retested again)
  • I verified all code is DRY (the PR doesn't include any logic written more than once, with the exception of tests)
  • I verified any variables that can be defined as constants (ie. in CONST.ts or at the top of the file that uses the constant) are defined as such
  • If a new component is created I verified that:
    • A similar component doesn't exist in the codebase
    • All props are defined accurately and each prop has a /** comment above it */
    • The file is named correctly
    • The component has a clear name that is non-ambiguous and the purpose of the component can be inferred from the name alone
    • The only data being stored in the state is data necessary for rendering and nothing else
    • For Class Components, any internal methods passed to components event handlers are bound to this properly so there are no scoping issues (i.e. for onClick={this.submit} the method this.submit should be bound to this in the constructor)
    • Any internal methods bound to this are necessary to be bound (i.e. avoid this.submit = this.submit.bind(this); if this.submit is never passed to a component event handler like onClick)
    • All JSX used for rendering exists in the render method
    • The component has the minimum amount of code necessary for its purpose, and it is broken down into smaller components in order to separate concerns and functions
  • If any new file was added I verified that:
    • The file has a description of what it does and/or why it is needed at the top of the file if the code is not self explanatory
  • If a new CSS style is added I verified that:
    • A similar style doesn't already exist
    • The style can't be created with an existing StyleUtils function (i.e. StyleUtils.getBackgroundAndBorderStyle(theme.componentBG))
  • If the PR modifies code that runs when editing or sending messages, I tested and verified there is no unexpected behavior for all supported markdown - URLs, single line code, code blocks, quotes, headings, bold, strikethrough, and italic.
  • If the PR modifies a generic component, I tested and verified that those changes do not break usages of that component in the rest of the App (i.e. if a shared library or component like Avatar is modified, I verified that Avatar is working as expected in all cases)
  • If the PR modifies a component related to any of the existing Storybook stories, I tested and verified all stories for that component are still working as expected.
  • If the PR modifies a component or page that can be accessed by a direct deeplink, I verified that the code functions as expected when the deeplink is used - from a logged in and logged out account.
  • If the PR modifies the UI (e.g. new buttons, new UI components, changing the padding/spacing/sizing, moving components, etc) or modifies the form input styles:
    • I verified that all the inputs inside a form are aligned with each other.
    • I added Design label and/or tagged @Expensify/design so the design team can review the changes.
  • If a new page is added, I verified it's using the ScrollView component to make it scrollable when more elements are added to the page.
  • For any bug fix or new feature in this PR, I verified that sufficient unit tests are included to prevent regressions in this flow.
  • If the main branch was merged into this PR after a review, I tested again and verified the outcome was still expected according to the Test steps.
  • I have checked off every checkbox in the PR reviewer checklist, including those that don't apply to this PR.

Screenshots/Videos

Android: HybridApp
android.mov
Android: mWeb Chrome
mchrome.mov
iOS: HybridApp
ios.mov
iOS: mWeb Safari
msafari.mov
MacOS: Chrome / Safari
web.mov

@situchan
Contributor

situchan commented Apr 8, 2026

TypeScript failing

PR #87077 refactored AddNewCardPage to useConfirmModal but left a
dangling isModalVisible reference (removed the useState but not the
usage). This breaks typecheck and ESLint on any branch that merges
main after that commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@marcochavezf marcochavezf requested a review from situchan April 8, 2026 21:02
@marcochavezf
Contributor Author

Updated!

@situchan
Contributor

situchan commented Apr 8, 2026

Concierge thinking indicator is still stuck for this scenario:

  1. Send any message to Concierge chat
  2. Immediately go offline
  3. Concierge thinking indicator disappears
  4. Go online
  5. Concierge thinking indicator appears again even though the Concierge message has already been received and displayed
  6. Concierge thinking indicator stays stuck indefinitely
Screen.Recording.2026-04-09.at.3.32.54.AM.mov

        break;
    default:
-        CurrentStep = <SelectCountryStep disableAutoFocus={isModalVisible} />;
+        CurrentStep = <SelectCountryStep />;
Contributor

Undo this change and merge main. They fixed it another way in #87431.

@MelvinBot
Contributor

I compared the original AgentZeroStatusContext.tsx (main) against the new useAgentZeroStatusIndicator.ts + ConciergeReasoningStore.ts. Here are the findings:

1. Missing: Clear optimistic state on network reconnect

Original (main AgentZeroStatusContext.tsx) — when going from offline → online, immediately cleared the optimistic indicator:

if (!isOffline && optimisticStartTime) {
    setOptimisticStartTime(null);
}

New (useAgentZeroStatusIndicator.ts) — the onReconnect callback fetches newer actions and starts polling, but does not clear pendingOptimisticRequests. The indicator persists after reconnect until either polling detects a Concierge reply (up to 30s) or the safety timeout fires (120s).

This may be intentional (keep showing "thinking" until we confirm the response arrived rather than briefly hiding it), but it's a behavioral change from the original.
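
If the original clear-on-reconnect behavior is wanted, the check could look roughly like the sketch below. The function and parameter names are hypothetical. Restricting the check to the offline-to-online edge and using the counter instead of a timestamp are assumptions layered on top of the original snippet above.

```typescript
// Hypothetical helper: decides whether the optimistic indicator should be cleared.
// Adapts the original `!isOffline && optimisticStartTime` check to the new counter,
// and only fires on the offline -> online transition.
function shouldClearOnReconnect(wasOffline: boolean, isOffline: boolean, pendingOptimisticRequests: number): boolean {
    const justReconnected = wasOffline && !isOffline;
    return justReconnected && pendingOptimisticRequests > 0;
}
```

In the hook, a `true` result would gate a `setPendingOptimisticRequests(0)` call inside the reconnect callback. Whether hiding the indicator at that moment is desirable remains the open question raised above.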

2. pendingOptimisticRequests counter doesn't actually decrement per-reply

The comment says:

Each kickoffWaitingIndicator() call increments the counter; each Concierge reply decrements it.

But every code path that clears it uses setPendingOptimisticRequests(0) (full reset), not setPendingOptimisticRequests(prev => prev - 1). This means if a user sends 2 messages quickly (pendingOptimisticRequests === 2), the first Concierge reply resets to 0, prematurely clearing the indicator for the second in-flight message. Either the comment should be updated to match the behavior, or the reset-to-0 calls should be changed to decrements if per-request tracking is actually desired.
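
If per-request tracking is the goal, the counter updates could be modeled as a small reducer. This is a sketch with hypothetical event names, not the PR's code. It only contrasts decrement-per-reply (clamped at zero) with a full reset.

```typescript
// Hypothetical event names for illustration.
type IndicatorEvent = 'messageSent' | 'conciergeReplied' | 'safetyTimeout';

function nextPendingCount(prev: number, event: IndicatorEvent): number {
    switch (event) {
        case 'messageSent':
            // One more request is in flight.
            return prev + 1;
        case 'conciergeReplied':
            // Decrement per reply; clamp so a duplicate reply cannot go negative.
            return Math.max(0, prev - 1);
        case 'safetyTimeout':
            // Hard reset: any remaining requests are assumed lost.
            return 0;
        default:
            return prev;
    }
}
```

With this shape, two quick messages followed by one reply leave the count at 1, so the indicator would stay visible for the second in-flight message. In the hook, the same logic would be expressed through functional updates such as `setPendingOptimisticRequests((prev) => Math.max(0, prev - 1))`.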

3. displayedLabel in effect dependency array (minor efficiency)

The original used a displayedLabelRef mirror to avoid re-triggering the label-sync effect when the label changed. The new code includes displayedLabel directly in the effect's dependency array, causing one extra no-op effect cycle each time the label updates. Not a bug — it stabilizes after one extra run — but less efficient than the original's ref approach.

4. Original Pusher subscription bug (fixed in PR)

The original had addReasoning in the Pusher-subscribe useEffect dependency array, but addReasoning was recreated every render (not memoized), causing the Pusher listener to re-subscribe on every render. The new code correctly delegates to stable module-level functions (subscribeToReportReasoningEvents / ConciergeReasoningStore.addReasoning), fixing this issue.
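
The general idea of a module-level subscription registry can be sketched as follows. This is not the PR's implementation: the `Subscribe` type stands in for a Pusher channel subscription, and the two function names are hypothetical. Only the `reasoningSubscriptions` Map name is borrowed from the review above.

```typescript
// Hypothetical stand-in for a channel subscription: returns an unsubscribe function.
type Subscribe = (reportID: string, onEvent: (payload: string) => void) => () => void;

// Module-level registry: at most one live subscription per report, however often callers re-render.
const reasoningSubscriptions = new Map<string, () => void>();

function subscribeToReportReasoning(reportID: string, subscribe: Subscribe, onEvent: (payload: string) => void): void {
    if (reasoningSubscriptions.has(reportID)) {
        // Already subscribed; repeated calls are a no-op.
        return;
    }
    reasoningSubscriptions.set(reportID, subscribe(reportID, onEvent));
}

function unsubscribeFromReportReasoning(reportID: string): void {
    const unsubscribe = reasoningSubscriptions.get(reportID);
    if (!unsubscribe) {
        return;
    }
    unsubscribe();
    reasoningSubscriptions.delete(reportID);
}
```

Because these functions live at module scope, their identity never changes between renders, so an effect that depends on them does not re-run. This simplified version does no reference counting, so it assumes a single consumer per report.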


Summary: Finding #1 (missing reconnect clear) and #2 (counter comment vs. reset-to-0 behavior) are the most significant. Finding #3 is a minor efficiency note. Finding #4 is a pre-existing bug that this PR actually fixes.

@situchan
Contributor

situchan commented Apr 9, 2026

Concierge thinking indicator not showing at all on mobile

this branch:

Screen.Recording.2026-04-09.at.12.57.14.PM.mov

main branch:

Screen.Recording.2026-04-09.at.12.54.46.PM.mov
