fix(webview/whatsapp): IDB walk + DOM scrape + active-chat plumbing (#1017) by oxoxDev · Pull Request #1034 · tinyhumansai/openhuman

oxoxDev · 2026-04-29T20:25:32Z

Summary

WhatsApp Web memory ingest was silently dead on main: the IDB walker rejected every IndexedDB.requestData call with "Could not get index" because whatsapp_scanner/idb.rs:159 was sending an empty-string indexName that current CEF builds (146.x) reject. Slack and Telegram already shipped this exact fix months ago; only WhatsApp regressed.
Once the IDB walk worked, the DOM scrape still emitted zero rows because WhatsApp Web's HTML drifted. data-id is no longer the legacy <fromMe>_<chatId>_<msgId> triple — it's now bare msgId hex. span.selectable-text is gone (the existing span[dir] fallback already covers this; only the doc was stale). And the active chat's JID has stopped appearing on the URL, on data-id, and on any DOM attribute we could find — only the conversation header carries it.
This PR fixes both blockers and lays the merge plumbing for chatId recovery (active-chat-name extraction from header[data-testid="conversation-header"], chats-map reverse lookup, msgId-tail fallback, DOM-only chatId stamp).
End-to-end memory ingest is gated on a follow-up that closes IDB chat_names gaps (group-metadata id normalize, broadcast store walk, message-envelope pushName fallback for un-saved contacts) — see "Out of scope" below for the draft. The plumbing in this PR is defensive: when reverse-lookup misses, rows drop with no chatId exactly as before, no regression.

Problem

Issue #1017 asks for end-to-end WhatsApp Web parity. Static audit + manual smoke against pnpm dev:app exposed a chain of three blockers:

Bug 1 — IDB walk dead. Every full-tick scan logged [wa][idb] read message failed: cdp error: {"code":-32000,"message":"Could not get index"} for all four target stores (message, chat, contact, group-metadata), then full scan ok messages=0 chats=0 dom=0. RPC query openhuman.memory_recall_memories {namespace:"whatsapp-web:<acct>"} returned an empty array. The CDP spec says empty indexName means "primary key index", but the C++ backend in CEF 146 (Chrome 146.0.7680.165) rejects this; the field has to be omitted entirely. The same fix landed in slack_scanner/idb.rs:210-214 and telegram_scanner/idb.rs:210 previously, with explicit comments documenting the trap. WhatsApp drifted because the regression test wasn't there to catch it.
Bug 2 — DOM scrape returns zero. Live CDP probe (2026-04-30) revealed three drift points in WhatsApp Web's HTML since dom_snapshot.rs was last touched: (a) data-id format changed from "<fromMe>_<chatId>_<msgId>" to bare msgId hex ("AC2E44BDA…", 32 hex chars). The strict splitn(3, '_') matcher rejected every row. (b) span.selectable-text class is gone; bodies live in plain span[dir="ltr|rtl"] (the existing fallback matcher handled this; only the module doc was stale). (c) Active chat JID is no longer in URL, on data-id, or on any DOM attribute we could find — only header[data-testid="conversation-header"]'s first non-icon <span> carries the chat title.
Bug 6 (partial) — DOM↔IDB chatId correlation. Once Bugs 1 + 2 unblocked the data flow, the merge step still produced patched=0 appended=N because the DOM bare-msgId doesn't match the IDB compound _serialized directly (the bare msgId is the trailing segment after the last underscore — close but not exact-match) and DOM-only rows have no chatId to stamp. This PR plumbs both: a tail-segment fallback in the by-msg-id lookup and an active-chat-jid resolver from the conversation header reverse-looked-up against snap.chats. The plumbing is end-to-end runtime-verified for the title extraction and the merge logic; the chats-map gap that prevents Some(jid) resolution for some chat types (un-saved 1:1, broadcast lists, certain group ids) is tracked as a follow-up.

Solution

Six GPG-signed micro-commits, ordered trivial → bounded:

fix(webview/whatsapp): omit empty indexName in IndexedDB.requestData (#1017) — drop the line. Mirror the comment from the working sibling scanners. Slack and Telegram had this fix already; only WhatsApp regressed.
test(webview/whatsapp): lock the indexName-omission contract for IndexedDB.requestData (#1017) — regression test asserts the JSON payload omits indexName (so the trap can never silently come back).
docs(qa): add WhatsApp Web parity audit matrix (#1017) — initial smoke matrix.
fix(webview/whatsapp): adapt DOM scrape to current row + header markup (#1017) — accept both legacy compound and bare-msgId data-id shapes; new parse_active_chat_name walks the conversation header for the first non-icon <span> (skipping wds-icon / Material-style ligatures); module-level doc refreshed.
feat(webview/whatsapp): plumb active chat resolution + msgId-tail merge fallback (#1017) — ScanSnapshot.active_chat_name, exact → case-insensitive → substring chats-map reverse lookup, DOM-only chatId stamp, by-msg-id lookup that falls back to the trailing segment of the IDB compound id, one info! log per tick recording the resolution outcome.
docs(qa): refresh WhatsApp parity matrix with post-Bug-2 + Bug-6-plumbing state (#1017) — verdicts table updated; out-of-scope items reordered to point at the follow-up.

Runtime verification (sanitised)

Pre-fix:

[wa][<acct>] full scan ok messages=0 chats=0 dom=0

Post-Bug-1:

[wa][<acct>] full scan ok messages=20000 chats=2249 dom=0

Post-Bug-2 + Bug-6-plumbing:

[wa][<acct>] full scan ok messages=20000 chats=2251 dom=80
[wa][<acct>] active chat resolution: name=Some("<title>") → jid=… chats_in_map=2251
[wa][<acct>] dom-merge patched=0 appended=80 total=20080

The active chat resolution log shows the plumbing is in place; the jid=… value depends on whether the IDB chats map already has a name entry for the active chat, which is the gap the follow-up closes.

Out of scope (file as separate issues if not already tracked)

WhatsApp Bug 6 + 7 — IDB chat_names gaps block end-to-end memory ingest. Three sub-causes documented in .claude/scratch/whatsapp-bug-6-7-followup.md (group-metadata id normalize, broadcast store walk, message-envelope pushName fallback for un-saved 1:1 contacts). Estimated ~130 LOC across whatsapp_scanner/idb.rs + tests. To file immediately after this PR opens.
Bug 3 — Video forces download (criteria #3 + #8 Status). WhatsApp Web's <video> element is video/mp4 (H.264); CEF build lacks proprietary codecs so playback falls back to a download dialog. Build/packaging concern — not a code fix in this repo.
Bug 4 — Voice/video calls don't connect (criterion #4). Needs cross-browser control test (Safari/Chrome at web.whatsapp.com) before pinning on OpenHuman vs WhatsApp Web platform limits.
Bug 5 — Voice messages with empty body. Auto-resolves once the chats map covers all chat types (gated on the Bug 6+7 follow-up).
EU-locale date parser, hardcoded Chrome/124 UA drift, per-chat mute desync — all defer.

Submission Checklist

Unit tests — cargo test --lib whatsapp_scanner is green (20 passed, including new requestdata_params_omit_index_name, split_data_id_accepts_bare_msg_id, and split_data_id_accepts_long_alnum_msg_id regression tests).
E2E / integration — Manual smoke against pnpm dev:app on this branch tip (macOS arm64) walked all 11 acceptance criteria from [Feature] webview: WhatsApp — full end-to-end parity with native app #1017, exercised IDB walk + DOM scrape + chat resolution end-to-end, and captured the diagnostic output documented in docs/qa/WHATSAPP-PARITY.md.
Doc comments — added/refreshed: whatsapp_scanner/idb.rs (CEF 146 indexName trap with cross-references to slack/telegram counterparts), dom_snapshot.rs module-level doc + split_data_id doc + parse_active_chat_name invariants + looks_like_icon_ligature heuristic, mod.rs ScanSnapshot.active_chat_name field doc + active-chat-jid resolver explanation.
Inline comments — root-cause + non-obvious-trade-off comments at each fix site; commit messages cite line numbers where the trap lives.

Impact

Runtime: macOS / Windows / Linux desktop. Affects only the WhatsApp scanner — no shared-module changes. Other migrated providers (slack, telegram, discord, browserscan) are unchanged.
Compatibility: no migration. Accounts that were silently logging zero memory will start ingesting metadata on first scan tick after upgrade. The memory_doc_ingest path is upsert-shaped already, so re-running over old IDB messages is safe.
Performance: no new IDB reads or RPC calls. Active-chat resolution is one extra info! log per scan tick (~30s) plus an O(N) walk over the chats map (max ~2.5k entries observed in the wild) — negligible.
Security: no new attack surface. Active chat name is read from the page's own DOM by the existing DOMSnapshot.captureSnapshot call we were already making; no new injected scripts, no expanded permission grants.

Closes [Feature] webview: WhatsApp — full end-to-end parity with native app #1017
Follow-up PR(s)/TODOs:
- file the WhatsApp chat_names gaps issue (draft body redacted of personal names + numbers; ready for gh issue create).
- separate issues for video codec, voice/video calls, voice-msg empty body, locale date parser, UA drift, per-chat mute desync if/when smoke proves they need fixing on the OpenHuman side.

Summary by CodeRabbit

New Features
- Enhanced message capture with active conversation detection and display name extraction.
- Improved message ID parsing to support multiple formats.
Bug Fixes
- Fixed IndexedDB request failures caused by empty index parameters.
- Improved message body text selection accuracy and DOM parsing reliability.
- Better correlation of messages with their active conversations.
Tests
- Added unit tests for message ID validation and IndexedDB parameter handling.
Documentation
- Added WhatsApp Webview parity audit documentation with acceptance criteria and test procedures.

…inyhumansai#1017)

`whatsapp_scanner/idb.rs` sent `"indexName": ""` to CDP `IndexedDB.requestData`. The CDP spec says empty string means "use the primary-key index", but the C++ backend in CEF 146 (Chrome 146.0.7680.165) rejects this with `{"code":-32000,"message":"Could not get index"}` and refuses the call entirely. All four IDB walks (`message`, `chat`, `contact`, `group-metadata`) failed every tick, the WhatsApp scanner emitted zero memory docs, and `memory_recall_memories {namespace:"whatsapp-web:<acct>"}` stayed empty.

Slack and Telegram already shipped this exact fix months ago — see `slack_scanner/idb.rs:210-214` and `telegram_scanner/idb.rs:210`, both of which have an explicit comment block warning future contributors not to add the empty `indexName` back. Only WhatsApp regressed.

Drop the line. Add a matching comment block referencing the sibling scanners so this stays a one-time mistake.

Verified: `cargo test --lib whatsapp_scanner` is green (18 passed, including a new `requestdata_params_omit_index_name` regression test that asserts the JSON payload omits the field). Runtime verification (memory_recall_memories returning non-empty after a 30s scan tick) deferred to packaged-build smoke per `feedback_validation_test_target.md`.

…xedDB.requestData (tinyhumansai#1017)

Adds `requestdata_params_omit_index_name` to `whatsapp_scanner/idb_tests.rs`. Builds the same JSON payload `read_store` sends to CDP and asserts `params.get("indexName").is_none()`.

This is a regression test for the bug fixed in the previous commit: re-introducing `"indexName": ""` would silently break IDB ingestion with no compile-time signal, since the CDP call type is `Value` and the failure surfaces only at runtime as a `Could not get index` warning that's easy to miss in dev:app log noise.

The test message cites `slack_scanner/idb.rs:210-214` (where the same fix was first documented) so anyone tripping the test gets the historical context immediately.

11-row acceptance-criteria audit run against `pnpm dev:app` on `main` (b11b8f3+) before the fix in this PR, then logged with post-fix expectations.

Records: 7 pass / 1 partial (video forces download) / 3 fail (memory IDB ingest, calls don't connect, DOM ingest = 0 — the latter likely gated on the IDB fix).

Documents the one-line `indexName` fix that this PR ships and the four out-of-scope items deferred to separate child issues:

- Bug 2 (DOM = 0) — gated on Bug 1 verification
- Bug 3 (video codec) — CEF build/packaging concern, not a code change in this repo
- Bug 4 (calls) — needs cross-browser control test before pinning on OpenHuman
- Bug 5 (voice msg empty body) — auto-resolves once Bug 1 IDB walks succeed

Sign-off block included for the runtime-smoke verification on a packaged build.

tinyhumansai#1017)

Bug 2 in tinyhumansai#1017's matrix: post-Bug-1 the IDB walk worked but the DOM scan still emitted zero rows. Live CDP probe (2026-04-30) showed three drift points in WhatsApp Web's HTML since `dom_snapshot.rs` was last touched:

1. **`data-id` format** — message rows used to expose `"<fromMe>_<chatId>_<msgId>"`. Current builds publish only the bare msgId hex (e.g. `"AC2E44BDA…"`, 32 hex chars). The strict `splitn(3, '_')` matcher rejected every row → `parse_rows` returned empty. `split_data_id` now accepts both shapes; `from_me` and `chat_id` come back empty for the bare format and the merge in `mod.rs` reverse-looks them up by msgId-tail / active-chat header.
2. **`span.selectable-text` class** — the body text is now rendered without that class. The fallback `span[dir="ltr|rtl"]` matcher in `find_body` already covered this, but the doc and module-level comment were stale.
3. **Active chat name extraction** — modern WhatsApp Web omits chat JID from the URL, from `data-id`, and from any DOM attribute we could find. The only DOM signal that carries it is `header[data-testid="conversation-header"]`'s first non-icon `<span>`. New `parse_active_chat_name` walks the header subtree, skipping Material/`wds-icon` ligature spans (`wds-ic-search`, `wds-ic-disappearing-messages`, etc.) so the chat title wins. Returned alongside rows + hash from `capture_messages` so the caller can reverse-lookup chat JID via the IDB-side chats map.

New tests: `split_data_id_accepts_bare_msg_id`, `split_data_id_accepts_long_alnum_msg_id`, plus an extended reject case for non-message hooks. All pass (10/10 in `dom_snapshot::tests`).

Verified at runtime: `[wa][<acct>] full scan ok messages=20000 chats=2249 dom=80` post-fix (was `dom=0` pre-fix). Memory ingest is still gated on the chats-map reverse lookup (see follow-up issue tracking the IDB chat_names gaps); this commit is the DOM-side enabler.
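The two `data-id` shapes described in this commit can be illustrated with a small std-only sketch. It is not the PR's `split_data_id`: the name `split_data_id_sketch`, the `DataIdParts` struct, and the `MIN_BARE_MSG_ID_LEN` threshold are assumptions made for illustration, since the real validation rules are not shown on this page.

```rust
/// Parts recovered from a message row's `data-id` attribute.
/// Hypothetical type; the PR's real return type is not shown.
#[derive(Debug, PartialEq)]
struct DataIdParts {
    from_me: String, // "true"/"false" for legacy ids, empty for bare ids
    chat_id: String, // chat JID for legacy ids, empty for bare ids
    msg_id: String,
}

/// Assumed lower bound for a bare msgId. The actual rule may differ.
const MIN_BARE_MSG_ID_LEN: usize = 16;

/// Accept either "<fromMe>_<chatId>_<msgId>" or a bare alphanumeric msgId.
fn split_data_id_sketch(data_id: &str) -> Option<DataIdParts> {
    // Legacy compound shape: exactly three underscore-separated parts,
    // with the first one being the fromMe flag.
    let parts: Vec<&str> = data_id.splitn(3, '_').collect();
    if let [from_me, chat_id, msg_id] = parts.as_slice() {
        let legacy_ok = (*from_me == "true" || *from_me == "false")
            && !chat_id.is_empty()
            && !msg_id.is_empty();
        if legacy_ok {
            return Some(DataIdParts {
                from_me: from_me.to_string(),
                chat_id: chat_id.to_string(),
                msg_id: msg_id.to_string(),
            });
        }
    }

    // Bare shape: a single long alphanumeric token with no chat context.
    let bare_ok = data_id.len() >= MIN_BARE_MSG_ID_LEN
        && data_id.chars().all(|c| c.is_ascii_alphanumeric());
    if bare_ok {
        Some(DataIdParts {
            from_me: String::new(),
            chat_id: String::new(),
            msg_id: data_id.to_string(),
        })
    } else {
        None
    }
}
```

In this sketch anything that is neither a valid compound id nor a long alphanumeric token is rejected, which is one plausible way to keep non-message DOM hooks out of the row set.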

…ge fallback (tinyhumansai#1017)

Once Bug 2 unblocked the DOM scan, the merge step still produced `patched=0 appended=N` every tick. Two reasons, both addressed here:

1. **IDB id != DOM data-id** — IDB stores message id as the compound `_serialized` (`"false_<chat_jid>_<msgId>"`). DOM data-id is now bare msgId hex (e.g. `"AC2E44BDA…"`). The `by_msg_id` lookup in `emit_snapshot` first tries the full IDB id, then falls back to its trailing segment after the last underscore — that segment is the bare msgId for legacy compound ids and a no-op for already-bare ids, so both paths converge.
2. **DOM-only rows have no chatId** — the bare-msgId DOM rows do not carry chat context, so when the merge appends them as `bodySource=dom-only`, `chatId` was `Null`. `emit_grouped_whatsapp` rejects rows whose `chatId` is empty, so every DOM-only body got dropped on the floor.

Added an active-chat-jid resolution step ahead of the merge:

- `ScanSnapshot` gains an `active_chat_name` field, populated from the conversation header in `dom_snapshot::capture_messages`.
- `emit_snapshot` reverse-looks-up the name in `snap.chats` (which the IDB walk populates from chat / contact / group-metadata stores) with exact then case-insensitive then substring fallback. Substring-match only wins when there is exactly one candidate so we do not cross-attribute on common tokens.
- DOM-only appended rows now stamp the resolved jid into `chatId` when no DOM-side chatId exists.
- One `info!` log per scan tick records the resolution outcome so smoke can tell at a glance whether the lookup is finding a hit.

The plumbing is defensive — if the reverse-lookup returns `None` (un-saved 1:1 contact, group whose subject did not reach `chats`, broadcast list which we do not yet walk), DOM-only rows still flow through with `chatId=Null` and get dropped at `emit_grouped_whatsapp` exactly as before. No regression for chats whose IDB entry already had the right name.

Bug 2 verified at runtime; the chats-map reverse lookup is gated on a follow-up that closes the `chat_names` gaps in `idb.rs` (group-metadata id normalize, broadcast store walk, message-envelope pushName fallback for un-saved contacts).
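The msgId-tail fallback from point 1 of this commit message could look roughly like the following. This is a minimal sketch, assuming DOM bodies are held in a map keyed by `data-id`; `msg_id_tail` and `lookup_dom_body` are hypothetical names, not functions from the PR.

```rust
use std::collections::HashMap;

/// Trailing segment after the last underscore, or the whole id when
/// there is no underscore (already-bare ids pass through unchanged).
fn msg_id_tail(id: &str) -> &str {
    id.rsplit('_').next().unwrap_or(id)
}

/// Look up a DOM body for an IDB message id: try the full compound id
/// first, then fall back to its bare msgId tail.
fn lookup_dom_body<'a>(
    dom_by_id: &'a HashMap<String, String>,
    idb_id: &str,
) -> Option<&'a str> {
    dom_by_id
        .get(idb_id)
        .or_else(|| dom_by_id.get(msg_id_tail(idb_id)))
        .map(|body| body.as_str())
}
```

Trying the full id first keeps legacy rows matching exactly as before, while the tail lookup only kicks in when the DOM publishes the bare form.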

…bing state (tinyhumansai#1017)

Original audit doc only covered Bug 1 (the IDB indexName fix). After landing the DOM-side fixes (Bug 2 — bare-msgId data-id, conversation-header active chat extraction with icon-ligature skip) and the partial Bug 6 plumbing (active_chat_name reverse lookup, msgId-tail merge fallback, DOM-only chatId stamp), the matrix needed:

- A new Bug 2 row in the "Fixes shipped" table with root cause, fix shape, and runtime verification.
- A Bug 6 (partial) row that calls out the plumbing this PR landed and the IDB chat_names gaps that block end-to-end memory ingest.
- An updated "Out of scope" section that reorders the deferred items, replaces the now-shipped Bug 2 entry with the Bug 6+7 follow-up tracker, and points at the draft issue body in `.claude/scratch/`.
- A refreshed sign-off recording the runtime status + remaining action items.

The doc deliberately doesn't include real chat / contact / group names — those were redacted from the smoke transcript when drafting the follow-up issue and are treated the same way here.

coderabbitai · 2026-04-29T20:25:46Z

📝 Walkthrough


Updates WhatsApp scanner's DOM message capture to return active conversation display names, adds support for both legacy compound and new bare message ID formats in DOM parsing, removes empty indexName from IndexedDB requests, and refactors snapshot reconciliation to resolve active chat context for improved message correlation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| DOM Message Parsing & Active Chat `app/src-tauri/src/whatsapp_scanner/dom_snapshot.rs` | Enhanced `capture_messages` to return `Option<String>` for active conversation display name parsed from conversation-header DOM element. Extended message-row parsing to accept both legacy compound `data-id` format (fromMe_chatId_msgId) and new bare msgId format with strict validation. Improved body-text selection and added fallback span matching. Added unit tests for new msgId formats and validation edge cases. |
| IndexedDB Request Fixes `app/src-tauri/src/whatsapp_scanner/idb.rs`, `app/src-tauri/src/whatsapp_scanner/idb_tests.rs` | Removed empty-string `indexName` field from `IndexedDB.requestData` CDP payload (CEF 146 compatibility fix). Added test to enforce `indexName` absence in requests. |
| Snapshot Integration & Reconciliation `app/src-tauri/src/whatsapp_scanner/mod.rs` | Updated `ScanSnapshot` struct to include `active_chat_name` field. Implemented active-chat JID resolution with precedence rules (exact > case-insensitive > substring matching). Enhanced message body merging to extract bare msgId tails and support alternate lookup keys for DOM rows. Added `chatId` stamping for appended DOM rows using resolved `active_chat_jid`. |
| QA Documentation `docs/qa/WHATSAPP-PARITY.md` | Added comprehensive WhatsApp Webview parity audit documentation: verdict legend, acceptance-criteria table covering Auth, Messaging, Media, Calls, Notifications, IDB/DOM ingestion, Status, and Session persistence; documented shipped fixes and known gaps; included smoke-run procedure and tester sign-off. |

Sequence Diagram

sequenceDiagram
    participant DOM as DOM Snapshot
    participant Parser as Message Parser
    participant Snapshot as ScanSnapshot
    participant Resolver as Chat Resolver
    participant IDB as IDB Store
    participant Merger as Message Merger

    DOM->>Parser: capture_messages()
    Parser->>Parser: Parse message rows<br/>(bare msgId or compound)
    Parser->>Parser: Extract active_chat_name<br/>from header[data-testid]
    Parser-->>Snapshot: Return (messages, msgCount, active_chat_name)
    
    Snapshot->>Resolver: Resolve active_chat_jid<br/>from chats map
    Resolver->>Resolver: Match name via<br/>exact/case-insensitive/substring
    Resolver-->>Snapshot: active_chat_jid
    
    Snapshot->>IDB: read_store (no empty indexName)
    IDB-->>Snapshot: IDB records
    
    Snapshot->>Merger: Merge DOM + IDB messages
    Merger->>Merger: Extract bare msgId tail<br/>from DOM data-id
    Merger->>Merger: Lookup in IDB using<br/>alternate key
    Merger->>Merger: Stamp chatId using<br/>active_chat_jid
    Merger-->>Snapshot: Reconciled messages

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

[Feature] webview: WhatsApp — full end-to-end parity with native app #1017: The changes directly address WhatsApp scanner improvements including message ID format support, active chat context extraction, and IndexedDB request fixes that align with the parity feature objectives.

Suggested reviewers

graycyrus

Poem

🐰 A scanner hops through DOM with care,
Bare msgIds dancing in the air,
Active chats now wear their name,
IndexedDB fixed—no empty blame,
Messages merged with perfect aim! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main changes: IndexedDB walk fix, DOM scrape updates, and active-chat plumbing for WhatsApp Web memory ingest. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@app/src-tauri/src/whatsapp_scanner/dom_snapshot.rs`:
- Around line 274-290: The current heuristic in looks_like_icon_ligature is too
broad and rejects real chat titles; change it to only detect icons by explicit
markers (keep the existing checks for "wds-ic-" and "wds-icon" prefixes) and
remove the generic "single lowercase token" rule; instead, rely on DOM context
in parse_active_chat_name (e.g., check element.class names like
"material-icons", "wds-ic", "wds-icon", or parent wrapper classes/attributes
that indicate an icon) and only call looks_like_icon_ligature when those
explicit icon markers are present so normal lowercase titles (e.g., "mom",
"family") are not filtered out.

In `@app/src-tauri/src/whatsapp_scanner/mod.rs`:
- Around line 354-355: The fast-tick branch is discarding the active_chat_name
from the captured tuple (let (rows, hash, _active_chat_name) = captured?) so DOM
rows forwarded on the 2s path lack a chatId and are skipped by
emit_grouped_whatsapp(); update the handling where captured is unpacked to
preserve and apply active_chat_name (use the existing variable name
active_chat_name) — e.g., include active_chat_name when converting/packaging
dom_messages or attach it to each DomMessage JSON before forwarding to
emit_grouped_whatsapp() so fast ticks carry the same chat stamping as the
full-scan path.
- Around line 493-535: The active_chat_jid resolution fails because the code
assumes each chat value is an object with a "name" field (chat.get("name"))
while snap.chats currently stores jid → Value::String; update the lookup in the
active_chat_jid closure to handle both shapes by extracting the display name via
either chat.get("name").and_then(|v| v.as_str()) or chat.as_str() (falling back
from object → string), and apply the same dual-shape extraction wherever chat
names are checked later; alternatively, normalize snap.chats in scan_once() to
store jid → {"name": ...} so active_chat_jid and subsequent chat-name lookups
(symbols: active_chat_jid, snap.chats, scan_once) see consistent {"name": ...}
objects.

In `@docs/qa/WHATSAPP-PARITY.md`:
- Line 73: In the QA table row the inline selector `span[dir="ltr|rtl"]` is
being treated as a Markdown table delimiter; update that cell to escape the pipe
(e.g. change `span[dir="ltr|rtl"]` to `span[dir="ltr\|rtl"]`) so the row renders
with the correct number of columns—look for the exact string
`span[dir="ltr|rtl"]` in the row and replace it with the escaped version.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: af2e6ce6-c237-40bf-981b-f3582ed7de44

📥 Commits

Reviewing files that changed from the base of the PR and between 394bfcd and 6d64bd7.

📒 Files selected for processing (5)

app/src-tauri/src/whatsapp_scanner/dom_snapshot.rs
app/src-tauri/src/whatsapp_scanner/idb.rs
app/src-tauri/src/whatsapp_scanner/idb_tests.rs
app/src-tauri/src/whatsapp_scanner/mod.rs
docs/qa/WHATSAPP-PARITY.md

coderabbitai · 2026-04-29T20:31:03Z

+/// Returns true when `s` looks like a Material/WhatsApp icon ligature name
+/// (e.g. `wds-ic-search`, `wds-ic-disappearing-messages`, `material-icons`,
+/// `arrow_forward`). These appear as the first SPAN inside icon wrappers
+/// and would otherwise win the chat-title race in `parse_active_chat_name`.
+///
+/// Heuristic: starts with `wds-ic-` / `wds-icon` (WhatsApp Design System
+/// icon prefix), or is a single token with no whitespace whose chars are
+/// all `[a-z0-9_-]` (Material Icon ligature shape).
+fn looks_like_icon_ligature(s: &str) -> bool {
+    if s.starts_with("wds-ic-") || s.starts_with("wds-icon") {
+        return true;
+    }
+    !s.is_empty()
+        && !s.contains(char::is_whitespace)
+        && s.chars()
+            .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_' || c == '-')
+}


⚠️ Potential issue | 🟠 Major

The icon heuristic is broad enough to reject real chat titles.

looks_like_icon_ligature() currently treats any single lowercase token as an icon. That means perfectly valid titles like mom, family, anushka, or number-like handles will be skipped before parse_active_chat_name() can return them. Please key this off actual icon markers (wds-ic-, wds-icon, class/parent context, etc.) instead of generic lowercase text.
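One way to narrow the check along the lines suggested here is to key it on explicit markers only. The sketch below is an assumption-laden illustration: `is_icon_span` is a hypothetical helper, the class list passed in is assumed to come from the span or its wrapper, and the class names are the ones floated in this review rather than confirmed WhatsApp Web markup.

```rust
/// True only for text carrying an explicit WhatsApp Design System
/// icon prefix (the two prefixes the existing heuristic already checks).
fn has_icon_prefix(text: &str) -> bool {
    text.starts_with("wds-ic-") || text.starts_with("wds-icon")
}

/// True when any class on the span (or its wrapper) marks it as an icon.
/// The class names here are assumptions taken from the review comment.
fn has_icon_class(classes: &[&str]) -> bool {
    classes
        .iter()
        .any(|c| *c == "material-icons" || c.starts_with("wds-ic"))
}

/// Hypothetical replacement predicate: no generic "single lowercase
/// token" rule, so plain titles such as "mom" are never filtered out.
fn is_icon_span(text: &str, classes: &[&str]) -> bool {
    has_icon_prefix(text) || has_icon_class(classes)
}
```

With this shape a Material ligature such as `arrow_forward` is only skipped when its element carries an icon class, so the text alone can no longer disqualify a chat title.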

coderabbitai · 2026-04-29T20:31:03Z

+fn requestdata_params_omit_index_name() {
+    // Regression guard for Bug 1: passing `indexName: ""` to
+    // `IndexedDB.requestData` makes CEF 146 reject the call with
+    // "Could not get index". The field must be omitted entirely.
+    // Same constraint observed in slack_scanner/idb.rs:210-214 and
+    // telegram_scanner/idb.rs:210.
+    let params = json!({
+        "securityOrigin": "https://web.whatsapp.com",
+        "databaseName": "model-storage",
+        "objectStoreName": "message",
+        "skipCount": 0i64,
+        "pageSize": 500i64,
+    });
+    assert!(
+        params.get("indexName").is_none(),
+        "indexName must be omitted entirely - passing empty string is rejected by CEF 146 with 'Could not get index' (see slack_scanner/idb.rs:210-214)"
+    );


⚠️ Potential issue | 🟡 Minor

This regression test never exercises the production request builder.

Right now it only asserts that a hand-written json! literal omits indexName. If read_store() starts sending "indexName": "" again, this test will still stay green. Please build the params through the same helper/path that production uses so the regression is actually covered.
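A sketch of what covering the production path might look like: both the scanner and the test call one shared builder, so the assertion fails if the field ever comes back. This is std-only for self-containment; the real code builds a `serde_json` value, and `request_data_params` and `Param` are hypothetical names.

```rust
/// Minimal stand-in for a JSON value (the real payload is serde_json).
#[derive(Debug, PartialEq)]
enum Param {
    Str(String),
    Int(i64),
}

/// Hypothetical shared builder for `IndexedDB.requestData` params.
/// `indexName` is pushed only for a real, non-empty index name and is
/// otherwise omitted entirely rather than sent as "".
fn request_data_params(
    security_origin: &str,
    database_name: &str,
    object_store_name: &str,
    index_name: Option<&str>,
    skip_count: i64,
    page_size: i64,
) -> Vec<(&'static str, Param)> {
    let mut params = vec![
        ("securityOrigin", Param::Str(security_origin.to_string())),
        ("databaseName", Param::Str(database_name.to_string())),
        ("objectStoreName", Param::Str(object_store_name.to_string())),
        ("skipCount", Param::Int(skip_count)),
        ("pageSize", Param::Int(page_size)),
    ];
    if let Some(name) = index_name.filter(|n| !n.is_empty()) {
        params.push(("indexName", Param::Str(name.to_string())));
    }
    params
}

/// Regression-style check that goes through the builder, not a literal.
fn omits_index_name(params: &[(&'static str, Param)]) -> bool {
    params.iter().all(|(key, _)| *key != "indexName")
}
```

A test would then call `request_data_params(..., None, 0, 500)` and assert `omits_index_name(&params)`, which exercises the same code the scanner runs.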

coderabbitai · 2026-04-29T20:31:03Z

+    let (rows, hash, _active_chat_name) = captured?;
    let dom_messages: Vec<Value> = rows.iter().map(dom_snapshot::DomMessage::to_json).collect();


⚠️ Potential issue | 🟠 Major

Fast DOM ticks are still dropping the only chat-resolution signal.

Discarding active_chat_name here means the 2s path keeps forwarding bare-ID DOM rows without a chatId, and emit_grouped_whatsapp() will continue to skip them. If fast ticks are meant to stay ingest-capable/live, this result needs the same active-chat stamping as the full-scan path.
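The stamping this comment asks for might be sketched as below. It assumes the fast tick has some already-resolved JID available (for example one cached from the last full scan, which is an assumption and not something the PR states); `DomRow` and `stamp_active_chat` are hypothetical stand-ins for the real `DomMessage` handling.

```rust
/// Simplified stand-in for a scraped DOM message row.
#[derive(Debug, Clone, PartialEq)]
struct DomRow {
    msg_id: String,
    chat_id: Option<String>,
    body: String,
}

/// Fill in `chat_id` on rows that lack one, leaving any DOM-provided
/// chat id untouched. Returns how many rows were stamped.
fn stamp_active_chat(rows: &mut [DomRow], active_chat_jid: Option<&str>) -> usize {
    let mut stamped = 0;
    if let Some(jid) = active_chat_jid {
        for row in rows.iter_mut().filter(|r| r.chat_id.is_none()) {
            row.chat_id = Some(jid.to_string());
            stamped += 1;
        }
    }
    stamped
}
```

When no JID is known the rows pass through unchanged, which matches the defensive behaviour the PR describes for the full-scan path.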

coderabbitai · 2026-04-29T20:31:03Z

+    // Resolve the active chat's JID from its display name (parsed from the
+    // conversation header). Modern WhatsApp Web doesn't put the chat JID
+    // anywhere on individual message rows or in the URL, so this is the
+    // only signal we have. The IDB-side `chats` map has `name → jid` (we
+    // store it as `jid → {name, …}`, so iterate). Match prefers exact
+    // case-sensitive equality and falls back to case-insensitive; ignore
+    // ambiguous matches (multiple chats with the same display name) so we
+    // don't mis-attribute messages.
+    let active_chat_jid: Option<String> = snap.active_chat_name.as_deref().and_then(|name| {
+        let name_lc = name.to_ascii_lowercase();
+        let mut exact: Vec<&str> = Vec::new();
+        let mut ci: Vec<&str> = Vec::new();
+        let mut substring: Vec<&str> = Vec::new();
+        for (jid, chat) in snap.chats.iter() {
+            let chat_name = chat.get("name").and_then(|v| v.as_str()).unwrap_or("");
+            if chat_name == name {
+                exact.push(jid);
+            } else if !chat_name.is_empty() && chat_name.to_ascii_lowercase() == name_lc {
+                ci.push(jid);
+            } else if !chat_name.is_empty()
+                && (chat_name.to_ascii_lowercase().contains(&name_lc)
+                    || name_lc.contains(&chat_name.to_ascii_lowercase()))
+            {
+                substring.push(jid);
+            }
+        }
+        // Prefer exact > case-insensitive > substring. Substring only wins
+        // when there's exactly one candidate (avoids cross-attribution when
+        // many chats share a token like a common first name).
+        match (exact.len(), ci.len(), substring.len()) {
+            (1, _, _) => Some(exact[0].to_string()),
+            (0, 1, _) => Some(ci[0].to_string()),
+            (0, 0, 1) => Some(substring[0].to_string()),
+            _ => None,
+        }
+    });
+    log::info!(
+        "[wa][{}] active chat resolution: name={:?} → jid={:?} chats_in_map={}",
+        account_id,
+        snap.active_chat_name,
+        active_chat_jid,
+        snap.chats.len()
+    );


⚠️ Potential issue | 🟠 Major

active_chat_jid never resolves with the current snap.chats representation.

This block assumes each chat entry is an object with a name field, but scan_once() still stores snap.chats as jid -> Value::String(display_name). On a string value, chat.get("name") is always None, so the resolver never finds a candidate and the later DOM-only chatId stamp still collapses to null. Please either normalize snap.chats to {"name": ...} at ingestion time or read chat.as_str() here (and in the later chat-name lookups).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@app/src-tauri/src/whatsapp_scanner/mod.rs` around lines 493 - 535, The active_chat_jid resolution fails because the code assumes each chat value is an object with a "name" field (chat.get("name")) while snap.chats currently stores jid → Value::String; update the lookup in the active_chat_jid closure to handle both shapes by extracting the display name via either chat.get("name").and_then(|v| v.as_str()) or chat.as_str() (falling back from object → string), and apply the same dual-shape extraction wherever chat names are checked later; alternatively, normalize snap.chats in scan_once() to store jid → {"name": ...} so active_chat_jid and subsequent chat-name lookups (symbols: active_chat_jid, snap.chats, scan_once) see consistent {"name": ...} objects.
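
The dual-shape lookup suggested above could look roughly like the following. This is a std-only sketch: `ChatEntry` is a hypothetical enum standing in for the two JSON shapes (bare string versus object with a `name` field), `resolve_jid` is an illustrative name, and the substring tier from the diff is left out for brevity.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the two shapes a chat entry may take:
/// a bare display-name string, or an object carrying a `name` field.
pub enum ChatEntry {
    Name(String),
    Object { name: Option<String> },
}

impl ChatEntry {
    /// Return the display name regardless of which shape was stored.
    pub fn display_name(&self) -> Option<&str> {
        match self {
            ChatEntry::Name(n) => Some(n.as_str()),
            ChatEntry::Object { name } => name.as_deref(),
        }
    }
}

/// Resolve a display name to a single JID, preferring an exact match
/// over a case-insensitive one and returning `None` when ambiguous.
pub fn resolve_jid(chats: &HashMap<String, ChatEntry>, active: &str) -> Option<String> {
    let active_lc = active.to_ascii_lowercase();
    let mut exact: Vec<&str> = Vec::new();
    let mut ci: Vec<&str> = Vec::new();
    for (jid, entry) in chats {
        // Skip entries with no usable name in either shape.
        let Some(name) = entry.display_name().filter(|n| !n.is_empty()) else {
            continue;
        };
        if name == active {
            exact.push(jid);
        } else if name.to_ascii_lowercase() == active_lc {
            ci.push(jid);
        }
    }
    match (exact.len(), ci.len()) {
        (1, _) => Some(exact[0].to_string()),
        (0, 1) => Some(ci[0].to_string()),
        _ => None,
    }
}
```

Normalizing at ingestion time instead would remove the need for the enum entirely; the accessor approach is shown only because it keeps both shapes working while the storage format is in flux.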

coderabbitai · 2026-04-29T20:31:03Z

+| Bug | Root cause | Fix | File:line | Verified |
+|-----|-----------|-----|-----------|----------|
+| 1 | `whatsapp_scanner/idb.rs:159` sent `"indexName": ""` to CDP `IndexedDB.requestData`. CEF 146 backend rejects empty-string with `{"code":-32000,"message":"Could not get index"}`. CDP spec says empty string == primary-key index, but the C++ backend requires the field UNSET (omitted entirely). All 4 IDB stores (`message`, `chat`, `contact`, `group-metadata`) failed; scanner emitted zero memory docs; `whatsapp-web:<acct>` namespace stayed empty. | Drop the `"indexName": ""` line from the JSON params. Add a comment block mirroring the working pattern already documented in `slack_scanner/idb.rs:210-214` and `telegram_scanner/idb.rs:210` (both have explicit notes). Slack + Telegram had this fix already; only WhatsApp regressed. | `app/src-tauri/src/whatsapp_scanner/idb.rs:152-167` (1-line deletion + 6-line comment) | ✅ runtime: post-fix log shows `[wa][<acct>] full scan ok messages=20000 chats=2249` (was `0/0` pre-fix). Plus `cargo test --lib whatsapp_scanner` 20/20 (incl. new `requestdata_params_omit_index_name` regression test). |
+| 2 | Once Bug 1 unblocked the IDB walk, `dom_snapshot::parse_rows` still returned 0 — three drift points in WhatsApp Web's HTML had landed since the parser was last touched. (a) `data-id` is no longer `<fromMe>_<chatId>_<msgId>` — it's bare msgId hex (e.g. `AC2E44BDA…`, 32 hex chars). (b) `span.selectable-text` class gone; bodies live in plain `span[dir="ltr|rtl"]` (existing fallback already covers this — only the doc was stale). (c) Active chat JID is no longer in URL, on `data-id`, or on any DOM attribute we could find — only the conversation header carries it. | (a) `split_data_id` accepts both legacy compound and bare-msgId-hex shapes. Bare format returns `(false, "", msg_id)` and the merge in `mod.rs` recovers the missing fields by msgId-tail and active-chat reverse-lookup. (b) Module-level doc comment refreshed to mention both `selectable-text` and `dir` matchers. (c) New `parse_active_chat_name` walks `header[data-testid="conversation-header"]` for the first non-icon `<span>`'s text, skipping `wds-icon` / Material-style ligatures so the chat title wins. | `app/src-tauri/src/whatsapp_scanner/dom_snapshot.rs` (split_data_id rewrite + parse_active_chat_name + looks_like_icon_ligature + 4 new tests) | ✅ runtime: post-fix log shows `[wa][<acct>] full scan ok … dom=N` with N>0 (was `dom=0` pre-fix); active chat name extracted cleanly for 1:1 (`"Anushka"`-shape), group (`"<group title>"`-shape), broadcast (`"<broadcast title>"`-shape). |


⚠️ Potential issue | 🟡 Minor

Escape the | inside the Bug 2 table row.

span[dir="ltr|rtl"] is being parsed as an extra table delimiter here, so the row renders with too many cells and the trailing columns get misaligned/truncated. Escaping that pipe (for example ltr\|rtl) will keep the QA matrix intact.

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 73-73: Table column count
Expected: 5; Actual: 6; Too many cells, extra data will be missing

(MD056, table-column-count)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/qa/WHATSAPP-PARITY.md` at line 73, In the QA table row the inline selector `span[dir="ltr|rtl"]` is being treated as a Markdown table delimiter; update that cell to escape the pipe (e.g. change `span[dir="ltr|rtl"]` to `span[dir="ltr\|rtl"]`) so the row renders with the correct number of columns—look for the exact string `span[dir="ltr|rtl"]` in the row and replace it with the escaped version.
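
If table rows like this one are ever generated rather than hand-written, the escaping could be applied mechanically. The helper below is an illustrative sketch only; `escape_table_cell` is a hypothetical name and nothing in the PR suggests such a helper exists.

```rust
/// Escape every unescaped `|` in a Markdown table cell so it is not
/// read as a column delimiter. Pipes already preceded by a single
/// backslash are left alone.
pub fn escape_table_cell(cell: &str) -> String {
    let mut out = String::with_capacity(cell.len() + 4);
    let mut prev_backslash = false;
    for ch in cell.chars() {
        if ch == '|' && !prev_backslash {
            out.push('\\');
        }
        out.push(ch);
        // A doubled backslash is a literal backslash, so it does not
        // count as escaping the character that follows it.
        prev_backslash = ch == '\\' && !prev_backslash;
    }
    out
}
```

Applied to the selector from the Bug 2 row, this turns `span[dir="ltr|rtl"]` into `span[dir="ltr\|rtl"]`, which is the replacement the comment proposes.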

oxoxDev added 6 commits April 30, 2026 00:37

oxoxDev requested a review from a team April 29, 2026 20:25

coderabbitai Bot reviewed Apr 29, 2026

senamakel merged commit 975614d into tinyhumansai:main Apr 29, 2026
15 checks passed

This was referenced May 6, 2026

feat(channels): expose WhatsApp Web data to agent via structured RPC API #1308

Merged

fix(channels): fix WhatsApp structured ingest pipeline + Memory page sync status #1326

Merged

AusAgentSmith pushed a commit to AusAgentSmith/openhuman that referenced this pull request May 23, 2026

fix(webview/whatsapp): IDB walk + DOM scrape + active-chat plumbing (tinyhumansai#1017) (tinyhumansai#1034)

f015b8b

senamakel merged 6 commits into tinyhumansai:main from oxoxDev:feat/1017-whatsapp-parity-audit

		let (rows, hash, _active_chat_name) = captured?;
		let dom_messages: Vec<Value> = rows.iter().map(dom_snapshot::DomMessage::to_json).collect();
