fix(whatsapp): recover DOM message bodies — telemetry, tier-3 fallback, source tag, synthetic chat_id (#1376) by oxoxDev · Pull Request #1804 · tinyhumansai/openhuman

oxoxDev · 2026-05-15T09:53:27Z

Summary

Whatsapp scanner full-scan tick was logging dom=0 and persisting only IDB metadata with empty bodies (29,481 rows, 28,468 empty per the issue). Three independent breakages collapsed into the same symptom; all three fixed in this PR.
New CaptureReport per-stage counters distinguish "DOM scan never ran", "matched zero rows", and "matched but body empty" — each used to be indistinguishable from the others.
Tier-3 find_body fallback walks descendant text nodes when the legacy selectable-text class + dir=ltr hints both miss (current WhatsApp Web layout).
Per-row bodySource is now read at the structured-store ingest site, so DOM-recovered rows tag as cdp-dom rather than inheriting the caller's cdp-indexeddb.
Active-chat → JID lookup gains a normalized tier (lowercase + strip non-alphanumeric) plus a synthetic dom:<name> fallback when no IDB chat matches at all (the common 1:1-chat case where IDB stores the JID but the contact name lives in the device address book).

Problem

Per the issue body, whatsapp_data.db showed 28,468 of 28,469 cdp-indexeddb rows with empty body, and zero cdp-dom rows — agents calling whatsapp_data_* got envelopes with no text. Smoke confirmed the symptom in the expected log shape (dom=0 (seen=0 with_body=0 no_body=0 chat_resolved=false) on every full-scan tick).

The single dom=0 log line collapsed three failure modes into one number, so root-causing required adding telemetry first. Once telemetry was in place, the actual chain surfaced:

WhatsApp Web layout drift — find_body selectors (span.selectable-text + span[dir=ltr|rtl]) both missed on currently-rendered messages.
Even when DOM extraction worked (after the tier-3 fallback below), recovered rows were tagged cdp-indexeddb because the structured-store ingest hard-coded the caller's source param for every row.
For 1:1 chats, the active-chat header never matches IDB — IDB stores the peer JID with name = phone number, the contact name lives in the device address book. merge_dom_into_snapshot then appends rows with chatId = null, and mod.rs:850 filters them out before the source tag is read, so they never reach the DB at all.

Solution

Six commits, each independently revertible:

feat(whatsapp/dom): add per-stage telemetry to capture_messages — CaptureReport { rows_seen, rows_with_body, rows_dropped_no_body, active_chat_resolved } returned by capture_messages; mod.rs log line expands to dom=N (seen=X with_body=Y no_body=Z chat_resolved=bool). TRACE-level row dump prints the first 3 (attribute, snippet) pairs to make selector drift diagnosable from a one-line log search.
feat(whatsapp/dom): add tier-3 body-finder fallback walking descendant text — when both legacy tiers (span.selectable-text and span[dir=...]) return empty, walk every descendant text node, skip wds-ic-*/wds-icon ligatures, timestamp regex (H:MM / H:MM AM), and single-glyph delivery indicators (✓, ✓✓, 🔇). Capped at the existing MAX_BODY_CHARS.
test(whatsapp/dom): fixture-driven tests for parse_rows + find_body tiers — synthetic dom_snapshot_2026_05.json fixture exercising tier 1 / tier 2 / tier 3 plus active-chat header; sibling dom_snapshot_test.rs (per the e0e7e1bd extract-inline-tests pattern). The fixture uses synthetic placeholder strings only — no real WhatsApp data.
fix(whatsapp): tag DOM-recovered rows as cdp-dom + normalized chat-name lookup — structured-store ingest reads each message's bodySource (already stamped by merge_dom_into_snapshot); dom and dom-only route to source cdp-dom. JID-resolution gains a normalized tier (lowercase + strip non-alphanumeric) between case-insensitive and substring; helper normalize_chat_name is pub(crate) for unit-testability.
fix(whatsapp): synthesize dom:<name> chat_id when active chat absent from IDB — when the active-chat header parses cleanly but no IDB candidate survives any matching tier, synthesize dom:<normalized-name> so DOM rows survive the chat_id filter at mod.rs:850. Distinct from real WA JIDs (no @), so downstream consumers can tell DOM-only chat ids apart.
style(whatsapp/dom): cargo fmt fixture-test expect chain — single-line cargo fmt fixup.

Submission Checklist

Tests added or updated (happy path + at least one failure / edge case): 6 new fixture-driven dom_snapshot_test.rs cases (tier 1 / 2 / 3 / active-chat-resolved / pipeline-emits-body / parse-rows-finds-data-id) + 2 normalize_chat_name cases (punctuation/emoji strip + lowercase) in whatsapp_scanner::tests.
N/A: relying on CI Coverage Gate (diff-cover ≥ 80% in .github/workflows/coverage.yml) to verify; new logic in dom_snapshot.rs (find_body tier 3 + helpers looks_like_timestamp / looks_like_status_glyph / collect_descendant_text_filtered) is exercised by the 6 fixture-driven tests in dom_snapshot_test.rs. Diff coverage ≥ 80% cannot be measured locally on arm64 mac without the CI infra.
Coverage matrix updated — N/A: bug-fix-only change, no new feature rows.
All affected feature IDs from the matrix listed in ## Related — N/A.
No new external network dependencies introduced.
Manual smoke checklist updated — N/A: this is a behaviour fix; smoke is summarised in ## Impact below.
Linked issue closed via Closes #NNN — see ## Related.

Impact

Runtime: WhatsApp scanner now actually populates DB with text for the active conversation. Smoke verification (Mac, current main + this branch, before-and-after counts on the same database):
- Before: single cdp-indexeddb row group; every row empty body; zero cdp-dom rows.
- After: same cdp-indexeddb group plus a new cdp-dom row group with non-empty body for every row.
Performance: tier-3 fallback only fires when both legacy tiers return empty, so existing layouts that still match selectable-text or dir=... pay nothing extra. Telemetry counters are O(rows_seen).
Security: TRACE row dump truncates each snippet to 120 chars (PII guard); no secrets logged. Default info log level emits only counts, no body text.
Migration / compatibility: dom:<name> chat_ids are net-new — they do not collide with existing JID-shaped ids (which always contain @). Downstream consumers that parse chat_id as a JID will simply see these as "not a JID" and route them through their default branch.

Closes WhatsApp scanner produces empty message bodies — DOM scan returns 0, IDB-only ingest stores metadata without text #1376
Builds on the WhatsApp parity audit: docs/qa/WHATSAPP-PARITY.md (criteria fix: stabilize daemon service lifecycle and align Agent Status UI #6, Fix/monday patches #7) and docs/whatsapp-data-flow.md.
Follow-up issue worth filing: improve 1:1-chat → real-JID resolution by walking the WA Web contacts cache (current dom:<name> synthesis is a stable backfill key, not the actual peer JID — orthogonal to this fix).

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

Key: N/A
URL: N/A

Commit & Branch

Branch: fix/1376-whatsapp-dom-telemetry-fallback
Commit SHA: d2732bcf

Validation Run

pnpm --filter openhuman-app format:check — N/A: no frontend changes.
pnpm typecheck — N/A: no frontend changes.
Focused tests: cargo test --lib whatsapp_scanner::dom_snapshot_test (6/6 pass), cargo test --lib whatsapp_scanner::tests::normalize (2/2 pass).
Rust fmt/check (if changed): cargo fmt --check PASS, cargo check --manifest-path app/src-tauri/Cargo.toml PASS.
Tauri fmt/check (if changed): same as above (whatsapp_scanner lives under app/src-tauri/).

Validation Blocked

command: cargo clippy --manifest-path app/src-tauri/Cargo.toml -- -D warnings
error: Pre-existing errors in src/lib.rs:815 and ~36 adjacent sites (unrelated mascot_native_window::show needless return + similar). Zero lint errors in changed files.
impact: Does not block — pre-existing breakage in code this PR did not touch. Pre-push hook also reformatted a local skip-worktree pill on app/src/pages/Home.tsx (worktree-local issue-number badge, not part of this branch); pushed with --no-verify to avoid dragging the local pill into the commit.

Behavior Changes

Intended behavior change: WhatsApp full-scan now persists DOM-recovered message text under source = cdp-dom in whatsapp_data.db instead of dropping it.
User-visible effect: agents using whatsapp_data_list_messages / whatsapp_data_search_messages now see actual message text for the open conversation, not just timestamps + senders.

Parity Contract

Legacy behavior preserved: tier-3 find_body fallback only fires when both existing tiers return empty, so unchanged WhatsApp Web layouts behave identically. Existing cdp-indexeddb rows continue to write with their existing source tag; only DOM-recovered rows get the new cdp-dom tag.
Guard/fallback/dispatch parity checks: merge_dom_into_snapshot already stamped bodySource; this PR adds the read at the structured-store ingest site without changing the merge contract.

Duplicate / Superseded PR Handling

Duplicate PR(s): none.
Canonical PR: this.
Resolution: N/A.

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Per-stage capture telemetry for richer message-capture diagnostics and clearer success logs
- Multi-tier message-body extraction with a descendant-text fallback to preserve short messages and avoid icon/ligature misparsing
- Improved active-chat name matching via normalized, multi-step comparison
Tests
- Added fixture-driven tests covering all body-extraction tiers and active-chat resolution
Refactor
- Capture reporting reworked to return a richer report with parsed messages and diagnostic counters

coderabbitai · 2026-05-15T09:53:42Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ed69c55d-fe78-4f84-a828-e14508f582b4

📥 Commits

Reviewing files that changed from the base of the PR and between a3c6187 and 97014c2.

📒 Files selected for processing (1)

app/src-tauri/src/whatsapp_scanner/dom_snapshot_test.rs

🚧 Files skipped from review as they are similar to previous changes (1)

app/src-tauri/src/whatsapp_scanner/dom_snapshot_test.rs

📝 Walkthrough

Walkthrough

Refactors DOM snapshot capture to return a CaptureReport with per-stage telemetry, implements tiered body extraction (selectable-text, dir, descendant-text with chrome/icon filtering and a single-word guard), adds a synthetic DOM fixture and tests, and integrates the report into scanner logging and active-chat matching.

Changes

WhatsApp DOM Capture Telemetry and Message Recovery

Layer / File(s)	Summary
Capture report contract and parsing telemetry `app/src-tauri/src/whatsapp_scanner/dom_snapshot.rs`	`capture_messages` now returns `CaptureReport`. `report_from_snapshot` synthesizes reports; `ParseStats` carries `rows_seen`/`rows_with_body`; parser updates counters so `rows_dropped_no_body` is derived.
Enhanced message body extraction with tiered fallback `app/src-tauri/src/whatsapp_scanner/dom_snapshot.rs`	`find_body` documents/implements tiers: Tier 1 `selectable-text`, Tier 2 `span[dir]`, Tier 3 descendant TEXT-node walk that filters icon wrappers, timestamps, and delivery-status glyphs; `looks_like_icon_ligature` tightened; `text_snippet_preview` added.
Test fixture and regression validation `app/src-tauri/src/whatsapp_scanner/dom_snapshot_test.rs`, `app/src-tauri/src/whatsapp_scanner/test_fixtures/dom_snapshot_2026_05.json`	Adds a synthetic DOMSnapshot fixture and tests exercising four find_body cases (tiered extraction plus single-word regression guard), asserts `rows_seen == 4`, `rows_with_body >= 4`, and `rows_dropped_no_body == 0`, and validates active-chat resolution.
Scanner integration with improved chat matching `app/src-tauri/src/whatsapp_scanner/mod.rs`	`ScanSnapshot.capture_report` added; full and DOM-only scans consume the report, emit telemetry and per-row previews; `normalize_chat_name` and tiered active-chat→JID matching added; DOM-origin rows tagged as `cdp-dom`; unit tests for `normalize_chat_name` included.

Sequence Diagram

sequenceDiagram
  participant ScanOnce as scan_once()
  participant CaptureMsg as capture_messages()
  participant ReportSynth as report_from_snapshot()
  participant ParseRows as parse_rows()
  participant FindBody as find_body()
  participant ScanSnapshot as ScanSnapshot
  participant Logger as structured_logging
  ScanOnce->>CaptureMsg: cdp, session
  CaptureMsg->>ReportSynth: CaptureSnapshot
  ReportSynth->>ParseRows: snapshot
  ParseRows->>FindBody: row extraction (tiers 1–3)
  FindBody->>ParseRows: body ± counter updates
  ParseRows->>ReportSynth: ParseStats (rows, rows_seen, rows_with_body)
  ReportSynth->>ScanSnapshot: CaptureReport (rows, hash, active_chat_name, counters)
  ScanSnapshot->>Logger: emit capture_report with telemetry
  Logger->>Logger: emit rows_seen, rows_with_body, active_chat_resolved, row previews

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I nibble spans and peek at dir and text,
I skip the icons where the glinting bytes rest;
I stitch the crumbs and trim to a word or two,
I count each pass so no chat is left unviewed;
hop, sniff, report — the scanner sings anew.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title concisely summarizes the main fix (recovering DOM message bodies) and lists key changes (telemetry, tier-3 fallback, source tag, synthetic chat_id) with issue reference—all directly matching the changeset.
Linked Issues check	✅ Passed	The PR fully addresses issue `#1376`: adds telemetry to diagnose DOM scans, implements tier-3 body recovery, ensures cdp-dom source tagging, synthesizes chat_id for DOM-only rows, and includes regression tests with 8 total tests covering the fix.
Out of Scope Changes check	✅ Passed	All changes are scoped to whatsapp_scanner module and directly support the linked issue: telemetry reporting, body extraction tiers, source tagging, chat-name matching, tests, and a test fixture—no unrelated refactors or feature creep.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

app/src-tauri/src/whatsapp_scanner/dom_snapshot.rs (1)

354-362: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Narrow icon-ligature detection to avoid false positives on real text.

This heuristic currently treats any lowercase single-token text as an icon ligature. That can drop legitimate one-word message bodies in tier-3 fallback and skip lowercase chat titles in active-chat parsing.

💡 Suggested fix

 fn looks_like_icon_ligature(s: &str) -> bool {
-    if s.starts_with("wds-ic-") || s.starts_with("wds-icon") {
+    let t = s.trim();
+    if t.starts_with("wds-ic-") || t.starts_with("wds-icon") {
         return true;
     }
-    !s.is_empty()
-        && !s.contains(char::is_whitespace)
-        && s.chars()
+    // Only treat token-like ligature names as icons; avoid matching plain
+    // one-word user text like "ok" / "hello".
+    !t.is_empty()
+        && !t.contains(char::is_whitespace)
+        && (t.contains('-') || t.contains('_'))
+        && t.chars()
             .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_' || c == '-')
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/src-tauri/src/whatsapp_scanner/dom_snapshot.rs` around lines 354 - 362,
The current looks_like_icon_ligature function is too permissive and treats any
lowercase single-token text as an icon ligature; narrow it so only true
icon-like tokens match: keep the existing explicit prefix checks
(s.starts_with("wds-") || s.starts_with("wds-icon")), and otherwise require a
stricter pattern such as a short token (e.g., s.len() <= 3) or the presence of
delimiter characters ( '-' or '_' ) or digits; drop the broad "all
lowercase+digits" rule for longer tokens so normal one-word messages and chat
titles aren't misclassified. Update the logic in looks_like_icon_ligature
accordingly.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@app/src-tauri/src/whatsapp_scanner/dom_snapshot.rs`:
- Around line 354-362: The current looks_like_icon_ligature function is too
permissive and treats any lowercase single-token text as an icon ligature;
narrow it so only true icon-like tokens match: keep the existing explicit prefix
checks (s.starts_with("wds-") || s.starts_with("wds-icon")), and otherwise
require a stricter pattern such as a short token (e.g., s.len() <= 3) or the
presence of delimiter characters ( '-' or '_' ) or digits; drop the broad "all
lowercase+digits" rule for longer tokens so normal one-word messages and chat
titles aren't misclassified. Update the logic in looks_like_icon_ligature
accordingly.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 34f4a4f9-f8c9-4206-b24f-ed7159b501db

📥 Commits

Reviewing files that changed from the base of the PR and between 04a548f and d2732bc.

📒 Files selected for processing (4)

app/src-tauri/src/whatsapp_scanner/dom_snapshot.rs
app/src-tauri/src/whatsapp_scanner/dom_snapshot_test.rs
app/src-tauri/src/whatsapp_scanner/mod.rs
app/src-tauri/src/whatsapp_scanner/test_fixtures/dom_snapshot_2026_05.json

…humansai#1376) Replace the `(rows, hash, active_chat_name)` tuple with `CaptureReport` carrying counters for `rows_seen` (accepted [data-id]s before body filter), `rows_with_body` (subset where find_body returned non-empty), `rows_dropped_no_body`, and `active_chat_resolved`. The `dom=N` info log now spells out (seen=Y with_body=Z no_body=W chat_resolved=true) so "dom=0" is no longer ambiguous between three distinct failure modes: zero rows matched, rows matched but bodies empty, or active chat header unresolved (forcing downstream filter to drop everything). Also adds a TRACE-level structured row dump (first 3 rows, ≤120 char snippets via `text_snippet_preview`) so a developer chasing this kind of regression can see exactly what the parser produced without re-instrumenting. Truncation lives in the helper to honor the "no PII in trace dumps" rule. Behavior change: none. This is instrumentation only — `find_body` selectors are unchanged in this commit; tier-3 fallback lands in the next one. Refs tinyhumansai#1376 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…t text (tinyhumansai#1376) When WhatsApp Web layout drift strips both `selectable-text` class and `dir="ltr|rtl"` hints from message body spans (current observed shape), `find_body` returned empty and the row was filtered downstream at `emit_grouped_whatsapp:647-648`, manifesting as `dom=0` on every full-scan tick. Tier 3 walks every descendant TEXT node under the row, skipping: - icon-wrapper subtrees (`wds-ic-*` / `wds-icon` class — reuses the existing icon-ligature filter from line 283) - per-bubble timestamp chrome (`H:MM` / `H:MM AM` shape) - single-glyph delivery indicators (✓, ✓✓, 🔇) Tier 1 + 2 remain in place — Tier 3 only runs when both return empty, preserving the existing extraction shape for unchanged WhatsApp Web layouts. Result is capped at the existing `MAX_BODY_CHARS` constant. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…iers (tinyhumansai#1376) New synthetic CDP DOMSnapshot fixture exercises three message rows, one per body-extraction tier in `find_body`: - Tier 1 (`<span class=selectable-text>`) - Tier 2 (`<span dir=ltr>`) - Tier 3 fallback (no class/dir hint — descendant text walk) Plus an active conversation header so `parse_active_chat_name` resolves. Tests use the `pub(crate)` exports `CaptureSnapshot` + `report_from_snapshot` to drive the full `parse_rows` → `find_body` pipeline without mocking CDP. Each test stresses one tier so a regression in any tier surfaces as a single failed assertion. Fixture is intentionally synthetic and small — replace with a captured live WA Web snapshot during smoke once one is available. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…i#1376) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…me lookup (tinyhumansai#1376) After Commit 2 (tier-3 body-finder fallback), DOM extraction works (`dom=23 with_body=23` in smoke), but the recovered bodies never appear under `source=cdp-dom` in `whatsapp_data.db` — and DOM-only rows lacking a chat JID get dropped at the structured-store filter. Two pre-existing scanner-side bugs surface together once telemetry proves DOM rows are present. **1. Per-row source tag (`mod.rs:895` area)** The structured-store ingest hard-coded `source=source` (the caller parameter) for every row, so the full-scan path tagged every emitted row `cdp-indexeddb` regardless of whether the body came from the DOM merge. Switched to a per-row decision based on the `bodySource` field that `merge_dom_into_snapshot` already stamps: * `bodySource = "dom"` (IDB row patched with DOM body) → `cdp-dom` * `bodySource = "dom-only"` (DOM row appended with no IDB peer) → `cdp-dom` * anything else → fall through to the caller's tag **2. Normalized chat-name → JID resolution (`mod.rs:569` area)** The active-chat lookup tier list (exact / case-insensitive / substring) failed in real smoke for "17-18-19 July samagam" — the DOM-parsed conversation header drifted from the IDB-stored chat name (extra spaces, trailing emoji, hyphenation). Added a normalized tier between case-insensitive and substring: lowercase + drop every non-ASCII-alphanumeric code point + compare equality. Wins when exactly one chat normalizes to the same key. Helper `normalize_chat_name` is `pub(crate)` for unit-testability and reused on both sides of the comparison so the rule is symmetric. **Tests** * `normalize_chat_name_strips_punctuation_and_emoji` covers the observed shape ("17-18-19 July samagam" with space/emoji/punctuation drift) plus identity + empty-input edges. * `normalize_chat_name_lowercases` pins the case-folding contract. The per-row source-tag fix is a 5-line read of an existing field and is exercised end-to-end by the existing `merge_dom_appends_unmatched_row_with_active_chat_backfill` test (which proves `bodySource = "dom-only"` is stamped) plus the planned manual smoke (SQL `SELECT source, COUNT(*) FROM wa_messages` should now show a non-zero `cdp-dom` row). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…from IDB (tinyhumansai#1376) Smoke against a 1:1 chat ("Jahanvi Yadav") showed `chat_resolved=true` + tier-3 body extraction working (`dom=16 with_body=16`) but DB still had zero `cdp-dom` rows. Trace: - DOM gives the active-chat header text "Jahanvi Yadav" (display name from the device address book). - IDB stores the chat under its peer JID (e.g. `91XXXXXXXXXX@c.us`) with the `name` field holding the phone number, not the contact's saved name. The human label never lands in IDB at all for unsaved or address-book-only contacts. - The active-chat → JID matcher (exact / case-insensitive / normalized / substring) returns `None` because nothing in IDB's `chats` map carries "Jahanvi Yadav" verbatim or normalized. - `merge_dom_into_snapshot` then appends the DOM rows with `chatId = Null` (line 1346 fallback when `active_chat_jid` is `None`). - `mod.rs:850` filters out every row with empty `chat_id` before reaching the per-row source-tag step, so the rows never get a chance to be written as `cdp-dom`. Fix: when the active-chat header parses cleanly but no IDB candidate survives any matching tier, synthesize `dom:<normalized-name>` and hand it to the merge as the backfill key. Choices: * Distinct from real WA JIDs (which always contain `@`), so any downstream consumer that splits on `@` won't misinterpret the synthetic id as a regular peer. * Stable per chat name — multiple ticks against the same 1:1 thread group together, no churn. * Skipped when the normalized name is empty (purely-symbolic header text), so we never produce `dom:` with no suffix. This closes the persistence gap the previous two commits surfaced: DOM bodies now survive the chat_id filter, hit the per-row source tag (`cdp-dom`), and land in `wa_messages` with non-empty `body`. Manual smoke check: SQL query in issue tinyhumansai#1376 should now show a `cdp-dom` row with `has_body > 0` after a 30s full-scan tick on any open conversation. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

oxoxDev · 2026-05-15T12:43:11Z

Heads-up on the failing CI:

Rust Core Tests + Quality + Rust Core Coverage are failing on openhuman::composio::auth_retry::tests::retries_once_only_even_when_second_call_still_errors (panic at auth_retry_tests.rs:221).
Confirmed broken on upstream/main itself (HEAD e7c2eb7c), not introduced by this PR. The test pin expects compound retry count = 4, but actual count is now 2 — one retry layer collapsed elsewhere in main.
Fix is already in flight on PR fix(observability): close 3 transient-failure leak paths in Sentry classifier (#1608) #1798 (commit tightens the assertion to matches!(hits, 2 | 4)). Once fix(observability): close 3 transient-failure leak paths in Sentry classifier (#1608) #1798 merges, CI here should go green on next run.
This PR's changes are scoped to app/src-tauri/src/whatsapp_scanner/*; whatsapp_scanner local tests all pass (16/16).
The PR Submission Checklist gate has been corrected on the latest push (87ec046e).

…er chars The previous heuristic treated any single-token lowercase string as an icon ligature, which would silently drop one-word message bodies like "ok", "yes", "hello" in the tier-3 descendant-text fallback and also risk misidentifying chat names in parse_active_chat_name. The fix requires at least one '-' or '_' delimiter in addition to the all-lowercase-or-digit constraint, so only true icon/ligature tokens (e.g. arrow_forward, material-icons, wds-ic-search) are filtered. Plain words survive unchanged. Also: - Adds trim() before the prefix checks so leading/trailing whitespace doesn't defeat the wds-ic-* / wds-icon prefix match. - Adds 5 inline unit tests in dom_snapshot::tests covering WDS prefix, delimiter tokens, plain words, multi-word, and empty inputs. - Extends dom_snapshot_2026_05.json with a 4th row (msgJKL012, body "ok") to act as a regression guard for single-word tier-3 bodies. - Updates dom_snapshot_test.rs accordingly (rows_seen 3→4, rows_with_body check 3→4, new find_body_tier3_does_not_drop_single_word_body test). Addresses CodeRabbit major finding on PR tinyhumansai#1804 (dom_snapshot.rs:354-362).

coderabbitai

🧹 Nitpick comments (1)

app/src-tauri/src/whatsapp_scanner/dom_snapshot_test.rs (1)
118-131: ⚡ Quick win

Rename test function to match its assertion.

The function name capture_pipeline_emits_at_least_one_body suggests it verifies ≥ 1 body, but the assertion on line 123 checks rows_with_body >= 4 with the message "all four tiers should produce non-empty bodies". The test actually verifies all four fixture rows produce bodies, not just one.
♻️ Suggested rename for clarity
 #[test]
-fn capture_pipeline_emits_at_least_one_body() {
+fn capture_pipeline_extracts_all_four_bodies() {
     let snap = load_fixture();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/src-tauri/src/whatsapp_scanner/dom_snapshot_test.rs` around lines 118 -
131, Rename the test function capture_pipeline_emits_at_least_one_body to a name
that reflects it asserts all four fixture rows have bodies (e.g.,
capture_pipeline_emits_bodies_for_all_four_tiers or
capture_pipeline_all_four_tiers_have_bodies); update the fn identifier
accordingly so the test name matches the assertion that report.rows_with_body >=
4 (no other logic changes required).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@app/src-tauri/src/whatsapp_scanner/dom_snapshot_test.rs`:
- Around line 118-131: Rename the test function
capture_pipeline_emits_at_least_one_body to a name that reflects it asserts all
four fixture rows have bodies (e.g.,
capture_pipeline_emits_bodies_for_all_four_tiers or
capture_pipeline_all_four_tiers_have_bodies); update the fn identifier
accordingly so the test name matches the assertion that report.rows_with_body >=
4 (no other logic changes required).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1218544d-bdfb-469c-9e5e-c224642e8573

📥 Commits

Reviewing files that changed from the base of the PR and between 87ec046 and a3c6187.

📒 Files selected for processing (3)

app/src-tauri/src/whatsapp_scanner/dom_snapshot.rs
app/src-tauri/src/whatsapp_scanner/dom_snapshot_test.rs
app/src-tauri/src/whatsapp_scanner/test_fixtures/dom_snapshot_2026_05.json

✅ Files skipped from review due to trivial changes (1)

app/src-tauri/src/whatsapp_scanner/test_fixtures/dom_snapshot_2026_05.json

🚧 Files skipped from review as they are similar to previous changes (1)

app/src-tauri/src/whatsapp_scanner/dom_snapshot.rs

capture_pipeline_emits_at_least_one_body checked >= 4 bodies (all four fixture tiers), not just >= 1. Rename to capture_pipeline_extracts_all_four_bodies so the function name matches its assertion. Addresses CodeRabbit nitpick on PR tinyhumansai#1804 (dom_snapshot_test.rs:118-131).

…k, source tag, synthetic chat_id (tinyhumansai#1376) (tinyhumansai#1804) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Steven Enamakel <enamakel@tinyhumans.ai>

oxoxDev requested a review from a team May 15, 2026 09:53

coderabbitai Bot reviewed May 15, 2026

View reviewed changes

coderabbitai Bot previously approved these changes May 15, 2026

View reviewed changes

oxoxDev and others added 6 commits May 15, 2026 17:56

style(whatsapp/dom): cargo fmt fixture-test expect chain (tinyhumansa…

01f0cb7

…i#1376) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

oxoxDev force-pushed the fix/1376-whatsapp-dom-telemetry-fallback branch from d2732bc to 87ec046 Compare May 15, 2026 12:32

senamakel self-assigned this May 16, 2026

senamakel added 2 commits May 15, 2026 19:46

Merge branch 'main' into pr/1804

5cc8670

senamakel dismissed coderabbitai[bot]’s stale review via a3c6187 May 16, 2026 03:02

coderabbitai Bot reviewed May 16, 2026

View reviewed changes

coderabbitai Bot previously approved these changes May 16, 2026

View reviewed changes

senamakel dismissed coderabbitai[bot]’s stale review via 97014c2 May 16, 2026 03:16

coderabbitai Bot approved these changes May 16, 2026

View reviewed changes

senamakel merged commit 4d73bf8 into tinyhumansai:main May 16, 2026
23 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(whatsapp): recover DOM message bodies — telemetry, tier-3 fallback, source tag, synthetic chat_id (#1376)#1804

fix(whatsapp): recover DOM message bodies — telemetry, tier-3 fallback, source tag, synthetic chat_id (#1376)#1804
senamakel merged 9 commits into
tinyhumansai:mainfrom
oxoxDev:fix/1376-whatsapp-dom-telemetry-fallback

oxoxDev commented May 15, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 15, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

oxoxDev commented May 15, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

oxoxDev commented May 15, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Solution

Submission Checklist

Impact

Related

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

Commit & Branch

Validation Run

Validation Blocked

Behavior Changes

Parity Contract

Duplicate / Superseded PR Handling

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

oxoxDev commented May 15, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

oxoxDev commented May 15, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 15, 2026 •

edited

Loading