fix(memory_tree): gate ingest on source_id so summariser tree never sees a source twice by senamakel · Pull Request #1353 · tinyhumansai/openhuman

senamakel · 2026-05-08T01:13:34Z

Summary

Add a source-level idempotency gate to the memory-tree ingest pipeline so the summariser tree can't see the same (source_kind, source_id) twice.
New mem_tree_ingested_sources table claims a source on first ingest; subsequent ingest_chat / ingest_email / ingest_document calls short-circuit.
Authoritative claim runs inside the same transaction as chunk / score / job writes, so two concurrent ingests of the same source can't both pass.

Problem

Memory items (documents, chat batches, email threads) are append-only: once a source has been ingested, the file is never updated, only added to. But the existing chunk-level idempotency only catches identical (source_kind, source_id, seq, content) tuples. If the same logical source is ingested twice through any path that yields different chunk content (whitespace drift, re-canonicalisation, partial replay), it flows back through extract → admit → buffer → seal, duplicating the same content into the summariser tree. We were observing duplicates in the graph as a result.
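To make the failure mode concrete, here is a minimal sketch of why a content-sensitive chunk key misses a replay with whitespace drift while a source-level key does not. The `chunk_key` function and the `notion:page-1` id are hypothetical; the real `chunk_id` derivation is not shown in this PR, and `DefaultHasher` is only a stand-in.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical chunk key: hashes (source_kind, source_id, seq, content).
/// Any change to `content` produces a different key.
pub fn chunk_key(source_kind: &str, source_id: &str, seq: u32, content: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    (source_kind, source_id, seq, content).hash(&mut hasher);
    hasher.finish()
}

/// Source-level key: ignores the content entirely, so drift cannot change it.
pub fn source_key<'a>(source_kind: &'a str, source_id: &'a str) -> (&'a str, &'a str) {
    (source_kind, source_id)
}
```

With these definitions, `"hello world"` and `"hello  world"` under the same source hash to different chunk keys, so a chunk-level guard treats the replay as new content, while `source_key` is identical for both.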

Solution

src/openhuman/memory/tree/store.rs — new mem_tree_ingested_sources table keyed on (source_kind, source_id), plus is_source_ingested (best-effort lookup) and claim_source_ingest_tx (transactional INSERT OR IGNORE).
src/openhuman/memory/tree/ingest.rs —
- Each ingest_* entry point checks is_source_ingested before canonicalisation and short-circuits on hit.
- Inside persist's transaction, claim_source_ingest_tx is the authoritative gate. If the row already exists the closure returns early and nothing is committed.
- IngestResult got a new already_ingested: bool field (defaulted via serde for wire compatibility).
src/openhuman/memory/slack_ingestion/ops.rs — updated empty-bucket short-circuit for the new field.
New test second_ingest_of_same_source_id_is_short_circuited proves a second ingest_document under the same source_id (even with different body) writes nothing.

Trade-off: documents are append-only by design, so the gate uses source_id alone — even mutated bodies under the same id are rejected. That matches the data model and is what we want for the summariser tree.
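The two-step gate described above (best-effort pre-check, then an authoritative claim) can be sketched with an in-memory stand-in. This is an illustration under assumptions, not the code in `store.rs`: the `IngestedSources` type and `race_claims` helper are hypothetical, and a `Mutex<HashSet>` plays the role of the SQLite transaction plus `INSERT OR IGNORE`.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

/// In-memory stand-in for the `mem_tree_ingested_sources` table.
#[derive(Default)]
pub struct IngestedSources {
    rows: Mutex<HashSet<(String, String)>>,
}

impl IngestedSources {
    /// Best-effort lookup (mirrors the role of `is_source_ingested`): cheap,
    /// but another ingest may still claim the source after it returns false.
    pub fn is_source_ingested(&self, kind: &str, id: &str) -> bool {
        let rows = self.rows.lock().unwrap();
        rows.contains(&(kind.to_string(), id.to_string()))
    }

    /// Authoritative claim (mirrors the role of `claim_source_ingest_tx`):
    /// returns true only for the first caller, like an INSERT OR IGNORE
    /// that actually inserted a row.
    pub fn claim(&self, kind: &str, id: &str) -> bool {
        let mut rows = self.rows.lock().unwrap();
        rows.insert((kind.to_string(), id.to_string()))
    }
}

/// Races `n` threads to claim the same source and counts the winners.
pub fn race_claims(store: &IngestedSources, n: usize) -> usize {
    std::thread::scope(|s| {
        let mut handles = Vec::new();
        for _ in 0..n {
            handles.push(s.spawn(|| store.claim("document", "notion:page-1")));
        }
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .filter(|won| *won)
            .count()
    })
}
```

However many threads race, exactly one claim succeeds; the pre-check only exists to skip canonicalisation work early and is never trusted on its own.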

Submission Checklist

Tests added or updated (happy path + at least one failure / edge case) per docs/TESTING-STRATEGY.md
N/A: backend-only Rust change; coverage gate measured by CI on changed Rust lines via cargo-llvm-cov
N/A: behaviour-only change to existing memory-tree feature
N/A: no new feature IDs introduced
No new external network dependencies introduced (mock backend used per docs/TESTING-STRATEGY.md)
N/A: does not touch release-cut surfaces
N/A: no linked issue

Impact

Runtime: desktop core (Rust). One additive SQLite table on the existing chunks.db. Schema is created via CREATE TABLE IF NOT EXISTS so existing workspaces upgrade transparently.
Behaviour change: a re-ingest of an already-ingested (source_kind, source_id) is now a no-op returning IngestResult { already_ingested: true, .. }. Callers that today re-drove ingest as a way to refresh content will need a different mechanism — but per the data model, "memory items are final once ingested", so this is the intended contract.
Performance: extra cheap SQLite lookup at the head of each ingest_*; saves all downstream LLM extraction cost on duplicates.
Security / migration / compatibility: none beyond the new table.
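For orientation, the additive table and the claim statement might look roughly like the constants below. The table name, the `(source_kind, source_id)` key, the `ingested_at_ms` column, and the use of `CREATE TABLE IF NOT EXISTS` / `INSERT OR IGNORE` come from this PR; the column types, the placeholder style, and the `claim_won` helper are assumptions.

```rust
/// Assumed DDL: names come from the PR text, column types are guesses.
pub const CREATE_INGESTED_SOURCES: &str = "\
CREATE TABLE IF NOT EXISTS mem_tree_ingested_sources (
    source_kind    TEXT    NOT NULL,
    source_id      TEXT    NOT NULL,
    ingested_at_ms INTEGER NOT NULL,
    PRIMARY KEY (source_kind, source_id)
)";

/// Assumed claim statement: a duplicate primary key changes zero rows
/// instead of raising a constraint error.
pub const CLAIM_SOURCE: &str = "\
INSERT OR IGNORE INTO mem_tree_ingested_sources
    (source_kind, source_id, ingested_at_ms)
VALUES (?1, ?2, ?3)";

/// Interprets the affected-row count of the claim statement:
/// 1 means this caller won the claim, 0 means the source was already ingested.
pub fn claim_won(rows_changed: usize) -> bool {
    rows_changed == 1
}
```

Because the DDL is idempotent, running it on every startup is enough for existing workspaces to pick up the table without a separate migration step.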

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

Key: N/A
URL: N/A

Commit & Branch

Branch: fix/memory-perm
Commit SHA: cd7b29a56f2c9ac868c3bece2738e20ee869d7ae

Validation Run

pnpm --filter openhuman-app format:check
pnpm typecheck
Focused tests: cargo test --lib openhuman::memory::tree::ingest:: — 5/5 passing
Rust fmt/check (if changed): cargo fmt --check + cargo check --manifest-path Cargo.toml
N/A: Tauri shell not touched

Validation Blocked

command: N/A
error: N/A
impact: N/A

Behavior Changes

Intended behavior change: re-ingest of the same (source_kind, source_id) becomes a no-op.
User-visible effect: the summariser tree no longer accumulates duplicate content for sources replayed through ingest.

Parity Contract

Legacy behavior preserved: chunk-level idempotency guard inside persist is unchanged; the new gate sits in front of it.
Guard/fallback/dispatch parity checks: IngestResult got a new field with #[serde(default)] so old callers / persisted JSON deserialise unchanged.

Duplicate / Superseded PR Handling

Duplicate PR(s): N/A
Canonical PR: N/A
Resolution (closed/superseded/updated): N/A

Summary by CodeRabbit

New Features
- Source-level deduplication for documents: re-submitted documents are detected and skipped early.
- Ingest results now explicitly indicate when a source was already processed.
Bug Fixes
- Prevents duplicate storage, chunk writes, and extraction-job enqueueing for already-ingested sources.

coderabbitai · 2026-05-08T01:17:56Z

📝 Walkthrough

Walkthrough

Adds source-level deduplication for document ingests: new ingested-sources table, public pre-check and transactional claim APIs, IngestResult.already_ingested field and constructors, early pre-check short-circuit for documents, transactional gate in persist, tests, and a Slack empty-bucket result update.

Changes

Source-level deduplication in memory ingestion

| Layer / File(s) | Summary |
| --- | --- |
| Database schema `src/openhuman/memory/tree/store.rs` | New `mem_tree_ingested_sources` table with `(source_kind, source_id)` primary key and `ingested_at_ms`. |
| Storage deduplication APIs `src/openhuman/memory/tree/store.rs` | Adds `is_source_ingested` (best-effort SELECT) and `claim_source_ingest_tx` (transactional INSERT OR IGNORE) to check and claim source ingests. |
| IngestResult shape & constructors `src/openhuman/memory/tree/ingest.rs` | Adds `pub already_ingested: bool` (`#[serde(default)]`) and constructors for already-ingested vs. normal results. |
| Early dedup checks `src/openhuman/memory/tree/ingest.rs` | Adds `already_ingested` helper (spawn_blocking -> `store::is_source_ingested`) and a document-only early-return that short-circuits canonicalisation/chunking/persist when true. |
| Transactional persistence gating `src/openhuman/memory/tree/ingest.rs` | `persist` uses `claim_source_ingest_tx` inside the DB transaction; on claim failure returns `Ok(None)` to skip chunk upserts/enqueue and maps that to `IngestResult::already_ingested`. |
| Tests `src/openhuman/memory/tree/ingest.rs` | Adds test `second_ingest_of_same_source_id_is_short_circuited` asserting second ingest returns `already_ingested`, writes 0 chunks, and store retains only first ingest's chunks after drain. |
| Slack integration `src/openhuman/memory/slack_ingestion/ops.rs` | Empty-bucket `ingest_bucket` early-return now explicitly sets `already_ingested: false` with zeroed chunk counts and empty `chunk_ids`. |
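The mapping from a lost claim to a no-op result can be sketched as follows. This is a simplified stand-in: only `already_ingested`, `chunk_ids`, and the `Ok(None)` convention are named in the PR, while `chunks_written`, `written`, `persist`'s signature, and `finish_ingest` are hypothetical.

```rust
/// Simplified stand-in for the real `IngestResult`.
#[derive(Debug, PartialEq)]
pub struct IngestResult {
    pub source_id: String,
    pub chunks_written: usize, // assumed name for the chunk count
    pub chunk_ids: Vec<String>,
    pub already_ingested: bool,
}

impl IngestResult {
    /// No-op result: nothing written because the source was claimed earlier.
    pub fn already_ingested(source_id: &str) -> Self {
        Self {
            source_id: source_id.to_string(),
            chunks_written: 0,
            chunk_ids: Vec::new(),
            already_ingested: true,
        }
    }

    /// Normal result for a first ingest (hypothetical constructor name).
    pub fn written(source_id: &str, chunk_ids: Vec<String>) -> Self {
        Self {
            source_id: source_id.to_string(),
            chunks_written: chunk_ids.len(),
            chunk_ids,
            already_ingested: false,
        }
    }
}

/// Stand-in for the transactional step: `claimed` is the outcome of the
/// in-transaction claim. A lost claim yields Ok(None) and writes nothing.
pub fn persist(claimed: bool, chunk_ids: Vec<String>) -> Result<Option<Vec<String>>, String> {
    if !claimed {
        return Ok(None);
    }
    Ok(Some(chunk_ids))
}

/// Maps the persist outcome onto the caller-facing result.
pub fn finish_ingest(
    source_id: &str,
    claimed: bool,
    chunk_ids: Vec<String>,
) -> Result<IngestResult, String> {
    match persist(claimed, chunk_ids)? {
        None => Ok(IngestResult::already_ingested(source_id)),
        Some(ids) => Ok(IngestResult::written(source_id, ids)),
    }
}
```

Returning `Ok(None)` rather than an error keeps a duplicate ingest on the success path, so callers only need to inspect `already_ingested`.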

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

tinyhumansai/openhuman#325: Touches the memory ingestion path and calls ingest_document, related to document ingest behavior.

Suggested reviewers

graycyrus

Poem

🐰 I nibble logs and mark the seeds sown,
A table remembers what once was known.
Claims and checks hop in tidy rows,
No twin data sprouts where the true source goes.
Hooray — one-and-done for each little tone!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding source-level idempotency gating to prevent the summariser tree from seeing duplicate sources. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

…ees a source twice

Memory items (documents, chat batches, email threads) are append-only: once `(source_kind, source_id)` is ingested, re-ingesting must not flow through extract → admit → buffer → seal again, otherwise the same content lands in the summariser tree twice.

- New `mem_tree_ingested_sources` table keyed on `(source_kind, source_id)`.
- `ingest_chat` / `ingest_email` / `ingest_document` short-circuit on the fast-path lookup before canonicalisation.
- `persist` claims the row inside the same transaction as the chunk / score / job writes via `INSERT OR IGNORE`, so two concurrent ingests of the same source can't both pass the gate.
- `IngestResult.already_ingested` surfaces the no-op to callers.

Chat (`slack:{conn}`) and email (`gmail:{participants}`) `source_id`s are stream identifiers: many batches / threads accumulate under one source over time. The previous source-level gate made every bucket after the first a no-op, breaking the slack workspace tree fill / seal cascade and the gmail per-participant append flow, and turning `read_rpc::tests::list_sources_aggregates` red.

Document `source_id`s, on the other hand, identify a single immutable file (one notion page, one drive doc), so the gate stays in place for `ingest_document`. Chat / email keep their existing chunk-level idempotency (`chunk_id` includes content), which already swallows true replays.
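A minimal sketch of the narrowed rule, assuming a `SourceKind` enum with these three variants (the real type and its variant names are not shown here, and `admit` and the example ids are hypothetical): only documents consult the source-level gate, while chat and email always fall through to chunk-level idempotency.

```rust
use std::collections::HashSet;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum SourceKind {
    Chat,
    Email,
    Document,
}

impl SourceKind {
    /// Only a document `source_id` names one immutable item; chat and email
    /// ids name streams that keep receiving new batches.
    pub fn uses_source_gate(self) -> bool {
        matches!(self, SourceKind::Document)
    }
}

/// Returns true if the ingest may proceed to the chunk-level stage.
pub fn admit(
    claimed: &mut HashSet<(SourceKind, String)>,
    kind: SourceKind,
    source_id: &str,
) -> bool {
    if !kind.uses_source_gate() {
        // Stream sources: rely on chunk-level idempotency instead.
        return true;
    }
    // Document sources: first claim wins, later ones are rejected.
    claimed.insert((kind, source_id.to_string()))
}
```

With this rule, a second Slack bucket under the same `slack:{conn}` id is admitted, but a second ingest of the same document id is not.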

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/memory/tree/ingest.rs`:
- Around line 120-124: The debug logs emit raw source_id; update the logging in
the ingest flow (e.g., where already_ingested(...) is checked and the other
debug at lines ~223-225) to log an opaque value instead: compute a short
deterministic hash or redacted token from source_id (e.g., SHA256 and take first
N chars) and use that hashed_id in log messages instead of the raw source_id;
apply the same change for both occurrences (refer to already_ingested(...) call
and the other duplicate-path debug log) so logs contain only the hashed/opaque
correlation id.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ba4e0341-cad1-4c60-9b29-9e6ee95b04a1

📥 Commits

Reviewing files that changed from the base of the PR and between 9fbdf0a and 6e773e0.

📒 Files selected for processing (1)

src/openhuman/memory/tree/ingest.rs

Per CodeRabbit (and existing convention in `chunker.rs` / `composio/providers/gmail/ingest.rs`), raw `source_id` values are recoverable inputs and should not be emitted to logs. Switch the two new duplicate-path debug lines in `ingest.rs` to log `source_id_hash=<redact()>` instead.
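As an illustration of logging an opaque correlation id instead of the raw value, here is a std-only sketch. It is not the project's `redact()` helper, whose implementation is not shown here: `source_id_hash` and `duplicate_log_line` are hypothetical, and `DefaultHasher` is a non-cryptographic stand-in for the SHA-256 truncation the review suggests.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical redaction helper: a short, deterministic token derived from
/// `source_id`. DefaultHasher is NOT cryptographic; a real implementation
/// would more likely truncate a SHA-256 digest.
pub fn source_id_hash(source_id: &str) -> String {
    let mut hasher = DefaultHasher::new();
    source_id.hash(&mut hasher);
    let hex = format!("{:016x}", hasher.finish());
    hex[..12].to_string()
}

/// Builds the duplicate-path debug line without the raw `source_id`.
pub fn duplicate_log_line(source_id: &str) -> String {
    format!(
        "ingest skipped: source already ingested source_id_hash={}",
        source_id_hash(source_id)
    )
}
```

The token stays stable for a given input, so repeated log lines about the same source can still be correlated without exposing participant addresses or page ids.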

coderabbitai

🧹 Nitpick comments (1)

src/openhuman/memory/tree/ingest.rs (1)
300-309: 💤 Low value

LGTM — Race-loss handling is correct.

Not waking workers on the "already ingested" path is appropriate since no jobs were enqueued.

One minor consideration: if the pre-check passes but the transactional claim fails (rare concurrent ingest race), the staged content files from line 172 remain orphaned on disk. This is low severity given the rarity of the race and could be addressed with a periodic GC pass if needed.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/memory/tree/ingest.rs` around lines 300 - 309, When the
transactional claim loses (the match on written yields None and returns
Ok(IngestResult::already_ingested(source_id))), the staged content files created
earlier in this function remain orphaned; before returning, delete those staged
files. Locate the staging step used earlier in this function (the variable/paths
holding the staged content) and add a cleanup call to remove those files or
atomically roll back staging, then return
Ok(IngestResult::already_ingested(source_id)); keep the jobs::wake_workers()
behavior unchanged.
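One way the suggested cleanup could look, sketched with `std::fs` only. The staging layout, file names, and all three function names are hypothetical; the real staging step in `ingest.rs` is not shown in this PR.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Writes each chunk body to `dir` and returns the paths that were staged.
pub fn stage_chunks(dir: &Path, bodies: &[&str]) -> io::Result<Vec<PathBuf>> {
    fs::create_dir_all(dir)?;
    let mut staged = Vec::new();
    for (seq, body) in bodies.iter().enumerate() {
        let path = dir.join(format!("chunk-{seq}.txt"));
        fs::write(&path, body)?;
        staged.push(path);
    }
    Ok(staged)
}

/// Removes staged files after a lost claim; a missing file is not an error.
pub fn discard_staged(staged: &[PathBuf]) -> io::Result<()> {
    for path in staged {
        match fs::remove_file(path) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::NotFound => {}
            Err(e) => return Err(e),
        }
    }
    Ok(())
}

/// Stages content, then keeps it only if the transactional claim was won.
/// Returns the surviving paths (empty when the claim was lost).
pub fn stage_then_claim(
    dir: &Path,
    bodies: &[&str],
    claim_won: bool,
) -> io::Result<Vec<PathBuf>> {
    let staged = stage_chunks(dir, bodies)?;
    if !claim_won {
        discard_staged(&staged)?;
        return Ok(Vec::new());
    }
    Ok(staged)
}
```

Cleaning up inline keeps the race-loss path self-contained; the periodic GC pass mentioned in the nitpick would be the alternative if staged paths are not easily available at that point.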

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ccd3a7c1-222f-42f6-93f6-dc2c99f59ab8

📥 Commits

Reviewing files that changed from the base of the PR and between 6e773e0 and f5e02dd.

📒 Files selected for processing (1)

src/openhuman/memory/tree/ingest.rs

senamakel requested a review from a team May 8, 2026 01:13

senamakel added 2 commits May 7, 2026 18:18

chore(format): cargo fmt

9fbdf0a

senamakel force-pushed the fix/memory-perm branch from cd7b29a to 9fbdf0a Compare May 8, 2026 01:19

coderabbitai Bot previously approved these changes May 8, 2026

senamakel dismissed coderabbitai[bot]’s stale review via 6e773e0 May 8, 2026 02:08

coderabbitai Bot requested changes May 8, 2026

Comment thread src/openhuman/memory/tree/ingest.rs

coderabbitai Bot reviewed May 8, 2026

coderabbitai Bot previously approved these changes May 8, 2026

Merge remote-tracking branch 'upstream/main' into fix/memory-perm

4a8cedb

senamakel dismissed coderabbitai[bot]’s stale review via 4a8cedb May 8, 2026 02:26

senamakel merged commit 0bc7457 into tinyhumansai:main May 8, 2026
18 checks passed

AusAgentSmith pushed a commit to AusAgentSmith/openhuman that referenced this pull request May 23, 2026

fix(memory_tree): gate ingest on source_id so summariser tree never s…

c350373

…ees a source twice (tinyhumansai#1353)

senamakel merged 5 commits into tinyhumansai:main from senamakel:fix/memory-perm
