fix(memory_tree): gate ingest on source_id so summariser tree never sees a source twice #1353

Merged
senamakel merged 5 commits into tinyhumansai:main from senamakel:fix/memory-perm
May 8, 2026

Conversation

@senamakel
Member

@senamakel senamakel commented May 8, 2026

Summary

  • Add a source-level idempotency gate to the memory-tree ingest pipeline so the summariser tree can't see the same (source_kind, source_id) twice.
  • New mem_tree_ingested_sources table claims a source on first ingest; subsequent ingest_chat / ingest_email / ingest_document calls short-circuit.
  • Authoritative claim runs inside the same transaction as chunk / score / job writes, so two concurrent ingests of the same source can't both pass.
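
To illustrate why a claim with `INSERT OR IGNORE` semantics lets only one concurrent ingest through, here is a minimal sketch using only the standard library. `IngestedSources`, `claim` and `race_claims` are hypothetical names, and the `Mutex<HashSet>` is merely a stand-in for the SQLite table and its primary key, not the code in this PR.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical in-memory stand-in for the `mem_tree_ingested_sources` table.
/// The set plays the role of the `(source_kind, source_id)` primary key.
#[derive(Default)]
struct IngestedSources {
    claimed: Mutex<HashSet<(String, String)>>,
}

impl IngestedSources {
    /// Mimics `INSERT OR IGNORE`: returns true only for the caller whose
    /// insert actually created the row; later callers get false.
    fn claim(&self, source_kind: &str, source_id: &str) -> bool {
        let mut claimed = self.claimed.lock().unwrap();
        claimed.insert((source_kind.to_string(), source_id.to_string()))
    }
}

/// Races `n` threads on the same source and counts how many win the claim.
fn race_claims(n: usize) -> usize {
    let store = Arc::new(IngestedSources::default());
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let store = Arc::clone(&store);
            thread::spawn(move || store.claim("document", "doc-1"))
        })
        .collect();
    handles
        .into_iter()
        .map(|handle| handle.join().unwrap())
        .filter(|won| *won)
        .count()
}
```

However many threads race, exactly one `claim` call returns `true`. In the real pipeline the same guarantee would come from the primary-key constraint inside the write transaction rather than from a mutex.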

Problem

Memory items (documents, chat batches, email threads) are append-only: once a source has been ingested, the file is never updated, only added to. But the existing chunk-level idempotency only catches identical (source_kind, source_id, seq, content) tuples. If the same logical source is ingested twice through any path that yields different chunk content (whitespace drift, re-canonicalisation, partial replay), it flows back through extract → admit → buffer → seal, duplicating the same content in the summariser tree. We were observing duplicates in the graph as a result.

Solution

  • src/openhuman/memory/tree/store.rs — new mem_tree_ingested_sources table keyed on (source_kind, source_id), plus is_source_ingested (best-effort lookup) and claim_source_ingest_tx (transactional INSERT OR IGNORE).
  • src/openhuman/memory/tree/ingest.rs
    • Each ingest_* entry point checks is_source_ingested before canonicalisation and short-circuits on hit.
    • Inside persist's transaction, claim_source_ingest_tx is the authoritative gate. If the row already exists the closure returns early and nothing is committed.
    • IngestResult got a new already_ingested: bool field (defaulted via serde for wire compatibility).
  • src/openhuman/memory/slack_ingestion/ops.rs — updated empty-bucket short-circuit for the new field.
  • New test second_ingest_of_same_source_id_is_short_circuited proves a second ingest_document under the same source_id (even with different body) writes nothing.

Trade-off: documents are append-only by design, so the gate uses source_id alone — even mutated bodies under the same id are rejected. That matches the data model and is what we want for the summariser tree.
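
The two-stage flow described above (best-effort pre-check, then an authoritative claim alongside the writes) could be sketched roughly as follows. This is an assumption-laden toy: `Store`, its `persist` method, the line-based chunker and the reduced `IngestResult` are all hypothetical simplifications, and the real implementation runs against SQLite inside a transaction.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified stand-in for the PR's `IngestResult`; only three fields are modelled.
#[derive(Debug, PartialEq)]
struct IngestResult {
    source_id: String,
    chunks_written: usize,
    already_ingested: bool,
}

/// Hypothetical in-memory store standing in for the SQLite-backed one.
#[derive(Default)]
struct Store {
    ingested: HashSet<(String, String)>,
    chunks: HashMap<(String, String), Vec<String>>,
}

impl Store {
    /// Best-effort pre-check, analogous to `is_source_ingested`.
    fn is_source_ingested(&self, kind: &str, id: &str) -> bool {
        self.ingested.contains(&(kind.to_string(), id.to_string()))
    }

    /// Authoritative gate plus writes, analogous to claiming inside the
    /// `persist` transaction. Returns `None` when the claim is lost, in
    /// which case nothing is written.
    fn persist(&mut self, kind: &str, id: &str, chunks: Vec<String>) -> Option<usize> {
        let key = (kind.to_string(), id.to_string());
        if !self.ingested.insert(key.clone()) {
            return None;
        }
        let written = chunks.len();
        self.chunks.insert(key, chunks);
        Some(written)
    }
}

fn ingest_document(store: &mut Store, source_id: &str, body: &str) -> IngestResult {
    let already = |id: &str| IngestResult {
        source_id: id.to_string(),
        chunks_written: 0,
        already_ingested: true,
    };
    // Fast path: skip chunking entirely when the source is already known.
    if store.is_source_ingested("document", source_id) {
        return already(source_id);
    }
    // Toy chunker (assumption): one chunk per non-empty line.
    let chunks: Vec<String> = body
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(str::to_string)
        .collect();
    match store.persist("document", source_id, chunks) {
        Some(written) => IngestResult {
            source_id: source_id.to_string(),
            chunks_written: written,
            already_ingested: false,
        },
        // Lost the claim between pre-check and persist: report a no-op.
        None => already(source_id),
    }
}
```

A second `ingest_document` call with the same `source_id` returns `already_ingested: true` and writes nothing, even when the body differs, which mirrors the trade-off stated above.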

Submission Checklist

  • Tests added or updated (happy path + at least one failure / edge case) per docs/TESTING-STRATEGY.md
  • N/A: backend-only Rust change; coverage gate measured by CI on changed Rust lines via cargo-llvm-cov
  • N/A: behaviour-only change to existing memory-tree feature
  • N/A: no new feature IDs introduced
  • No new external network dependencies introduced (mock backend used per docs/TESTING-STRATEGY.md)
  • N/A: does not touch release-cut surfaces
  • N/A: no linked issue

Impact

  • Runtime: desktop core (Rust). One additive SQLite table on the existing chunks.db. Schema is created via CREATE TABLE IF NOT EXISTS so existing workspaces upgrade transparently.
  • Behaviour change: a re-ingest of an already-ingested (source_kind, source_id) is now a no-op returning IngestResult { already_ingested: true, .. }. Callers that today re-drove ingest as a way to refresh content will need a different mechanism — but per the data model, "memory items are final once ingested", so this is the intended contract.
  • Performance: extra cheap SQLite lookup at the head of each ingest_*; saves all downstream LLM extraction cost on duplicates.
  • Security / migration / compatibility: none beyond the new table.

Related

  • Closes:
  • Follow-up PR(s)/TODOs:

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

  • Key: N/A
  • URL: N/A

Commit & Branch

  • Branch: fix/memory-perm
  • Commit SHA: cd7b29a56f2c9ac868c3bece2738e20ee869d7ae

Validation Run

  • pnpm --filter openhuman-app format:check
  • pnpm typecheck
  • Focused tests: cargo test --lib openhuman::memory::tree::ingest:: — 5/5 passing
  • Rust fmt/check (if changed): cargo fmt --check + cargo check --manifest-path Cargo.toml
  • N/A: Tauri shell not touched

Validation Blocked

  • command: N/A
  • error: N/A
  • impact: N/A

Behavior Changes

  • Intended behavior change: re-ingest of the same (source_kind, source_id) becomes a no-op.
  • User-visible effect: the summariser tree no longer accumulates duplicate content for sources replayed through ingest.

Parity Contract

  • Legacy behavior preserved: chunk-level idempotency guard inside persist is unchanged; the new gate sits in front of it.
  • Guard/fallback/dispatch parity checks: IngestResult got a new field with #[serde(default)] so old callers / persisted JSON deserialise unchanged.

Duplicate / Superseded PR Handling

  • Duplicate PR(s): N/A
  • Canonical PR: N/A
  • Resolution (closed/superseded/updated): N/A

Summary by CodeRabbit

  • New Features

    • Source-level deduplication for documents: re-submitted documents are detected and skipped early.
    • Ingest results now explicitly indicate when a source was already processed.
  • Bug Fixes

    • Prevents duplicate storage, chunk writes, and extraction-job enqueueing for already-ingested sources.

@senamakel senamakel requested a review from a team May 8, 2026 01:13
@coderabbitai
Contributor

coderabbitai Bot commented May 8, 2026

Walkthrough

Adds source-level deduplication for document ingests: new ingested-sources table, public pre-check and transactional claim APIs, IngestResult.already_ingested field and constructors, early pre-check short-circuit for documents, transactional gate in persist, tests, and a Slack empty-bucket result update.

Changes

Source-level deduplication in memory ingestion

| Layer / File(s) | Summary |
| --- | --- |
| Database schema (src/openhuman/memory/tree/store.rs) | New mem_tree_ingested_sources table with (source_kind, source_id) primary key and ingested_at_ms. |
| Storage deduplication APIs (src/openhuman/memory/tree/store.rs) | Adds is_source_ingested (best-effort SELECT) and claim_source_ingest_tx (transactional INSERT OR IGNORE) to check and claim source ingests. |
| IngestResult shape & constructors (src/openhuman/memory/tree/ingest.rs) | Adds pub already_ingested: bool (#[serde(default)]) and constructors for already-ingested vs. normal results. |
| Early dedup checks (src/openhuman/memory/tree/ingest.rs) | Adds already_ingested helper (spawn_blocking -> store::is_source_ingested) and a document-only early-return that short-circuits canonicalisation/chunking/persist when true. |
| Transactional persistence gating (src/openhuman/memory/tree/ingest.rs) | persist uses claim_source_ingest_tx inside the DB transaction; on claim failure returns Ok(None) to skip chunk upserts/enqueue and maps that to IngestResult::already_ingested. |
| Tests (src/openhuman/memory/tree/ingest.rs) | Adds test second_ingest_of_same_source_id_is_short_circuited asserting second ingest returns already_ingested, writes 0 chunks, and store retains only first ingest's chunks after drain. |
| Slack integration (src/openhuman/memory/slack_ingestion/ops.rs) | Empty-bucket ingest_bucket early-return now explicitly sets already_ingested: false with zeroed chunk counts and empty chunk_ids. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • graycyrus

Poem

🐰 I nibble logs and mark the seeds sown,
A table remembers what once was known.
Claims and checks hop in tidy rows,
No twin data sprouts where the true source goes.
Hooray — one-and-done for each little tone!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding source-level idempotency gating to prevent the summariser tree from seeing duplicate sources. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

senamakel added 2 commits May 7, 2026 18:18
…ees a source twice

Memory items (documents, chat batches, email threads) are append-only —
once `(source_kind, source_id)` is ingested, re-ingesting must not flow
through extract → admit → buffer → seal again, otherwise the same
content lands in the summariser tree twice.

- New `mem_tree_ingested_sources` table keyed on `(source_kind, source_id)`.
- `ingest_chat` / `ingest_email` / `ingest_document` short-circuit on the
  fast-path lookup before canonicalisation.
- `persist` claims the row inside the same transaction as the chunk /
  score / job writes via `INSERT OR IGNORE`, so two concurrent ingests
  of the same source can't both pass the gate.
- `IngestResult.already_ingested` surfaces the no-op to callers.
coderabbitai Bot previously approved these changes May 8, 2026
Chat (`slack:{conn}`) and email (`gmail:{participants}`) `source_id`s
are stream identifiers — many batches / threads accumulate under one
source over time. The previous source-level gate made every bucket
after the first a no-op, breaking the slack workspace tree fill /
seal cascade and the gmail per-participant append flow, and turning
`read_rpc::tests::list_sources_aggregates` red.

Document `source_id`s on the other hand identify a single immutable
file (one notion page, one drive doc), so the gate stays in place
for `ingest_document`. Chat / email keep their existing chunk-level
idempotency (`chunk_id` includes content) which already swallows true
replays.
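
A rough sketch of the split described in this commit message: the source-level gate applies to documents only, while chat and email rely on content-addressed chunk ids. `SourceKind`, `source_gate_applies`, `chunk_id` and `upsert_chunks` are hypothetical names, and `DefaultHasher` is only a stand-in for whatever hash the real `chunk_id` uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

#[derive(Clone, Copy, Debug, Hash, PartialEq, Eq)]
enum SourceKind {
    Chat,
    Email,
    Document,
}

/// Only documents identify a single immutable item, so only they get the
/// source-level gate; chat and email ids name streams that keep growing.
fn source_gate_applies(kind: SourceKind) -> bool {
    matches!(kind, SourceKind::Document)
}

/// Hypothetical content-addressed chunk id: the content is part of the hash
/// input, so a byte-identical replay maps to the same id.
fn chunk_id(kind: SourceKind, source_id: &str, seq: u32, content: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    (kind, source_id, seq, content).hash(&mut hasher);
    hasher.finish()
}

/// Upserts chunk ids into `seen` and returns how many were actually new.
fn upsert_chunks(
    seen: &mut HashSet<u64>,
    kind: SourceKind,
    source_id: &str,
    contents: &[&str],
) -> usize {
    let mut new_chunks = 0;
    for (seq, content) in contents.iter().enumerate() {
        if seen.insert(chunk_id(kind, source_id, seq as u32, *content)) {
            new_chunks += 1;
        }
    }
    new_chunks
}
```

Under these assumptions, replaying an identical chat batch adds no chunks, while a later batch with new content under the same stream `source_id` is still accepted. That second case is exactly what the original source-level gate had blocked for chat and email.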
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/memory/tree/ingest.rs`:
- Around line 120-124: The debug logs emit raw source_id; update the logging in
the ingest flow (e.g., where already_ingested(...) is checked and the other
debug at lines ~223-225) to log an opaque value instead: compute a short
deterministic hash or redacted token from source_id (e.g., SHA256 and take first
N chars) and use that hashed_id in log messages instead of the raw source_id;
apply the same change for both occurrences (refer to already_ingested(...) call
and the other duplicate-path debug log) so logs contain only the hashed/opaque
correlation id.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ba4e0341-cad1-4c60-9b29-9e6ee95b04a1

📥 Commits

Reviewing files that changed from the base of the PR and between 9fbdf0a and 6e773e0.

📒 Files selected for processing (1)
  • src/openhuman/memory/tree/ingest.rs

Comment thread src/openhuman/memory/tree/ingest.rs
Per CodeRabbit (and existing convention in `chunker.rs` /
`composio/providers/gmail/ingest.rs`), raw `source_id` values are
recoverable inputs and should not be emitted to logs. Switch the
two new duplicate-path debug lines in `ingest.rs` to log
`source_id_hash=<redact()>` instead.
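
As an illustration of the redaction idea, the sketch below logs a short deterministic token instead of the raw id. The `redact` body here is an assumption, not the repository's helper: it uses `DefaultHasher` as a std-only stand-in, whereas the review suggested a SHA-256 prefix, and `duplicate_log_line` and its message text are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the repository's `redact()` helper.
/// `DefaultHasher` is not cryptographic and its algorithm may change between
/// Rust releases; a real helper would more likely use a SHA-256 prefix.
fn redact(source_id: &str) -> String {
    let mut hasher = DefaultHasher::new();
    source_id.hash(&mut hasher);
    // 16 hex digits: enough to correlate log lines without echoing the id.
    format!("{:016x}", hasher.finish())
}

/// Builds the duplicate-path debug line with the opaque token only.
fn duplicate_log_line(source_kind: &str, source_id: &str) -> String {
    format!(
        "source already ingested, skipping: source_kind={} source_id_hash={}",
        source_kind,
        redact(source_id)
    )
}
```

The same `source_id` always yields the same token within one build, so duplicate-path log lines can still be correlated without exposing the underlying identifier.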
Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
src/openhuman/memory/tree/ingest.rs (1)

300-309: 💤 Low value

LGTM — Race-loss handling is correct.

Not waking workers on the "already ingested" path is appropriate since no jobs were enqueued.

One minor consideration: if the pre-check passes but the transactional claim fails (rare concurrent ingest race), the staged content files from line 172 remain orphaned on disk. This is low severity given the rarity of the race and could be addressed with a periodic GC pass if needed.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/memory/tree/ingest.rs` around lines 300 - 309, When the
transactional claim loses (the match on written yields None and returns
Ok(IngestResult::already_ingested(source_id))), the staged content files created
earlier in this function remain orphaned; before returning, delete those staged
files. Locate the staging step used earlier in this function (the variable/paths
holding the staged content) and add a cleanup call to remove those files or
atomically roll back staging, then return
Ok(IngestResult::already_ingested(source_id)); keep the jobs::wake_workers()
behavior unchanged.
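
One way the suggested cleanup could look, sketched with the standard library only. The staging layout, the function names and the `claim_won` flag are all hypothetical; the real function's staging step and paths are not shown in this thread.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical staging step: writes each chunk body to `dir` and returns
/// the paths so the caller can roll them back later.
fn stage_chunks(dir: &Path, chunks: &[&str]) -> io::Result<Vec<PathBuf>> {
    fs::create_dir_all(dir)?;
    let mut staged = Vec::with_capacity(chunks.len());
    for (seq, body) in chunks.iter().enumerate() {
        let path = dir.join(format!("chunk-{seq}.txt"));
        fs::write(&path, body.as_bytes())?;
        staged.push(path);
    }
    Ok(staged)
}

/// Best-effort cleanup: a file that is already gone is not an error.
fn remove_staged(staged: &[PathBuf]) {
    for path in staged {
        if let Err(err) = fs::remove_file(path) {
            if err.kind() != io::ErrorKind::NotFound {
                eprintln!("failed to remove staged file {}: {err}", path.display());
            }
        }
    }
}

/// Returns `Ok(true)` when the ingest went through and `Ok(false)` when the
/// transactional claim was lost; `claim_won` stands in for that outcome.
fn ingest_with_cleanup(dir: &Path, chunks: &[&str], claim_won: bool) -> io::Result<bool> {
    let staged = stage_chunks(dir, chunks)?;
    if !claim_won {
        // Race lost: nothing was committed, so drop the staged files too.
        remove_staged(&staged);
        return Ok(false);
    }
    Ok(true)
}
```

Keeping the cleanup best-effort means a failed delete never turns the already-ingested no-op into an error; a periodic GC pass, as mentioned in the review, could still sweep up anything left behind.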
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ccd3a7c1-222f-42f6-93f6-dc2c99f59ab8

📥 Commits

Reviewing files that changed from the base of the PR and between 6e773e0 and f5e02dd.

📒 Files selected for processing (1)
  • src/openhuman/memory/tree/ingest.rs

coderabbitai Bot previously approved these changes May 8, 2026
@senamakel senamakel merged commit 0bc7457 into tinyhumansai:main May 8, 2026
18 checks passed
AusAgentSmith pushed a commit to AusAgentSmith/openhuman that referenced this pull request May 23, 2026