fix: dedup false positives on scientific and domain-specific text #25

Merged
Grivn merged 4 commits into mnemon-dev:master from chancsc:fix/dedup-false-positives on May 17, 2026

Contributor

@chancsc chancsc commented May 17, 2026

Summary
Four related bugs in internal/search/diff.go caused mnemon remember to silently corrupt memory when storing scientific, survey, or other domain-repetitive text: a valid existing record was replaced with new content even though the two records described different facts.

All fixes include regression tests.


Bug 1 — Bare "not" in negation word list
negationWords included the single word "not". In scientific and research text this word appears constantly:
"species not previously recorded", "population not observed during the dry season"

Any two records containing "not" with similarity ≥ 0.5 were classified as CONFLICT, causing the existing record to be overwritten.

Fix: Remove bare "not" (and bare "不" in Chinese). Only multi-word, unambiguous state-change phrases remain: "no longer", "replaced", "switched from", etc.
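A minimal sketch of the phrase-only check, for illustration. The names `negationPhrases` and `hasNegationPhrase` are hypothetical, the list is only the subset quoted above, and case-insensitive substring matching is an assumption; the actual code in diff.go may differ.

```go
package main

import "strings"

// negationPhrases is an illustrative subset of state-change phrases.
// The real list in diff.go is longer; bare "not" is deliberately absent.
var negationPhrases = []string{"no longer", "replaced", "switched from"}

// hasNegationPhrase reports whether text contains any state-change phrase.
// Matching is a case-insensitive substring test (an assumption of this sketch).
func hasNegationPhrase(text string) bool {
	lower := strings.ToLower(text)
	for _, phrase := range negationPhrases {
		if strings.Contains(lower, phrase) {
			return true
		}
	}
	return false
}
```

With this list, "Species not previously recorded below 1000m." no longer counts as a negation signal, while "We no longer use SQLite." still does.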


Bug 2 — Negation check fires at similarity ≥ 0.5 (too low)
The negation scan ran at the ADD threshold (0.5). At borderline similarity, texts share domain vocabulary — same species, same field, same standard phrasing — without being about the same subject. A negation phrase in an unrelated sentence triggered CONFLICT.

Fix: Gate the negation check at similarity ≥ 0.7. Below that, shared vocabulary is noise, not a conflict signal.
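The gating can be sketched roughly as follows. `classifySketch` and both constant names are hypothetical, and the DUPLICATE case is left out because its threshold is not stated in this PR; only the 0.5 and 0.7 values come from the description above.

```go
package main

const (
	addThresholdSketch      = 0.5 // below this, the new text is stored as a separate record
	negationThresholdSketch = 0.7 // negation phrases are only considered at or above this
)

// classifySketch returns a suggestion for one candidate match.
// negated says whether either text contains a state-change phrase.
func classifySketch(similarity float64, negated bool) string {
	switch {
	case similarity < addThresholdSketch:
		return "ADD"
	case negated && similarity >= negationThresholdSketch:
		return "CONFLICT"
	default:
		return "UPDATE"
	}
}
```

Under this sketch, a pair at similarity 0.55 that happens to contain a negation phrase is treated as UPDATE rather than CONFLICT; the negation signal only takes effect from 0.7 upward.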


Bug 3 — Cosine dedup threshold too low for domain-dense embeddings (f78165f)
nomic-embed-text produces cosine similarity ~0.75 for same-domain different-fact pairs (e.g. two butterfly survey records at different locations). The old threshold of 0.70 let cosine override token similarity and classify distinct records as UPDATE.

Fix: Raise cosine threshold from 0.70 → 0.85. At that level, texts are genuinely near-identical.
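For reference, a small sketch of the cosine gate. The function and constant names are hypothetical and the vectors in the usage note are toy values, not real nomic-embed-text embeddings; only the 0.85 threshold comes from this PR.

```go
package main

import "math"

// cosineDedupThresholdSketch mirrors the new 0.85 threshold described above.
const cosineDedupThresholdSketch = 0.85

// cosineSimilaritySketch returns the cosine of the angle between a and b,
// or 0 when the vectors are empty, mismatched in length, or zero-valued.
func cosineSimilaritySketch(a, b []float64) float64 {
	if len(a) == 0 || len(a) != len(b) {
		return 0
	}
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

// cosineConfirmsDedup reports whether the embeddings are close enough
// for cosine similarity to confirm deduplication.
func cosineConfirmsDedup(a, b []float64) bool {
	return cosineSimilaritySketch(a, b) >= cosineDedupThresholdSketch
}
```

Two unit-length toy vectors such as `{1, 0}` and `{0.75, 0.6614}` have cosine of about 0.75: they would have passed the old 0.70 gate but fall short of 0.85.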


Bug 4 — Token similarity uses bidirectional max, too sensitive for formulaic text (e22d67d)
ContentSimilarity computes max(forward, backward) overlap — formulaic scientific records sharing a species name and standard phrasing scored ~0.5, crossing the ADD threshold and triggering UPDATE.

Fix: Use Jaccard similarity (|A∩B|/|A∪B|) for dedup. It penalises texts that share vocabulary but differ in most tokens. Same-domain different-location pairs score ~0.28 (below ADD threshold); genuine one-word-change updates (SQLite→PostgreSQL) still score ~0.6 (UPDATE as intended). ContentSimilarity is unchanged — bidirectional max remains correct for recall and keyword search ranking.
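The difference between the two measures can be sketched as below. This is not the code in keyword.go: the names are hypothetical, tokenisation is a plain lowercase whitespace split (an assumption, so scores will not match the ~0.28 and ~0.6 figures above), and "bidirectional max" is read here as max(|A∩B|/|A|, |A∩B|/|B|).

```go
package main

import "strings"

// tokenSetSketch splits text on whitespace and returns the set of lowercase tokens.
func tokenSetSketch(text string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		set[tok] = struct{}{}
	}
	return set
}

// intersectionSize counts the tokens present in both sets.
func intersectionSize(a, b map[string]struct{}) int {
	n := 0
	for tok := range a {
		if _, ok := b[tok]; ok {
			n++
		}
	}
	return n
}

// jaccardSketch returns |A∩B| / |A∪B|, or 0 when both texts are empty.
func jaccardSketch(a, b string) float64 {
	sa, sb := tokenSetSketch(a), tokenSetSketch(b)
	inter := intersectionSize(sa, sb)
	union := len(sa) + len(sb) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

// bidirectionalMaxSketch returns the larger of the two one-way overlap ratios.
func bidirectionalMaxSketch(a, b string) float64 {
	sa, sb := tokenSetSketch(a), tokenSetSketch(b)
	if len(sa) == 0 || len(sb) == 0 {
		return 0
	}
	inter := float64(intersectionSize(sa, sb))
	forward := inter / float64(len(sa))
	backward := inter / float64(len(sb))
	if forward > backward {
		return forward
	}
	return backward
}
```

For the toy pair "a b c d" and "a b e f", the bidirectional max is 2/4 = 0.5 while Jaccard is 2/6 ≈ 0.33, because the union in the denominator counts every token that differs.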


Reproduction
```
mnemon remember "Rajah Brooke's Birdwing observed at Kinabalu Park. Species not previously recorded below 1000m."
mnemon remember "Rajah Brooke's Birdwing survey at Raub, Pahang. Population not observed during dry season."
```


Files changed
internal/search/diff.go
Remove "not" / "不" from negation list; gate negation at ≥ 0.7; raise cosine threshold to 0.85; switch token dedup to Jaccard

internal/search/keyword.go
Add JaccardSimilarity()

internal/search/diff_test.go
Regression tests for all diff.go changes

internal/search/keyword_test.go
Unit tests for JaccardSimilarity

chancsc and others added 4 commits May 17, 2026 09:57
nomic-embed-text produces cosine ~0.75 for same-domain different-fact pairs
(e.g. two butterfly survey records at different locations). The old threshold
of 0.70 let cosine override token similarity, incorrectly classifying distinct
insights as UPDATE and replacing the original. Raising to 0.85 ensures cosine
only confirms deduplication when texts are genuinely near-identical.

Adds regression test with controlled 0.75-cosine fake embeddings.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ContentSimilarity (bidirectional max) was too sensitive for formulaic
scientific records: a Raub butterfly entry sharing the species name and
standard phrasing with a Kinabalu entry produced tokenSim=0.5, crossing
the UPDATE threshold and replacing the original.

Jaccard (|A∩B|/|A∪B|) penalises texts that share domain vocabulary but
have many distinct tokens (different facts). Same-domain different-location
pairs now score ~0.28, falling below the 0.5 ADD threshold. Genuine
one-word-change updates (SQLite→PostgreSQL) still score ~0.6 → UPDATE.

ContentSimilarity is unchanged — bidirectional max remains correct for
recall and keyword search.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ity>=0.7

Two bugs caused CONFLICT false positives on butterfly survey data:

1. "not" in negationWords fires on virtually all scientific text
   ("species not previously recorded", "not endemic to region").
   Removed: only multi-word state-change phrases remain as signals.

2. Negation check fired at similarity>=0.5. At borderline similarity,
   texts share domain vocabulary without being about the same subject.
   Now only checked when similarity>=0.7.

Also updates guide.md: PDF/external-document facts must use --no-diff
since each document is a distinct authoritative source.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Member

Grivn commented May 17, 2026

LGTM. Thanks for the fix!

One small thing worth considering after the Jaccard change: Diff still uses matches[0].Suggestion, while matches are ordered by keyword score rather than final Similarity. So a lower-similarity ADD could potentially mask a later UPDATE/DUPLICATE.

I’m okay with merging this PR as-is and handling the selection logic in a follow-up.
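
For illustration only, one possible shape for that selection logic is to pick the match with the highest final similarity instead of the first element. The `matchSketch` type and `bestBySimilarity` are hypothetical; the real match type in diff.go and the eventual follow-up may look quite different.

```go
package main

// matchSketch is a hypothetical stand-in for the match type used by Diff.
type matchSketch struct {
	Suggestion string
	Similarity float64
}

// bestBySimilarity returns the match with the highest Similarity.
// The boolean is false when matches is empty. Ties keep the earlier match.
func bestBySimilarity(matches []matchSketch) (matchSketch, bool) {
	if len(matches) == 0 {
		return matchSketch{}, false
	}
	best := matches[0]
	for _, m := range matches[1:] {
		if m.Similarity > best.Similarity {
			best = m
		}
	}
	return best, true
}
```

With matches ordered by keyword score as `[{ADD 0.3}, {UPDATE 0.6}]`, this returns the UPDATE entry, whereas `matches[0]` would return the ADD.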

@Grivn Grivn merged commit bd0fbe9 into mnemon-dev:master May 17, 2026
1 check passed
Contributor Author

chancsc commented May 17, 2026

Thanks, will check your feedback and open another PR later

@chancsc chancsc deleted the fix/dedup-false-positives branch May 18, 2026 00:05