fix: dedup false positives on scientific and domain-specific text #25
Merged
nomic-embed-text produces cosine ~0.75 for same-domain different-fact pairs (e.g. two butterfly survey records at different locations). The old threshold of 0.70 let cosine override token similarity, incorrectly classifying distinct insights as UPDATE and replacing the original. Raising to 0.85 ensures cosine only confirms deduplication when texts are genuinely near-identical. Adds regression test with controlled 0.75-cosine fake embeddings.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ContentSimilarity (bidirectional max) was too sensitive for formulaic scientific records: a Raub butterfly entry sharing the species name and standard phrasing with a Kinabalu entry produced tokenSim=0.5, crossing the UPDATE threshold and replacing the original. Jaccard (|A∩B|/|A∪B|) penalises texts that share domain vocabulary but have many distinct tokens (different facts). Same-domain different-location pairs now score ~0.28, falling below the 0.5 ADD threshold. Genuine one-word-change updates (SQLite→PostgreSQL) still score ~0.6 → UPDATE. ContentSimilarity is unchanged; bidirectional max remains correct for recall and keyword search.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ity>=0.7
Two bugs caused CONFLICT false positives on butterfly survey data:
1. "not" in negationWords fires on virtually all scientific text
("species not previously recorded", "not endemic to region").
Removed: only multi-word state-change phrases remain as signals.
2. Negation check fired at similarity>=0.5. At borderline similarity,
texts share domain vocabulary without being about the same subject.
Now only checked when similarity>=0.7.
Also updates guide.md: PDF/external-document facts must use --no-diff
since each document is a distinct authoritative source.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Member
LGTM. Thanks for the fix! One small thing worth considering after the Jaccard change: I’m okay with merging this PR as-is and handling the selection logic in a follow-up.
Contributor
Author
Thanks, will check your feedback and open another PR later.
Summary
Four related bugs in internal/search/diff.go caused mnemon remember to silently corrupt memory when storing scientific, survey, or other domain-repetitive text: a valid existing record was replaced with new content even though the two records described different facts.
All fixes include regression tests.
Bug 1 — Bare "not" in negation word list
negationWords included the single word "not". In scientific and research text this word appears constantly:
"species not previously recorded", "population not observed during the dry season"
Any two records containing "not" with similarity ≥ 0.5 were classified as CONFLICT, causing the existing record to be overwritten.
Fix: Remove bare "not" (and bare "不" in Chinese). Only multi-word, unambiguous state-change phrases remain: "no longer", "replaced", "switched from", etc.
Bug 2 — Negation check fires at similarity ≥ 0.5 (too low)
The negation scan ran at the ADD threshold (0.5). At borderline similarity, texts share domain vocabulary — same species, same field, same standard phrasing — without being about the same subject. A negation phrase in an unrelated sentence triggered CONFLICT.
Fix: Gate the negation check at similarity ≥ 0.7. Below that, shared vocabulary is noise, not a conflict signal.
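The two negation fixes can be pictured together with a minimal sketch. It is not the code in diff.go: `isConflict`, `hasStateChange`, and `stateChangePhrases` are hypothetical names, the phrase list holds only the examples quoted above, and checking both the existing and the incoming text is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// negationGate is the similarity level named in the PR description.
const negationGate = 0.7

// stateChangePhrases holds only the example phrases quoted in the PR
// description; the real negationWords list is assumed to be longer.
var stateChangePhrases = []string{"no longer", "replaced", "switched from"}

// hasStateChange reports whether text contains one of the state-change
// phrases. A bare "not" is not in the list, so it never matches.
func hasStateChange(text string) bool {
	lower := strings.ToLower(text)
	for _, phrase := range stateChangePhrases {
		if strings.Contains(lower, phrase) {
			return true
		}
	}
	return false
}

// isConflict flags a conflict only when the pair is similar enough to be
// about the same subject AND one of the texts signals a state change.
// Checking both texts is an assumption made for this sketch.
func isConflict(similarity float64, existing, incoming string) bool {
	if similarity < negationGate {
		return false // borderline similarity: shared vocabulary is noise
	}
	return hasStateChange(existing) || hasStateChange(incoming)
}

func main() {
	existing := "Species not previously recorded below 1000m."
	incoming := "Population not observed during dry season."
	fmt.Println(isConflict(0.55, existing, incoming))

	fmt.Println(isConflict(0.80,
		"The service uses SQLite.",
		"The service switched from SQLite to PostgreSQL."))
}
```

In this sketch the butterfly pair is rejected twice over: the similarity is below the gate, and neither text contains a listed phrase. The SQLite pair passes both checks.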
Bug 3 — Cosine dedup threshold too low for domain-dense embeddings (f78165f)
nomic-embed-text produces cosine similarity ~0.75 for same-domain different-fact pairs (e.g. two butterfly survey records at different locations). The old threshold of 0.70 let cosine override token similarity and classify distinct records as UPDATE.
Fix: Raise cosine threshold from 0.70 → 0.85. At that level, texts are genuinely near-identical.
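As a rough illustration of the threshold change, the sketch below builds a pair of fake embeddings with a controlled cosine of 0.75, in the spirit of the regression test mentioned in the commit message, and checks them against the new 0.85 gate. The function names and the two-dimensional vectors are assumptions for illustration, not the actual test fixtures.

```go
package main

import (
	"fmt"
	"math"
)

// cosineDedupThreshold is the new value from the PR description (was 0.70).
const cosineDedupThreshold = 0.85

// cosine returns the cosine similarity of two equal-length vectors,
// or 0 when the inputs are empty, mismatched, or zero-length in norm.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

// cosineConfirmsDuplicate is a hypothetical helper: cosine alone only
// confirms deduplication at or above the threshold.
func cosineConfirmsDuplicate(a, b []float64) bool {
	return cosine(a, b) >= cosineDedupThreshold
}

// fakePair returns two unit vectors whose cosine equals target
// (target must lie in [0, 1]).
func fakePair(target float64) ([]float64, []float64) {
	return []float64{1, 0}, []float64{target, math.Sqrt(1 - target*target)}
}

func main() {
	a, b := fakePair(0.75)
	fmt.Printf("cosine=%.2f duplicate=%v\n", cosine(a, b), cosineConfirmsDuplicate(a, b))
}
```

With the old 0.70 threshold the same pair would have been confirmed as a duplicate, which is the failure mode described above.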
Bug 4 — Token similarity uses bidirectional max, too sensitive for formulaic text (e22d67d)
ContentSimilarity computes max(forward, backward) overlap — formulaic scientific records sharing a species name and standard phrasing scored ~0.5, crossing the ADD threshold and triggering UPDATE.
Fix: Use Jaccard similarity (|A∩B|/|A∪B|) for dedup. It penalises texts that share vocabulary but differ in most tokens. Same-domain different-location pairs score ~0.28 (below ADD threshold); genuine one-word-change updates (SQLite→PostgreSQL) still score ~0.6 (UPDATE as intended). ContentSimilarity is unchanged — bidirectional max remains correct for recall and keyword search ranking.
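The difference between the two measures can be sketched as follows. This is not the project's JaccardSimilarity() or ContentSimilarity: `jaccardSketch`, `bidirectionalMaxSketch`, and the simple lowercase/split tokenizer are assumptions, so the scores it prints will not necessarily match the ~0.28 and ~0.6 figures quoted above.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenSet lowercases text, splits it on anything that is not a letter
// or digit, and returns the set of distinct tokens. The real tokenizer
// may differ (stop words, stemming, CJK handling).
func tokenSet(text string) map[string]struct{} {
	words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	set := make(map[string]struct{}, len(words))
	for _, w := range words {
		set[w] = struct{}{}
	}
	return set
}

// overlap counts the tokens present in both sets.
func overlap(a, b map[string]struct{}) int {
	n := 0
	for w := range a {
		if _, ok := b[w]; ok {
			n++
		}
	}
	return n
}

// jaccardSketch returns |A∩B| / |A∪B|, or 0 when both texts are empty.
func jaccardSketch(a, b string) float64 {
	sa, sb := tokenSet(a), tokenSet(b)
	shared := overlap(sa, sb)
	union := len(sa) + len(sb) - shared
	if union == 0 {
		return 0
	}
	return float64(shared) / float64(union)
}

// bidirectionalMaxSketch returns max(|A∩B|/|A|, |A∩B|/|B|).
func bidirectionalMaxSketch(a, b string) float64 {
	sa, sb := tokenSet(a), tokenSet(b)
	if len(sa) == 0 || len(sb) == 0 {
		return 0
	}
	shared := float64(overlap(sa, sb))
	forward := shared / float64(len(sa))
	backward := shared / float64(len(sb))
	if forward > backward {
		return forward
	}
	return backward
}

func main() {
	a := "Rajah Brooke's Birdwing observed at Kinabalu Park. Species not previously recorded below 1000m."
	b := "Rajah Brooke's Birdwing survey at Raub, Pahang. Population not observed during dry season."
	fmt.Printf("jaccard=%.2f bidirectional=%.2f\n",
		jaccardSketch(a, b), bidirectionalMaxSketch(a, b))
}
```

Because the union is never smaller than either set, the Jaccard score can never exceed the bidirectional max, which is why it is the stricter choice for dedup while the looser measure stays useful for recall.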
Reproduction
```
mnemon remember "Rajah Brooke's Birdwing observed at Kinabalu Park. Species not previously recorded below 1000m."
mnemon remember "Rajah Brooke's Birdwing survey at Raub, Pahang. Population not observed during dry season."
```
Files changed
| File | Change |
| --- | --- |
| internal/search/diff.go | Remove "not" / "不" from negation list; gate negation at ≥ 0.7; raise cosine threshold to 0.85; switch token dedup to Jaccard |
| internal/search/keyword.go | Add JaccardSimilarity() |
| internal/search/diff_test.go | Regression tests for the diff.go changes |
| internal/search/keyword_test.go | Unit tests for JaccardSimilarity |