fix: dedup false positives on scientific and domain-specific text by chancsc · Pull Request #25 · mnemon-dev/mnemon

chancsc · 2026-05-17T10:32:44Z

Summary
Four related bugs in internal/search/diff.go caused mnemon remember to silently corrupt memory when storing scientific, survey, or other domain-repetitive text: a valid existing record was replaced with new content even though the two records described different facts.

All fixes include regression tests.

Bug 1 — Bare "not" in negation word list
negationWords included the single word "not". In scientific and research text this word appears constantly:
"species not previously recorded", "population not observed during the dry season"

Any two records containing "not" with similarity ≥ 0.5 were classified as CONFLICT, causing the existing record to be overwritten.

Fix: Remove bare "not" (and bare "不" in Chinese). Only multi-word, unambiguous state-change phrases remain: "no longer", "replaced", "switched from", etc.
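
To illustrate the effect of the narrower list, here is a minimal sketch. The names negationPhrases and hasNegationSignal are hypothetical, the list holds only the three phrases quoted above, and plain substring matching is an assumption; the real negationWords list and matcher in diff.go may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// negationPhrases is a hypothetical stand-in for negationWords after the fix:
// only unambiguous state-change phrases, with no bare "not".
var negationPhrases = []string{"no longer", "replaced", "switched from"}

// hasNegationSignal reports whether text contains any state-change phrase.
// Substring matching on lowercased text is an assumption of this sketch.
func hasNegationSignal(text string) bool {
	lower := strings.ToLower(text)
	for _, p := range negationPhrases {
		if strings.Contains(lower, p) {
			return true
		}
	}
	return false
}

func main() {
	// A bare "not" in descriptive scientific text no longer counts as a signal.
	fmt.Println(hasNegationSignal("Species not previously recorded below 1000m.")) // false
	// An explicit state change still does.
	fmt.Println(hasNegationSignal("We no longer use SQLite; we switched from it to PostgreSQL.")) // true
}
```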

Bug 2 — Negation check fires at similarity ≥ 0.5 (too low)
The negation scan ran at the ADD threshold (0.5). At borderline similarity, texts share domain vocabulary — same species, same field, same standard phrasing — without being about the same subject. A negation phrase in an unrelated sentence triggered CONFLICT.

Fix: Gate the negation check at similarity ≥ 0.7. Below that, shared vocabulary is noise, not a conflict signal.
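
The gate can be sketched roughly as follows. The classify function and its three-way outcome are a hypothetical simplification of the decision in diff.go (it ignores DUPLICATE and the cosine path); only the two threshold values come from this PR.

```go
package main

import "fmt"

const (
	addThreshold      = 0.5 // below this the record is stored as new (ADD)
	negationThreshold = 0.7 // negation phrases only count at or above this
)

// classify is a hypothetical, simplified version of the diff decision.
// Negation is consulted only once the texts are similar enough to plausibly
// describe the same subject.
func classify(similarity float64, hasNegation bool) string {
	switch {
	case similarity < addThreshold:
		return "ADD"
	case hasNegation && similarity >= negationThreshold:
		return "CONFLICT"
	default:
		return "UPDATE"
	}
}

func main() {
	fmt.Println(classify(0.30, true))  // ADD: too dissimilar, negation ignored
	fmt.Println(classify(0.55, true))  // UPDATE: borderline, negation ignored
	fmt.Println(classify(0.75, true))  // CONFLICT: similar and negated
	fmt.Println(classify(0.75, false)) // UPDATE: similar, no negation
}
```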

Bug 3 — Cosine dedup threshold too low for domain-dense embeddings (f78165f)
nomic-embed-text produces cosine similarity ~0.75 for same-domain different-fact pairs (e.g. two butterfly survey records at different locations). The old threshold of 0.70 let cosine override token similarity and classify distinct records as UPDATE.

Fix: Raise cosine threshold from 0.70 → 0.85. At that level, texts are genuinely near-identical.
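
A small sketch of why 0.75 used to slip through, using fake two-dimensional embeddings with a controlled cosine (in the spirit of the regression test, though not copied from it). The helper names unitPair and cosineConfirmsDedup are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

const cosineDedupThreshold = 0.85 // raised from 0.70 in this PR

// cosine returns the cosine similarity of two equal-length vectors,
// or 0 if they are empty, mismatched, or zero-length in norm.
func cosine(a, b []float64) float64 {
	if len(a) == 0 || len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// unitPair builds two fake unit vectors whose cosine similarity is c
// (hypothetical test helper; c must lie in [-1, 1]).
func unitPair(c float64) (a, b []float64) {
	return []float64{1, 0}, []float64{c, math.Sqrt(1 - c*c)}
}

// cosineConfirmsDedup reports whether cosine alone is high enough to
// treat two records as the same fact.
func cosineConfirmsDedup(a, b []float64) bool {
	return cosine(a, b) >= cosineDedupThreshold
}

func main() {
	a, b := unitPair(0.75) // same domain, different fact
	fmt.Printf("cosine=%.2f old(>=0.70)=%v new(>=0.85)=%v\n",
		cosine(a, b), cosine(a, b) >= 0.70, cosineConfirmsDedup(a, b))
}
```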

Bug 4 — Token similarity uses bidirectional max, too sensitive for formulaic text (e22d67d)
ContentSimilarity computes max(forward, backward) overlap — formulaic scientific records sharing a species name and standard phrasing scored ~0.5, crossing the ADD threshold and triggering UPDATE.

Fix: Use Jaccard similarity (|A∩B|/|A∪B|) for dedup. It penalises texts that share vocabulary but differ in most tokens. Same-domain different-location pairs score ~0.28 (below ADD threshold); genuine one-word-change updates (SQLite→PostgreSQL) still score ~0.6 (UPDATE as intended). ContentSimilarity is unchanged — bidirectional max remains correct for recall and keyword search ranking.
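
The difference between the two measures is easy to see on the reproduction texts. The sketch below is not the project's code: the function names and the tokenizer (lowercase, split on non-alphanumerics, no stop-word removal) are assumptions, so the scores differ somewhat from the ~0.28 quoted above, but the ordering is the same.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenSet lowercases text and splits on anything that is not a letter or
// digit. This tokenizer is an assumption; mnemon's own may differ.
func tokenSet(text string) map[string]bool {
	fields := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	set := make(map[string]bool, len(fields))
	for _, f := range fields {
		set[f] = true
	}
	return set
}

// intersectionSize counts the tokens present in both sets.
func intersectionSize(a, b map[string]bool) int {
	n := 0
	for t := range a {
		if b[t] {
			n++
		}
	}
	return n
}

// jaccard returns |A∩B| / |A∪B| over the token sets of a and b.
func jaccard(a, b string) float64 {
	sa, sb := tokenSet(a), tokenSet(b)
	inter := intersectionSize(sa, sb)
	union := len(sa) + len(sb) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

// bidirectionalMax returns max(|A∩B|/|A|, |A∩B|/|B|).
func bidirectionalMax(a, b string) float64 {
	sa, sb := tokenSet(a), tokenSet(b)
	if len(sa) == 0 || len(sb) == 0 {
		return 0
	}
	inter := float64(intersectionSize(sa, sb))
	fwd, bwd := inter/float64(len(sa)), inter/float64(len(sb))
	if fwd > bwd {
		return fwd
	}
	return bwd
}

func main() {
	kinabalu := "Rajah Brooke's Birdwing observed at Kinabalu Park. Species not previously recorded below 1000m."
	raub := "Rajah Brooke's Birdwing survey at Raub, Pahang. Population not observed during dry season."
	fmt.Printf("survey pair: jaccard=%.2f bidirectional=%.2f\n",
		jaccard(kinabalu, raub), bidirectionalMax(kinabalu, raub)) // 0.33 and 0.50

	fmt.Printf("one-word change: jaccard=%.2f bidirectional=%.2f\n",
		jaccard("use sqlite for storage", "use postgresql for storage"),
		bidirectionalMax("use sqlite for storage", "use postgresql for storage")) // 0.60 and 0.75
}
```

With this tokenizer the survey pair sits exactly on the 0.5 threshold under bidirectional max but well below it under Jaccard, while the one-word change stays above 0.5 under both.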

Reproduction
```
mnemon remember "Rajah Brooke's Birdwing observed at Kinabalu Park. Species not previously recorded below 1000m."
mnemon remember "Rajah Brooke's Birdwing survey at Raub, Pahang. Population not observed during dry season."
```

Files changed
internal/search/diff.go: Remove "not" / "不" from negation list; gate negation at ≥ 0.7; raise cosine threshold to 0.85; switch token dedup to Jaccard
internal/search/keyword.go: Add JaccardSimilarity()
internal/search/diff_test.go: Regression tests for the diff.go changes
internal/search/keyword_test.go: Unit tests for JaccardSimilarity

nomic-embed-text produces cosine ~0.75 for same-domain different-fact pairs (e.g. two butterfly survey records at different locations). The old threshold of 0.70 let cosine override token similarity, incorrectly classifying distinct insights as UPDATE and replacing the original. Raising to 0.85 ensures cosine only confirms deduplication when texts are genuinely near-identical.

Adds regression test with controlled 0.75-cosine fake embeddings.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ContentSimilarity (bidirectional max) was too sensitive for formulaic scientific records: a Raub butterfly entry sharing the species name and standard phrasing with a Kinabalu entry produced tokenSim=0.5, crossing the UPDATE threshold and replacing the original.

Jaccard (|A∩B|/|A∪B|) penalises texts that share domain vocabulary but have many distinct tokens (different facts). Same-domain different-location pairs now score ~0.28, falling below the 0.5 ADD threshold. Genuine one-word-change updates (SQLite→PostgreSQL) still score ~0.6 → UPDATE.

ContentSimilarity is unchanged; bidirectional max remains correct for recall and keyword search.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ity>=0.7

Two bugs caused CONFLICT false positives on butterfly survey data:
1. "not" in negationWords fires on virtually all scientific text ("species not previously recorded", "not endemic to region"). Removed: only multi-word state-change phrases remain as signals.
2. Negation check fired at similarity>=0.5. At borderline similarity, texts share domain vocabulary without being about the same subject. Now only checked when similarity>=0.7.

Also updates guide.md: PDF/external-document facts must use --no-diff since each document is a distinct authoritative source.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Grivn · 2026-05-17T12:50:39Z

LGTM. Thanks for the fix!

One small thing worth considering after the Jaccard change: Diff still uses matches[0].Suggestion, while matches are ordered by keyword score rather than final Similarity. So a lower-similarity ADD could potentially mask a later UPDATE/DUPLICATE.

I’m okay with merging this PR as-is and handling the selection logic in a follow-up.
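
A rough sketch of the selection issue described above, and of one way a follow-up could address it. The match struct, its fields, and overallSuggestion are hypothetical names, not the types in diff.go; returning ADD for an empty match list is also an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// match is a hypothetical stand-in for one diff candidate.
type match struct {
	ID           string
	KeywordScore float64 // order in which candidates arrive from keyword search
	Similarity   float64 // final similarity used for classification
	Suggestion   string  // per-match classification, e.g. ADD or UPDATE
}

// overallSuggestion picks the suggestion of the most similar match rather
// than of matches[0]. It sorts a copy so the caller's slice is left untouched.
func overallSuggestion(matches []match) string {
	if len(matches) == 0 {
		return "ADD" // assumption: nothing to compare against means a new record
	}
	sorted := append([]match(nil), matches...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Similarity > sorted[j].Similarity
	})
	return sorted[0].Suggestion
}

func main() {
	// Ordered by keyword score: the first entry is a weak ADD that would
	// mask the stronger UPDATE behind it if matches[0] were used directly.
	matches := []match{
		{ID: "m1", KeywordScore: 0.9, Similarity: 0.30, Suggestion: "ADD"},
		{ID: "m2", KeywordScore: 0.4, Similarity: 0.80, Suggestion: "UPDATE"},
	}
	fmt.Println(matches[0].Suggestion)      // ADD
	fmt.Println(overallSuggestion(matches)) // UPDATE
}
```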

chancsc · 2026-05-17T12:58:48Z

Thanks, will check your feedback and open another PR later

chancsc and others added 4 commits May 17, 2026 09:57

chore: update CHANGELOG for dedup false-positive fixes

7044690

Grivn merged commit bd0fbe9 into mnemon-dev:master May 17, 2026
1 check passed

Grivn mentioned this pull request May 17, 2026

False CONFLICT classification on scientific/research text due to bare "not" in negation word list and missing similarity gate #23

Closed

chancsc mentioned this pull request May 17, 2026

fix: sort diff matches by similarity before picking overall suggestion #26

Merged

chancsc deleted the fix/dedup-false-positives branch May 18, 2026 00:05

Grivn merged 4 commits into mnemon-dev:master from chancsc:fix/dedup-false-positives
