False CONFLICT classification on scientific/research text due to bare "not" in negation word list and missing similarity gate

Component: Duplicate detection (internal/search/diff.go — classifySuggestion)

Severity: Medium — silently corrupts memory by replacing a valid existing record with new content

Background
When you run mnemon remember, the dedup system compares the new text against existing records and classifies the result as ADD, UPDATE, or CONFLICT. A CONFLICT causes the existing memory entry to be flagged or replaced. The classification uses two signals: token similarity score and a list of "negation words" (phrases that suggest a contradiction, e.g. "replaced", "deprecated", "no longer").

---

Bug 1: "not" in the negation word list
The word "not" is included in negationWords. In everyday conversational text this is reasonable, but in scientific, research, or survey data it appears in virtually every sentence:
"species not previously recorded"
"not endemic to the region"
"population not observed during the dry season"
Any pair of entries that contain "not", even two unrelated survey records, will be classified as CONFLICT if their similarity score is above the add-threshold (0.5). The existing record gets overwritten even though the two entries describe different facts.
Fix: Remove bare "not" from negationWords. Only multi-word, unambiguous state-change phrases (e.g. "no longer", "replaced", "switched from") reliably signal a contradiction.
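
As a rough illustration of the proposed fix, the sketch below shows a negation list limited to the state-change phrases named in this issue, plus a case-insensitive substring scan. The names negationWords and containsNegation, the exact list contents, and the matching strategy are assumptions for illustration; the actual code in internal/search/diff.go may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// negationWords is a hypothetical post-fix list: only the state-change
// phrases mentioned in this issue, with bare "not" removed.
var negationWords = []string{"no longer", "replaced", "deprecated", "switched from"}

// containsNegation reports whether text contains any listed phrase.
// Assumption: matching is a case-insensitive substring search.
func containsNegation(text string) bool {
	lower := strings.ToLower(text)
	for _, phrase := range negationWords {
		if strings.Contains(lower, phrase) {
			return true
		}
	}
	return false
}

func main() {
	// Survey phrasing with a bare "not" no longer counts as a negation signal.
	fmt.Println(containsNegation("Species not previously recorded below 1000m elevation.")) // false
	// An explicit state change still does.
	fmt.Println(containsNegation("We no longer use Redis; switched from Redis to Postgres.")) // true
}
```

With this list, the survey examples above are not flagged, while explicit state changes such as "no longer" still are.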

---

Bug 2: Negation check fires at similarity ≥ 0.5 (too low)
The negation word scan runs on any text pair whose similarity crosses the add-threshold (0.5). At borderline similarity, two texts may simply share domain vocabulary — same field of study, same species names, same standard phrasing — without actually being about the same subject. Triggering CONFLICT at this level produces false positives on any corpus with repetitive structure (surveys, logs, changelogs, etc.).
Fix: Only run the negation check when similarity ≥ 0.7. At that level the texts are substantially similar and a negation word is a meaningful signal, not noise.
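
The gate described above could look roughly like the following. This is a minimal sketch, not the actual classifySuggestion: the function isConflict, the constant names, and the choice to scan only the new text are assumptions, and the ADD/UPDATE branches are deliberately left out because this issue does not describe them.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	addThreshold      = 0.5 // from this issue: where the negation scan currently starts
	negationThreshold = 0.7 // proposed: minimum similarity before negation is considered
)

// hasNegation is a hypothetical helper using the phrases named in this issue.
func hasNegation(text string) bool {
	lower := strings.ToLower(text)
	for _, phrase := range []string{"no longer", "replaced", "deprecated", "switched from"} {
		if strings.Contains(lower, phrase) {
			return true
		}
	}
	return false
}

// isConflict applies the proposed gate: a negation phrase only counts
// when the two texts are substantially similar.
// Assumption: only the new text is scanned for negation phrases.
func isConflict(similarity float64, newText string) bool {
	return similarity >= negationThreshold && hasNegation(newText)
}

func main() {
	text := "We no longer deploy on Fridays."
	fmt.Println(isConflict(0.55, text)) // false: above addThreshold but below the gate
	fmt.Println(isConflict(0.82, text)) // true: similar enough and contains "no longer"
	fmt.Println(isConflict(0.82, "Population observed during the dry season.")) // false: no negation phrase
}
```

Keeping the negation threshold as a separate constant from the add-threshold means the two can be tuned independently.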

---

Reproduction scenario
Store two butterfly survey records from different locations:

Bash:
mnemon remember "Rajah Brooke's Birdwing observed at Kinabalu Park, Sabah. Species not previously recorded below 1000m elevation."

mnemon remember "Rajah Brooke's Birdwing survey at Raub, Pahang. Population not observed during the dry season survey."


Expected: Both records stored as separate ADD entries (different locations, different facts).

Actual (before fix): The second remember is classified as CONFLICT and overwrites the first, because both texts contain "not" and share species-name tokens pushing similarity above 0.5.
