
Remove 5 narrowMatch rows where auto-classifier produced unrelated targets#2

Merged
realmarcin merged 2 commits into main from fix/remove-bad-narrow-match-rows-pr558 on May 3, 2026


@realmarcin
Collaborator

Summary

The auto_classify_ingredient_type pipeline emitted 5 skos:narrowMatch rows where the chemistry on both sides is unrelated. The CONSIDER_SPECIFIC notes on each row already documented the divergence; this PR removes them so they can no longer be picked up as parent-of relationships by downstream SSSOM consumers.

Removed rows

| MIM subject | Bad target | Why |
| --- | --- | --- |
| MIM:Kh2po4 (KH2PO4) | CHEBI:32583 calcium sulfate dihydrate | Wrong cation AND wrong anion |
| MIM:Mncl2_X_2_H2o (MnCl2·2H2O) | CHEBI:30200 kaempferol 3-O-β-D-glucoside | Manganese chloride vs flavonoid glycoside |
| MIM:Mncl2_X_4_H2o (MnCl2·4H2O) | CHEBI:30200 | Same as above |
| MIM:Mncl2_anhydrous (MnCl2) | CHEBI:30200 | Same as above |
| MIM:D-Maltose_Monohydrate | CHEBI:233428 5-(N-ethyl-N-isopropyl) amiloride | CHEBI via PubChem xref hit an unrelated molecule |

Also updated the comment on the kept MIM:D-Maltose_Monohydrate exactMatch cas:6363-53-7 row to drop the now-bogus reference to CHEBI:233428 as the "parent".

Discovery

Found by Codex adversarial review of Knowledge-Graph-Hub/kg-microbe#558. The kg-microbe consolidator (scripts/consolidate_chemical_mappings.py) added a KNOWN_BAD_NARROWMATCH filter as a safety net in commit 7bc3fd72 — once this PR merges, that filter becomes redundant and can be removed in a follow-up.
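For readers who have not seen the downstream safety net, a deny-list filter of this kind can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/consolidate_chemical_mappings.py: the function name `drop_known_bad`, the row-as-dict shape, and the predicate check are assumptions. Only the five (subject_id, object_id) pairs come from this PR.

```python
# Hypothetical sketch of a deny-list filter; the real implementation in
# kg-microbe may be structured differently.
KNOWN_BAD_NARROWMATCH = frozenset({
    ("MIM:Kh2po4", "CHEBI:32583"),
    ("MIM:Mncl2_X_2_H2o", "CHEBI:30200"),
    ("MIM:Mncl2_X_4_H2o", "CHEBI:30200"),
    ("MIM:Mncl2_anhydrous", "CHEBI:30200"),
    ("MIM:D-Maltose_Monohydrate", "CHEBI:233428"),
})


def drop_known_bad(rows):
    """Yield mapping rows, skipping narrowMatch rows that are on the deny-list.

    Each row is assumed to be a dict with the standard SSSOM keys
    subject_id, predicate_id and object_id.
    """
    for row in rows:
        pair = (row["subject_id"], row["object_id"])
        if row["predicate_id"] == "skos:narrowMatch" and pair in KNOWN_BAD_NARROWMATCH:
            continue  # known-bad pair: do not pass it downstream
        yield row
```

Because the filter keys on exact pairs, a corrected row for the same subject (for example MIM:Kh2po4 → CHEBI:63036) would pass through untouched.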

Follow-up curation

The MnCl2_* and KH2PO4 ingredients no longer appear in the MIM SSSOM after this PR (they had no other rows). A subsequent curation pass should re-add them with the correct CHEBI parents:

  • KH2PO4 → CHEBI:63036 (potassium dihydrogen phosphate)
  • MnCl2 → CHEBI:63041 (manganese(II) chloride)
  • MnCl2·2H2O → CHEBI:191374 (manganese(II) chloride dihydrate)
  • MnCl2·4H2O → CHEBI:74489 (manganese(II) chloride tetrahydrate)

Test plan

  • 5 rows removed from mappings/ingredient_mappings.sssom.tsv
  • No other rows for the affected MIM:* subjects exist in the file (verified via grep)
  • D-Maltose_Monohydrate exactMatch row to cas:6363-53-7 retained (the cas xref is correct; only the bogus parent reference was bad)
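The grep check in the test plan could also be scripted. The sketch below is one possible way to do it with the standard library only; the helper names `read_sssom_rows` and `check_test_plan` are hypothetical, and it assumes the TSV uses the standard SSSOM column names and a leading `#` metadata block.

```python
import csv

# Subjects that should have no rows at all after this PR.
REMOVED_ENTIRELY = {
    "MIM:Kh2po4",
    "MIM:Mncl2_X_2_H2o",
    "MIM:Mncl2_X_4_H2o",
    "MIM:Mncl2_anhydrous",
}
# Subject that should keep exactly one row (the CAS exactMatch).
KEPT_SUBJECT = "MIM:D-Maltose_Monohydrate"
KEPT_ROW = ("skos:exactMatch", "cas:6363-53-7")


def read_sssom_rows(lines):
    """Parse SSSOM TSV lines into dicts, skipping '#' metadata lines."""
    data_lines = [ln for ln in lines if not ln.startswith("#")]
    return list(csv.DictReader(data_lines, delimiter="\t"))


def check_test_plan(rows):
    """Return a list of problems; an empty list means the test plan holds."""
    problems = []
    for row in rows:
        if row["subject_id"] in REMOVED_ENTIRELY:
            problems.append(f"unexpected row for {row['subject_id']}")
    maltose = [
        (r["predicate_id"], r["object_id"])
        for r in rows
        if r["subject_id"] == KEPT_SUBJECT
    ]
    if maltose != [KEPT_ROW]:
        problems.append(f"{KEPT_SUBJECT} rows are {maltose}, expected [{KEPT_ROW}]")
    return problems
```

To run it against the real file, open mappings/ingredient_mappings.sssom.tsv as text and pass the file handle to `read_sssom_rows`, then pass the result to `check_test_plan`.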

🤖 Generated with Claude Code

…rgets

The auto_classify_ingredient_type pipeline emitted 5 skos:narrowMatch
rows where the chemistry on both sides is unrelated. The CONSIDER_SPECIFIC
notes on those rows already documented the divergence; this commit
removes them so the rows can no longer poison downstream consumers
that treat MIM narrowMatch as parent-of relationships.

Removed:
* MIM:Kh2po4 → CHEBI:32583
  KH2PO4 (potassium dihydrogen phosphate) → calcium sulfate dihydrate.
  Wrong cation, wrong anion.

* MIM:Mncl2_X_2_H2o → CHEBI:30200
* MIM:Mncl2_X_4_H2o → CHEBI:30200
* MIM:Mncl2_anhydrous → CHEBI:30200
  All three MnCl2 forms (manganese chloride) → kaempferol 3-O-beta-D-
  glucoside (a flavonoid glycoside). Completely unrelated chemistry.

* MIM:D-Maltose_Monohydrate → CHEBI:233428
  D-Maltose monohydrate → 5-(N-ethyl-N-isopropyl) amiloride. The CHEBI
  via PubChem xref hit a different molecule entirely.

Also updated the comment on the kept
MIM:D-Maltose_Monohydrate exactMatch cas:6363-53-7 row to drop the
now-bogus reference to CHEBI:233428 as the "parent".

Discovered via: kg-microbe Codex adversarial review of
Knowledge-Graph-Hub/kg-microbe#558. The kg-microbe consolidator
(scripts/consolidate_chemical_mappings.py) added a KNOWN_BAD_NARROWMATCH
filter as a safety net for these 5 specific (subject_id, object_id)
pairs in commit 7bc3fd72; once this MIM PR merges, that filter
becomes redundant.

The MnCl2_* and Kh2po4 ingredients no longer appear in MIM after this
commit (they had no other rows). A follow-up curation pass should
re-add them with the correct CHEBI parents:
  - KH2PO4 → CHEBI:63036 (potassium dihydrogen phosphate)
  - MnCl2 → CHEBI:63041 (manganese(II) chloride)
  - MnCl2·2H2O → CHEBI:191374 (manganese(II) chloride dihydrate)
  - MnCl2·4H2O → CHEBI:74489 (manganese(II) chloride tetrahydrate)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 3, 2026 01:52

Copilot AI left a comment


Pull request overview

Removes a small set of erroneous skos:narrowMatch mappings from the MIM → CHEBI ingredient SSSOM mapping set to prevent downstream consumers from inferring incorrect parent/child relationships.

Changes:

  • Removed 5 incorrect skos:narrowMatch rows where the target chemistry is unrelated.
  • Updated the remaining MIM:D-Maltose_Monohydrate CAS skos:exactMatch row comment to remove the (now deleted) bogus CHEBI parent reference.


  MIM:D-Lysine D-Lysine skos:exactMatch CHEBI:16855 D-lysine obo:chebi.owl semapv:LexicalMatching MIM:CultureBotHT|MIM:curator=auto_classify_ingredient_type 2026-05-02 0.99
- MIM:D-Maltose_Monohydrate D-Maltose monohydrate skos:narrowMatch CHEBI:233428 5-(N-ethyl-N-isopropyl) amiloride obo:chebi.owl semapv:ManualMappingCuration MIM:CultureBotHT|MIM:CHEBI via PubChem (pubchem-xref)|MIM:curator=backfill_parent_terms 2026-05-02 0.9 Maltose monohydrate
- MIM:D-Maltose_Monohydrate D-Maltose monohydrate skos:exactMatch cas:6363-53-7 D-Maltose monohydrate registry:cas semapv:ManualMappingCuration MIM:CultureBotHT|MIM:CHEBI via PubChem (pubchem-xref)|MIM:curator=backfill_parent_terms 2026-05-02 0.99 Registry/identity row preserving cas:6363-53-7 alongside parent CHEBI:233428.
+ MIM:D-Maltose_Monohydrate D-Maltose monohydrate skos:exactMatch cas:6363-53-7 D-Maltose monohydrate registry:cas semapv:ManualMappingCuration MIM:CultureBotHT|MIM:CHEBI via PubChem (pubchem-xref)|MIM:curator=backfill_parent_terms 2026-05-02 0.99 Registry/identity row for D-Maltose monohydrate; cas:6363-53-7 is the canonical CAS RN. (Bogus parent CHEBI:233428 reference removed; see PR fix/remove-bad-narrow-match-rows-pr558.)
…lap)

Hardening pass for the Codex-#558 review findings, scoped to Rule A
("auto-classifier rows whose subject_label and object_label have zero
token overlap"). Companion to commit 887ee9f, which removed the 5
historical KNOWN_BAD_NARROWMATCH rows (KH2PO4, MnCl2_*, D-Maltose);
this commit ships the regression gate that prevents the same bug
class from re-entering the SSSOM.

scripts/validate_sssom_invariants.py (new, 317 lines, stdlib-only):
  csv + argparse + re. No sssom-py dependency (kept lightweight for
  CI). Mirrors the structure + exit-code convention of
  kg-microbe/mappings/validate_isolation_source_mappings.py:
    0 = pass, 1 = missing input, 2 = violation (CI-blocking).
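
The exit-code convention above might look roughly like this in a stdlib-only script. This is a sketch under assumptions, not the contents of validate_sssom_invariants.py: `run_gate` and the `find_violations` callback are hypothetical names used only to show how the three codes are chosen.

```python
import os
import sys

EXIT_PASS = 0           # no violations found
EXIT_MISSING_INPUT = 1  # the SSSOM file could not be found
EXIT_VIOLATION = 2      # at least one row failed a rule (CI-blocking)


def run_gate(sssom_path, find_violations):
    """Return the process exit code for one validation run.

    find_violations is a hypothetical callable that takes the path and
    returns a list of violation descriptions (empty when clean).
    """
    if not os.path.isfile(sssom_path):
        print(f"missing input: {sssom_path}", file=sys.stderr)
        return EXIT_MISSING_INPUT
    violations = find_violations(sssom_path)
    for violation in violations:
        print(f"violation: {violation}", file=sys.stderr)
    return EXIT_VIOLATION if violations else EXIT_PASS
```

A command-line entry point would then end with `sys.exit(run_gate(path, checker))`, so CI can tell a misconfigured path (1) apart from a real rule failure (2).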

  Rule A: For every row whose `source` column contains ONLY the auto
  curator tags (MIM:curator=auto_classify_ingredient_type or
  MIM:curator=backfill_parent_terms), accept iff at least one of:
    * confidence ≥ 0.95
    * ≥ 1 shared significant token between subject_label and
      object_label (lowercase, ≥ 3 chars, stop-words removed —
      _tokens() + _STOP_TOKENS copied verbatim from
      culturebotai-claw/scripts/foodon_pass.py:64-105)
    * any non-curator-auto tag in the source column (i.e. a human
      curator pipeline like MIM:cbclaw_review_fix, MIM:CultureBotHT,
      MIM:CultureMech touched it)
    * the row's subject's MIM YAML carries an independent
      chemical_properties.cas_rn or pubchem_cid (also accepts the
      legacy `pubchem` key per the original spec) — external registry
      corroboration tier

  Rejected rows go to mappings/needs_curator_review.tsv with a
  trailing reject_reason column. `# TODO PR2` markers reserve the
  Rule B1-B4 slots for the structural-invariants pass.
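
A rough sketch of the Rule A decision for a single row, under stated assumptions: the stop-word set below is a placeholder (the real _STOP_TOKENS is copied from foodon_pass.py and is not reproduced here), the column names `source`, `confidence`, `subject_label` and `object_label` are taken from the description above, tags are assumed to be `|`-separated, and `has_registry_id` stands in for the MIM YAML lookup.

```python
import re

AUTO_TAGS = {
    "MIM:curator=auto_classify_ingredient_type",
    "MIM:curator=backfill_parent_terms",
}
# Placeholder stop-word list, NOT the real _STOP_TOKENS.
STOP_TOKENS = {"the", "and", "with"}


def tokens(label):
    """Lowercase alphanumeric tokens of length >= 3, stop-words removed."""
    found = re.findall(r"[a-z0-9]+", label.lower())
    return {t for t in found if len(t) >= 3 and t not in STOP_TOKENS}


def rule_a_reject_reason(row, has_registry_id=False):
    """Return None if the row is accepted, else a reject_reason string."""
    tags = {t.strip() for t in row["source"].split("|") if t.strip()}
    if not tags or not tags <= AUTO_TAGS:
        return None  # a non-auto tag is present: Rule A does not apply
    if float(row["confidence"]) >= 0.95:
        return None  # high-confidence tier
    if tokens(row["subject_label"]) & tokens(row["object_label"]):
        return None  # labels share at least one significant token
    if has_registry_id:
        return None  # corroborated by cas_rn / pubchem_cid in the YAML
    return "rule_a: auto-only source, confidence < 0.95, no token overlap, no registry id"
```

With this shape, a row such as MnCl2·2H2O → kaempferol 3-O-beta-D-glucoside at confidence 0.9 and an auto-only source is rejected, while the same row carrying a MIM:CultureBotHT co-tag is accepted.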

mappings/needs_curator_review.tsv (new, 36 lines, header-only):
  Pre-populated by running the validator against the current SSSOM.
  Empty by design — every remaining auto-classifier row in the live
  SSSOM carries at least one human-curator co-tag, and the 5
  historical bad rows were already removed in 887ee9f. The validator
  is now a pure regression gate; this file fills as future violations
  arrive.

.github/workflows/qc-sssom.yaml (new, 41 lines):
  Triggers on pull_request [main] and push [main] with paths matching
  the SSSOM TSV, the triage TSV, or the validator script. Python
  3.13. Uploads needs_curator_review.tsv as an artifact when present
  for curator triage.

justfile (modified, +9 / -2):
  + qc-sssom recipe (just python3 scripts/validate_sssom_invariants.py)
  Extended `qc:` composite from `qc: validate-all qc-evidence` to
  `qc: validate-all qc-evidence qc-sssom` so the full QC sweep
  exercises the new gate.

Smoke tests
  - python3 scripts/validate_sssom_invariants.py against the current
    SSSOM → exit 0, header-only triage TSV.
  - just qc-sssom → same, exit 0.
  - Synthetic mutation (MIM:2-Sulfobenzoic_Acid forced to
    auto-classifier-only source + unrelated object_label + low
    confidence) → exit 2; per-row stderr report names subject and
    rejection tier.

Once this lands, kg-microbe's KNOWN_BAD_NARROWMATCH filter at
consolidate_chemical_mappings.py:1211-1217 becomes redundant —
follow-up downstream PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@realmarcin merged commit f84f053 into main on May 3, 2026
2 checks passed
@realmarcin deleted the fix/remove-bad-narrow-match-rows-pr558 branch on May 3, 2026 04:17
realmarcin added a commit that referenced this pull request May 3, 2026
D-Maltose_Monohydrate.yaml previously carried CHEBI:233428 (5-(N-ethyl-
N-isopropyl) amiloride) as its ontology_mapping.ontology_id — a stale
PubChem-xref hit on PubChem CID 23615261 that landed on a completely
unrelated molecule. The bogus row was hand-removed from the published
SSSOM in PR #2 (commit 887ee9f) but the YAML still contained the wrong
CHEBI, so a future build_mim_ingredient_sssom would have re-emitted it.

Repointed at CHEBI:17306 (maltose) as the correct narrowMatch parent;
D-maltose monohydrate is a hydrate of D-maltose, and CHEBI has no
specific term for the monohydrate form. The cas:6363-53-7 identifier
remains the canonical primary (registry/identity row preserved via
the dual-emission path).
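
The dual-emission idea (one narrowMatch row to the CHEBI parent plus one exactMatch row to the CAS registry number) can be illustrated with a small sketch. The function `emit_rows` and its parameters are hypothetical and are not the code in build_mim_ingredient_sssom.py; only the identifiers in the usage example come from this commit.

```python
def emit_rows(subject_id, subject_label, parent_chebi=None, parent_label=None, cas_rn=None):
    """Build SSSOM-style rows for one ingredient (illustrative only).

    A hydrate with no CHEBI term of its own gets a narrowMatch to its
    parent compound and, when a CAS RN is known, an exactMatch identity row.
    """
    rows = []
    if parent_chebi:
        rows.append({
            "subject_id": subject_id,
            "subject_label": subject_label,
            "predicate_id": "skos:narrowMatch",
            "object_id": parent_chebi,
            "object_label": parent_label,
        })
    if cas_rn:
        rows.append({
            "subject_id": subject_id,
            "subject_label": subject_label,
            "predicate_id": "skos:exactMatch",
            "object_id": f"cas:{cas_rn}",
            "object_label": subject_label,
        })
    return rows


maltose_rows = emit_rows(
    "MIM:D-Maltose_Monohydrate",
    "D-Maltose monohydrate",
    parent_chebi="CHEBI:17306",
    parent_label="maltose",
    cas_rn="6363-53-7",
)
```

Keeping the two rows separate lets a consumer use the CAS row for identity while treating the CHEBI row only as a parent relationship.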

Also republishing mappings/ingredient_mappings.sssom.tsv via:
  python3 scripts/build_mim_ingredient_sssom.py
  python3 scripts/publish_sssom.py --apply

Net delta vs the post-PR-#2 state: +5 rows — the same 5 subjects that
were hand-removed are back, this time with CORRECT CHEBI mappings:
  MIM:Kh2po4              → CHEBI:63036  (potassium dihydrogen phosphate)
  MIM:Mncl2_X_2_H2o       → CHEBI:131395 (manganese(II) chloride dihydrate)
  MIM:Mncl2_X_4_H2o       → CHEBI:86368  (manganese(II) chloride tetrahydrate)
  MIM:Mncl2_anhydrous     → CHEBI:63041  (manganese(II) chloride)
  MIM:D-Maltose_Monohydrate → CHEBI:17306 (maltose, narrowMatch) +
                              cas:6363-53-7 (exactMatch dual)

Made possible by the upstream defensive token-overlap gate added in
culturebotai-claw commit (build_mim_ingredient_sssom: defensive
token-overlap gate on CONSIDER_SPECIFIC). The build now refuses to
honor a residual-P2.5 CONSIDER_SPECIFIC override when MIM's label
and kg-microbe's label share zero significant tokens, so the 4
MnCl2/Kh2po4 ingredients (whose YAMLs already had correct CHEBIs)
no longer get flipped to kg-microbe's wrong ones.

Validator state on the new SSSOM:
  python3 scripts/validate_sssom_invariants.py → exit 0 (Rule A clean)
  sssom validate → OK

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>