Remove 5 narrowMatch rows where auto-classifier produced unrelated targets #2
Merged
realmarcin merged 2 commits into main on May 3, 2026
…rgets
The auto_classify_ingredient_type pipeline emitted 5 skos:narrowMatch rows where the chemistry on both sides is unrelated. The CONSIDER_SPECIFIC notes on those rows already documented the divergence; this commit removes them so the rows can no longer poison downstream consumers that consume MIM narrowMatch as parent-of relationships.
Removed:
* MIM:Kh2po4 → CHEBI:32583
  KH2PO4 (potassium dihydrogen phosphate) → calcium sulfate dihydrate. Wrong cation, wrong anion.
* MIM:Mncl2_X_2_H2o → CHEBI:30200
* MIM:Mncl2_X_4_H2o → CHEBI:30200
* MIM:Mncl2_anhydrous → CHEBI:30200
  All three MnCl2 forms (manganese chloride) → kaempferol 3-O-beta-D-glucoside (a flavonoid glycoside). Completely unrelated chemistry.
* MIM:D-Maltose_Monohydrate → CHEBI:233428
  D-Maltose monohydrate → 5-(N-ethyl-N-isopropyl) amiloride. The CHEBI via PubChem xref hit a different molecule entirely.
Also updated the comment on the kept MIM:D-Maltose_Monohydrate exactMatch cas:6363-53-7 row to drop the now-bogus reference to CHEBI:233428 as the "parent".
Discovered via: kg-microbe Codex adversarial review of Knowledge-Graph-Hub/kg-microbe#558. The kg-microbe consolidator (scripts/consolidate_chemical_mappings.py) added a KNOWN_BAD_NARROWMATCH filter as a safety net for these 5 specific (subject_id, object_id) pairs in commit 7bc3fd72; once this MIM PR merges, that filter becomes redundant.
The MnCl2_* and Kh2po4 ingredients no longer appear in MIM after this commit (they had no other rows). A follow-up curation pass should re-add them with the correct CHEBI parents:
- KH2PO4 → CHEBI:63036 (potassium dihydrogen phosphate)
- MnCl2 → CHEBI:63041 (manganese(II) chloride)
- MnCl2·2H2O → CHEBI:191374 (manganese(II) chloride dihydrate)
- MnCl2·4H2O → CHEBI:74489 (manganese(II) chloride tetrahydrate)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Removes a small set of erroneous skos:narrowMatch mappings from the MIM → CHEBI ingredient SSSOM mapping set to prevent downstream consumers from inferring incorrect parent/child relationships.
Changes:
- Removed 5 incorrect skos:narrowMatch rows where the target chemistry is unrelated.
- Updated the remaining MIM:D-Maltose_Monohydrate CAS skos:exactMatch row comment to remove the (now deleted) bogus CHEBI parent reference.
  MIM:D-Lysine D-Lysine skos:exactMatch CHEBI:16855 D-lysine obo:chebi.owl semapv:LexicalMatching MIM:CultureBotHT|MIM:curator=auto_classify_ingredient_type 2026-05-02 0.99
- MIM:D-Maltose_Monohydrate D-Maltose monohydrate skos:narrowMatch CHEBI:233428 5-(N-ethyl-N-isopropyl) amiloride obo:chebi.owl semapv:ManualMappingCuration MIM:CultureBotHT|MIM:CHEBI via PubChem (pubchem-xref)|MIM:curator=backfill_parent_terms 2026-05-02 0.9 Maltose monohydrate
- MIM:D-Maltose_Monohydrate D-Maltose monohydrate skos:exactMatch cas:6363-53-7 D-Maltose monohydrate registry:cas semapv:ManualMappingCuration MIM:CultureBotHT|MIM:CHEBI via PubChem (pubchem-xref)|MIM:curator=backfill_parent_terms 2026-05-02 0.99 Registry/identity row preserving cas:6363-53-7 alongside parent CHEBI:233428.
+ MIM:D-Maltose_Monohydrate D-Maltose monohydrate skos:exactMatch cas:6363-53-7 D-Maltose monohydrate registry:cas semapv:ManualMappingCuration MIM:CultureBotHT|MIM:CHEBI via PubChem (pubchem-xref)|MIM:curator=backfill_parent_terms 2026-05-02 0.99 Registry/identity row for D-Maltose monohydrate; cas:6363-53-7 is the canonical CAS RN. (Bogus parent CHEBI:233428 reference removed; see PR fix/remove-bad-narrow-match-rows-pr558.)
…lap)
Hardening pass for the Codex-#558 review findings, scoped to Rule A
("auto-classifier rows whose subject_label and object_label have zero
token overlap"). Companion to commit 887ee9f, which removed the 5
historical KNOWN_BAD_NARROWMATCH rows (KH2PO4, MnCl2_*, D-Maltose);
this commit ships the regression gate that prevents the same bug
class from re-entering the SSSOM.
scripts/validate_sssom_invariants.py (new, 317 lines, stdlib-only):
csv + argparse + re. No sssom-py dependency (kept lightweight for
CI). Mirrors the structure + exit-code convention of
kg-microbe/mappings/validate_isolation_source_mappings.py:
0 = pass, 1 = missing input, 2 = violation (CI-blocking).
Rule A: For every row whose `source` column contains ONLY the auto
curator tags (MIM:curator=auto_classify_ingredient_type or
MIM:curator=backfill_parent_terms), accept iff at least one of:
* confidence ≥ 0.95
* ≥ 1 shared significant token between subject_label and
object_label (lowercase, ≥ 3 chars, stop-words removed —
_tokens() + _STOP_TOKENS copied verbatim from
culturebotai-claw/scripts/foodon_pass.py:64-105)
* any non-curator-auto tag in the source column (i.e. a human
curator pipeline like MIM:cbclaw_review_fix, MIM:CultureBotHT,
MIM:CultureMech touched it)
* the row's subject's MIM YAML carries an independent
chemical_properties.cas_rn or pubchem_cid (also accepts the
legacy `pubchem` key per the original spec) — external registry
corroboration tier
Rejected rows go to mappings/needs_curator_review.tsv with a
trailing reject_reason column. `# TODO PR2` markers reserve the
Rule B1-B4 slots for the structural-invariants pass.
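
Rule A as described above can be sketched in a few lines. This is a minimal illustration and not the contents of scripts/validate_sssom_invariants.py: the stop-word set, the `|` separator for the source column, the function names, and the `has_registry_id` flag (standing in for the MIM YAML cas_rn / pubchem_cid lookup) are all assumptions made for the example.

```python
import re

# The two auto-curator tags named in Rule A.
AUTO_TAGS = frozenset({
    "MIM:curator=auto_classify_ingredient_type",
    "MIM:curator=backfill_parent_terms",
})

# Hypothetical placeholder; the real _STOP_TOKENS list is not shown in this PR.
STOP_TOKENS = frozenset({"and", "the", "with", "for"})


def tokens(label):
    """Lowercase alphanumeric tokens of length >= 3, minus stop-words."""
    return {
        t for t in re.findall(r"[a-z0-9]+", label.lower())
        if len(t) >= 3 and t not in STOP_TOKENS
    }


def rule_a_accepts(row, has_registry_id=False):
    """Return True if the row passes Rule A, False if it needs curator review."""
    # Assumption: tags in the source column are pipe-separated.
    tags = {t.strip() for t in row.get("source", "").split("|") if t.strip()}
    # Rows with no tags, or with any tag outside the auto-curator set,
    # are treated as out of scope / human-touched and accepted.
    if not tags or not tags <= AUTO_TAGS:
        return True
    if float(row.get("confidence") or 0) >= 0.95:
        return True
    if tokens(row.get("subject_label", "")) & tokens(row.get("object_label", "")):
        return True
    # Last tier: external registry corroboration (CAS RN or PubChem CID).
    return has_registry_id


# Synthetic row modelled on the removed D-Maltose mapping, auto-only source.
bad_row = {
    "subject_label": "D-Maltose monohydrate",
    "object_label": "5-(N-ethyl-N-isopropyl) amiloride",
    "source": "MIM:curator=backfill_parent_terms",
    "confidence": "0.9",
}
print(rule_a_accepts(bad_row))  # zero shared tokens, low confidence
```

In this sketch the synthetic row is rejected because "maltose"/"monohydrate" and "ethyl"/"isopropyl"/"amiloride" share nothing; adding a human-curator tag such as MIM:CultureBotHT to the source, or passing `has_registry_id=True`, flips it to accepted.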
mappings/needs_curator_review.tsv (new, 36 lines, header-only):
Pre-populated by running the validator against the current SSSOM.
Empty by design — every remaining auto-classifier row in the live
SSSOM carries at least one human-curator co-tag, and the 5
historical bad rows were already removed in 887ee9f. The validator
is now a pure regression gate; this file fills as future violations
arrive.
.github/workflows/qc-sssom.yaml (new, 41 lines):
Triggers on pull_request [main] and push [main] with paths matching
the SSSOM TSV, the triage TSV, or the validator script. Python
3.13. Uploads needs_curator_review.tsv as an artifact when present
for curator triage.
justfile (modified, +9 / -2):
+ qc-sssom recipe (just python3 scripts/validate_sssom_invariants.py)
Extended `qc:` composite from `qc: validate-all qc-evidence` to
`qc: validate-all qc-evidence qc-sssom` so the full QC sweep
exercises the new gate.
Smoke tests
- python3 scripts/validate_sssom_invariants.py against the current
SSSOM → exit 0, header-only triage TSV.
- just qc-sssom → same, exit 0.
- Synthetic mutation (MIM:2-Sulfobenzoic_Acid forced to
auto-classifier-only source + unrelated object_label + low
confidence) → exit 2; per-row stderr report names subject and
rejection tier.
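
The exit-code convention stated for the validator (0 = pass, 1 = missing input, 2 = violation) and exercised by the smoke tests can be illustrated with a small stdlib-only driver. This is a sketch under assumptions: `run_gate` and the `is_violation` callback are hypothetical names, and the code assumes the SSSOM TSV carries its metadata as leading `#` comment lines, which are skipped before parsing.

```python
import csv
import sys
from pathlib import Path

EXIT_OK = 0
EXIT_MISSING_INPUT = 1
EXIT_VIOLATION = 2  # CI-blocking


def run_gate(sssom_path, is_violation):
    """Apply is_violation(row) to every mapping row and return an exit code."""
    path = Path(sssom_path)
    if not path.is_file():
        print(f"missing input: {path}", file=sys.stderr)
        return EXIT_MISSING_INPUT
    with path.open(newline="", encoding="utf-8") as fh:
        # Skip '#'-prefixed metadata lines (assumed embedded SSSOM header).
        data_lines = (line for line in fh if not line.startswith("#"))
        rows = list(csv.DictReader(data_lines, delimiter="\t"))
    bad = [row for row in rows if is_violation(row)]
    for row in bad:
        # Per-row report on stderr naming subject and object.
        print(
            f"violation: {row.get('subject_id')} -> {row.get('object_id')}",
            file=sys.stderr,
        )
    return EXIT_VIOLATION if bad else EXIT_OK
```

A command-line wrapper would typically end with `sys.exit(run_gate(path, check))`, so that CI fails only on exit code 2 or on a missing input file.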
Once this lands, kg-microbe's KNOWN_BAD_NARROWMATCH filter at
consolidate_chemical_mappings.py:1211-1217 becomes redundant —
follow-up downstream PR.
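
For context, a pair-based safety net of the kind described for kg-microbe might look like the sketch below. The actual code at consolidate_chemical_mappings.py:1211-1217 is not shown in this PR, so the function name `drop_known_bad` and the row shape (dicts keyed by SSSOM column names) are assumptions; the five (subject_id, object_id) pairs are the ones removed in 887ee9f.

```python
# The five historical bad pairs removed from the MIM SSSOM.
KNOWN_BAD_NARROWMATCH = frozenset({
    ("MIM:Kh2po4", "CHEBI:32583"),
    ("MIM:Mncl2_X_2_H2o", "CHEBI:30200"),
    ("MIM:Mncl2_X_4_H2o", "CHEBI:30200"),
    ("MIM:Mncl2_anhydrous", "CHEBI:30200"),
    ("MIM:D-Maltose_Monohydrate", "CHEBI:233428"),
})


def drop_known_bad(rows):
    """Drop narrowMatch rows whose (subject_id, object_id) pair is known bad."""
    kept = []
    for row in rows:
        pair = (row.get("subject_id"), row.get("object_id"))
        is_bad = (
            row.get("predicate_id") == "skos:narrowMatch"
            and pair in KNOWN_BAD_NARROWMATCH
        )
        if not is_bad:
            kept.append(row)
    return kept
```

Because the filter matches exact pairs, it only covers mistakes that are already known, which is why the token-overlap gate is the more general protection.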
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
realmarcin added a commit that referenced this pull request May 3, 2026
D-Maltose_Monohydrate.yaml previously carried CHEBI:233428 (5-(N-ethyl-N-isopropyl) amiloride) as its ontology_mapping.ontology_id, a stale PubChem-xref hit on PubChem CID 23615261 that landed on a completely unrelated molecule. The bogus row was hand-removed from the published SSSOM in PR #2 (commit 887ee9f) but the YAML still contained the wrong CHEBI, so a future build_mim_ingredient_sssom would have re-emitted it.
Repointed at CHEBI:17306 (maltose) as the correct narrowMatch parent; D-maltose monohydrate is a hydrate of D-maltose, and CHEBI has no specific term for the monohydrate form. The cas:6363-53-7 identifier remains the canonical primary (registry/identity row preserved via the dual-emission path).
Also republishing mappings/ingredient_mappings.sssom.tsv via:
  python3 scripts/build_mim_ingredient_sssom.py
  python3 scripts/publish_sssom.py --apply
Net delta vs the post-PR-#2 state: +5 rows. The same 5 subjects that were hand-removed are back, this time with CORRECT CHEBI mappings:
  MIM:Kh2po4 → CHEBI:63036 (potassium dihydrogen phosphate)
  MIM:Mncl2_X_2_H2o → CHEBI:131395 (manganese(II) chloride dihydrate)
  MIM:Mncl2_X_4_H2o → CHEBI:86368 (manganese(II) chloride tetrahydrate)
  MIM:Mncl2_anhydrous → CHEBI:63041 (manganese(II) chloride)
  MIM:D-Maltose_Monohydrate → CHEBI:17306 (maltose, narrowMatch) + cas:6363-53-7 (exactMatch dual)
Made possible by the upstream defensive token-overlap gate added in culturebotai-claw commit (build_mim_ingredient_sssom: defensive token-overlap gate on CONSIDER_SPECIFIC). The build now refuses to honor a residual-P2.5 CONSIDER_SPECIFIC override when MIM's label and kg-microbe's label share zero significant tokens, so the 4 MnCl2/Kh2po4 ingredients (whose YAMLs already had correct CHEBIs) no longer get flipped to kg-microbe's wrong ones.
Validator state on the new SSSOM:
  python3 scripts/validate_sssom_invariants.py → exit 0 (Rule A clean)
  sssom validate → OK
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 4, 2026
Summary
The auto_classify_ingredient_type pipeline emitted 5 skos:narrowMatch rows where the chemistry on both sides is unrelated. The CONSIDER_SPECIFIC notes on each row already documented the divergence; this PR removes them so they can no longer be picked up as parent-of relationships by downstream SSSOM consumers.
Removed rows
- MIM:Kh2po4 (KH2PO4) → CHEBI:32583 (calcium sulfate dihydrate)
- MIM:Mncl2_X_2_H2o (MnCl2·2H2O) → CHEBI:30200 (kaempferol 3-O-β-D-glucoside)
- MIM:Mncl2_X_4_H2o (MnCl2·4H2O) → CHEBI:30200
- MIM:Mncl2_anhydrous (MnCl2) → CHEBI:30200
- MIM:D-Maltose_Monohydrate → CHEBI:233428 (5-(N-ethyl-N-isopropyl) amiloride)
Also updated the comment on the kept MIM:D-Maltose_Monohydrate exactMatch cas:6363-53-7 row to drop the now-bogus reference to CHEBI:233428 as the "parent".
Discovery
Found by Codex adversarial review of Knowledge-Graph-Hub/kg-microbe#558. The kg-microbe consolidator (scripts/consolidate_chemical_mappings.py) added a KNOWN_BAD_NARROWMATCH filter as a safety net in commit 7bc3fd72; once this PR merges, that filter becomes redundant and can be removed in a follow-up.
Follow-up curation
The MnCl2_* and KH2PO4 ingredients no longer appear in MIM after this PR (they had no other rows). A subsequent curation pass should re-add them with the correct CHEBI parents:
- KH2PO4 → CHEBI:63036 (potassium dihydrogen phosphate)
- MnCl2 → CHEBI:63041 (manganese(II) chloride)
- MnCl2·2H2O → CHEBI:191374 (manganese(II) chloride dihydrate)
- MnCl2·4H2O → CHEBI:74489 (manganese(II) chloride tetrahydrate)
Test plan
mappings/ingredient_mappings.sssom.tsv
🤖 Generated with Claude Code