Skip to content

Commit 7bc3fd7

Browse files
realmarcinclaude
andcommitted
Address Codex adversarial review #558 — three high-priority findings
Codex flagged three ways the recent subclass-plumbing work would poison the merged-kg with semantically wrong relationships. All three fixes ship together because they are interdependent (the loader trust policy interacts with the placeholder fallback emit, and the narrowMatch filter interacts with the get_parents() index). Finding 1 [HIGH] — manual closeMatch rows promoted to canonical nodes ===================================================================== File: kg_microbe/utils/isolation_source_mapping_utils.py mappings/validate_isolation_source_mappings.py The loader's _row_is_trusted() accepted any row tagged ``semapv:ManualMappingCuration`` regardless of predicate. That admitted 41 manually-curated ``skos:closeMatch`` rows, including: * Catheter → NCIT:C50344 (Catheter Device) — device, not source * Child → PATO:0001190 (juvenile) — quality, not source * Humid → NCIT:C88206 (Humidity) — quality, not source * Psychrophilic-<10°C → METPO:1000614 — phenotype class, not source * Boreal → ENVO:01000174 (forest biome) — biome name mismatch Tightened trust policy: substitution into the BacDive graph requires ``skos:exactMatch`` regardless of curator. closeMatch rows fall back to placeholder isolation_source:* nodes. Two acceptable trust paths within exactMatch: high-confidence auto-match OR manual curation. Net effect: 207 → 158 trusted mappings; 49 closeMatch rows correctly drop instead of poisoning the graph. The standalone validator's _row_is_trusted() is updated to match (test_validator_rules_match_loader enforces the parity). Finding 2 [HIGH] — bad MIM narrowMatch rows generate false subclass edges ========================================================================== File: scripts/consolidate_chemical_mappings.py MIM's auto_classify_ingredient_type pipeline produced 5 narrowMatch rows where the chemistry on both sides is unrelated: * MIM:Kh2po4 → CHEBI:32583 (KH2PO4 vs calcium sulfate dihydrate) * MIM:Mncl2_X_2_H2o → CHEBI:30200 (MnCl2 vs kaempferol glycoside) * MIM:Mncl2_X_4_H2o → CHEBI:30200 * MIM:Mncl2_anhydrous → CHEBI:30200 * MIM:D-Maltose_Monohydrate → CHEBI:233428 (maltose vs amiloride analog) Without this filter, get_parents() exposed those rows to MediaDive's new biolink:subclass_of emit path (commit f3a8199), which would have made the maltose ingredient a subclass of an unrelated amiloride analog in the merged-kg. Added KNOWN_BAD_NARROWMATCH set in load_mediaingredientmech_sssom() that drops these specific (subject_id, object_id) pairs at row-load time. The filter is idempotent — when MIM upstream removes the rows it becomes a no-op for us. Verified: regenerated unified file has ``cas:6363-53-7 parents []`` and the parallel cases for KH2PO4 and MnCl2 hydrates. Finding 3 [MEDIUM] — blanket ENVO subclass_of for all isolation_source placeholders ==================================================================================== File: kg_microbe/transform_utils/bacdive/bacdive.py The previous commit (959baa6) emitted ``isolation_source:* biolink:subclass_of ENVO:01000254`` for every unmapped isolation_source placeholder. But the table intentionally leaves labels like 'Human', 'Leaf-Phyllosphere', and 'host_animal_endotherm_intratissue' unmapped, and those are NOT environmental materials — they're hosts / anatomy / niches. A blanket ENVO parent would poison downstream reasoning over source type. Removed the blanket subclass_of edge. Placeholders stay unparented until a vetted host/anatomy/environment mapping lands in mappings/isolation_source_to_ontology.tsv. The mediadive.solution → CHEBI:60004, kgmicrobe.assay → MICRO:0000903, kgmicrobe.pathway → GO:0008152 emits all stay (those are correct single-parent types). Verified ======== * python mappings/validate_isolation_source_mappings.py → OK * poetry run pytest tests/test_isolation_source_mapping_utils.py tests/test_chemical_mapping_utils.py tests/test_consolidate_chemical_mappings.py tests/test_metatraits.py → 110 passed * Consolidator regenerates unified_ingredient_mappings.sssom.tsv.gz cleanly: 5 known-bad narrowMatch dropped at MIM load. * test_loader_honors_manually_curated_fixes updated to match new policy (Plant→Viridiplantae was a closeMatch row that no longer qualifies; Mammals→Mammalia is exactMatch and still honored). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent f3a8199 commit 7bc3fd7

6 files changed

Lines changed: 93 additions & 40 deletions

File tree

kg_microbe/transform_utils/bacdive/bacdive.py

Lines changed: 10 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -2890,30 +2890,23 @@ def run(self, data_file: Union[Optional[Path], Optional[str]] = None, show_statu
28902890
subject_id = ISOLATION_SOURCE_PREFIX + isol_source.lower()
28912891
# Only write a placeholder node when no ontology mapping exists;
28922892
# mapped CURIEs get their canonical node from the ontologies transform.
2893+
# Note: we deliberately do NOT add a blanket
2894+
# ``biolink:subclass_of ENVO:01000254`` edge for these
2895+
# placeholders. Many unmapped labels are NOT environmental
2896+
# materials (e.g. 'Human', 'Leaf-Phyllosphere',
2897+
# 'host_animal_endotherm_intratissue' are hosts / anatomy /
2898+
# niches), and a blanket parent assertion would poison the
2899+
# hierarchy for downstream reasoning over source type
2900+
# (Codex review #558). Placeholders stay unparented until a
2901+
# vetted host/anatomy/environment mapping is added to
2902+
# mappings/isolation_source_to_ontology.tsv.
28932903
node_writer.writerow(
28942904
self._create_node_row(
28952905
subject_id,
28962906
ISOLATION_SOURCE_CATEGORY,
28972907
isol_source,
28982908
)
28992909
)
2900-
# Type the residual placeholder under ENVO:01000254
2901-
# (environmental material) so OBO-aware reasoners can navigate
2902-
# from any unmapped isolation source back to the canonical
2903-
# ENVO hierarchy. Curated mappings (handled in the `if`
2904-
# branch above) get their canonical parent from the ontologies
2905-
# transform; only the kg-microbe-minted placeholders need this.
2906-
edge_writer.writerow(
2907-
[
2908-
subject_id,
2909-
"biolink:subclass_of",
2910-
"ENVO:01000254",
2911-
"rdfs:subClassOf",
2912-
self.source_name,
2913-
"knowledge_assertion",
2914-
"manual_agent",
2915-
]
2916-
)
29172910
# Write edge from the isolation source to organism
29182911
knowledge_level, agent_type = self._add_edge_metadata(
29192912
NCBI_TO_ISOLATION_SOURCE_EDGE, LOCATION_OF, organism_id

kg_microbe/utils/isolation_source_mapping_utils.py

Lines changed: 29 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -102,16 +102,38 @@ def _row_passes_family_check(row: Dict[str, str]) -> bool:
102102

103103

104104
def _row_is_trusted(row: Dict[str, str]) -> bool:
105-
"""Apply the trust policy described in the module docstring."""
105+
"""
106+
Apply the trust policy described in the module docstring.
107+
108+
Substitution into the BacDive graph requires ``skos:exactMatch`` —
109+
i.e. the curator (or the auto-matcher's high-confidence pass) has
110+
asserted that the BacDive label and the ontology term denote the
111+
SAME entity. ``skos:closeMatch`` rows are NOT trusted for canonical
112+
node substitution because they only assert similarity; promoting
113+
them would connect organisms to devices, qualities, phenotype
114+
classes, etc., as if those were the source entity itself
115+
(Codex adversarial review #558 found 41 such bad-substitution
116+
candidates in the table — Catheter→NCIT 'Catheter Device',
117+
Child→PATO juvenile, Humid→NCIT humidity quality, etc.).
118+
119+
Two acceptable trust paths, both requiring exactMatch:
120+
1. Auto-matcher hit with high confidence
121+
(``skos:exactMatch`` + ``confidence == 'high'``).
122+
2. Manual curation
123+
(``skos:exactMatch`` + ``mapping_justification ==
124+
'semapv:ManualMappingCuration'``).
125+
126+
Anything else — closeMatch under any justification, low/medium
127+
auto-matcher confidence — is dropped, leaving the BacDive transform
128+
to emit its placeholder ``isolation_source:*`` node.
129+
"""
106130
predicate = (row.get("predicate_id") or "").strip()
131+
if predicate != "skos:exactMatch":
132+
return False
133+
107134
confidence = (row.get("confidence") or "").strip().lower()
108135
justification = (row.get("mapping_justification") or "").strip()
109-
110-
if predicate == "skos:exactMatch" and confidence == "high":
111-
return True
112-
if justification == "semapv:ManualMappingCuration":
113-
return True
114-
return False
136+
return (confidence == "high") or (justification == "semapv:ManualMappingCuration")
115137

116138

117139
def load_isolation_source_mappings(
-145 Bytes
Binary file not shown.

mappings/validate_isolation_source_mappings.py

Lines changed: 7 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -60,21 +60,17 @@ def _row_is_trusted(row: Dict[str, str]) -> bool:
6060
"""
6161
Mirror of the loader's trust policy.
6262
63-
The validator only fails on rows that *would* be loaded at runtime — a
64-
low-trust ``ols4_auto`` ``skos:closeMatch`` row is already dropped by the
65-
loader, so flagging it here would be noise. Promoting such a row by
66-
setting it to ``skos:exactMatch`` / high confidence / manual curation is
67-
what makes a family-mismatched mapping actually dangerous, and that is
68-
the case the validator is designed to catch.
63+
Both paths require ``skos:exactMatch``: closeMatch is never trusted for
64+
canonical node substitution because it only asserts similarity, not
65+
equivalence (see Codex review #558). Two trust paths within exactMatch:
66+
high-confidence auto-match, or manual curation.
6967
"""
7068
predicate = (row.get("predicate_id") or "").strip()
69+
if predicate != "skos:exactMatch":
70+
return False
7171
confidence = (row.get("confidence") or "").strip().lower()
7272
justification = (row.get("mapping_justification") or "").strip()
73-
if predicate == "skos:exactMatch" and confidence == "high":
74-
return True
75-
if justification == "semapv:ManualMappingCuration":
76-
return True
77-
return False
73+
return (confidence == "high") or (justification == "semapv:ManualMappingCuration")
7874

7975

8076
def iter_validation_failures(

scripts/consolidate_chemical_mappings.py

Lines changed: 32 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1198,8 +1198,27 @@ def load_mediaingredientmech_sssom(self, filepath: Path):
11981198
# seeding carried forward obsolete MIM xrefs).
11991199
current_mim_subjects: dict[str, str] = {}
12001200

1201+
# Known-bad (subject_id, object_id) narrowMatch pairs from MIM's
1202+
# auto_classify_ingredient_type pipeline. These were flagged in the
1203+
# 2026-05-02 Codex adversarial review of PR #558 — the chemistry on
1204+
# both sides is unrelated (e.g. KH2PO4 → calcium sulfate dihydrate;
1205+
# MnCl2 → kaempferol glycoside; D-Maltose → amiloride analog).
1206+
# Without this filter the runtime emits demonstrably false
1207+
# ``biolink:subclass_of`` edges via MediaDive's get_parents() path.
1208+
# Drop these specific rows here; the proper upstream fix is to
1209+
# remove them in MIM (sibling repo). The filter is idempotent — a
1210+
# later MIM update that removes the rows is a no-op for us.
1211+
KNOWN_BAD_NARROWMATCH = {
1212+
("MIM:Kh2po4", "CHEBI:32583"), # KH2PO4 vs CaSO4·2H2O
1213+
("MIM:Mncl2_X_2_H2o", "CHEBI:30200"), # MnCl2·2H2O vs kaempferol glycoside
1214+
("MIM:Mncl2_X_4_H2o", "CHEBI:30200"), # MnCl2·4H2O vs kaempferol glycoside
1215+
("MIM:Mncl2_anhydrous", "CHEBI:30200"), # MnCl2 vs kaempferol glycoside
1216+
("MIM:D-Maltose_Monohydrate", "CHEBI:233428"), # maltose vs amiloride analog
1217+
}
1218+
12011219
added = 0
12021220
skipped_unsupported = 0
1221+
skipped_known_bad = 0
12031222
curator_tags_seen: set[str] = set()
12041223
for _, row in df.iterrows():
12051224
object_id = row.get("object_id", "").strip()
@@ -1219,6 +1238,17 @@ def load_mediaingredientmech_sssom(self, filepath: Path):
12191238

12201239
predicate = row.get("predicate_id", "").strip()
12211240
subject_id = row.get("subject_id", "").strip()
1241+
1242+
# Drop the specific narrowMatch rows MIM's auto-classifier got
1243+
# wrong (see KNOWN_BAD_NARROWMATCH definition above). Done at
1244+
# row-load time rather than at export so the rows never enter
1245+
# the parent_relations capture in the first place.
1246+
if (
1247+
predicate in {"skos:narrowMatch", "skos:broadMatch"}
1248+
and (subject_id, object_id) in KNOWN_BAD_NARROWMATCH
1249+
):
1250+
skipped_known_bad += 1
1251+
continue
12221252
subject_label = row.get("subject_label", "").strip()
12231253
object_label = row.get("object_label", "").strip()
12241254
other = row.get("other", "").strip()
@@ -1326,7 +1356,8 @@ def load_mediaingredientmech_sssom(self, filepath: Path):
13261356

13271357
print(
13281358
f" Loaded {added} MIM SSSOM entries "
1329-
f"(skipped {skipped_unsupported} unsupported object_id prefix)"
1359+
f"(skipped {skipped_unsupported} unsupported object_id prefix"
1360+
f", {skipped_known_bad} known-bad narrowMatch)"
13301361
)
13311362
if curator_tags_seen:
13321363
print(

tests/test_isolation_source_mapping_utils.py

Lines changed: 15 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -61,11 +61,22 @@ def test_loader_drops_family_mismatched_rows(mappings):
6161

6262

6363
def test_loader_honors_manually_curated_fixes(mappings):
64-
"""Rows promoted by manual curation (family_mismatch_fix and other curator tags) are honored."""
65-
# 'Plant' was wrongly mapped to ENVO:01001255 (a process); manual fix points at Viridiplantae taxon
66-
assert mappings.get("plant") == ("NCBITaxon:33090", "Viridiplantae")
67-
# 'Mammals' was wrongly mapped to NCBITaxon:32525 (Theria, a subgroup); manual fix points at Mammalia
64+
"""Rows promoted by manual curation are honored — but ONLY when predicate is skos:exactMatch.
65+
66+
The 2026-05 Codex adversarial review tightened the trust policy so that
67+
``skos:closeMatch`` rows are no longer trusted for canonical node
68+
substitution, even when manually curated, because closeMatch only
69+
asserts similarity (not equivalence) and would connect organisms to
70+
devices, qualities, and phenotype classes as if those were the source
71+
entity itself.
72+
"""
73+
# 'Mammals' was wrongly mapped to NCBITaxon:32525 (Theria, a subgroup); manual fix points
74+
# at Mammalia with skos:exactMatch — honored under the new policy.
6875
assert mappings.get("mammals") == ("NCBITaxon:40674", "Mammalia")
76+
# 'Plant' was promoted to NCBITaxon:33090 (Viridiplantae) via skos:closeMatch — NOT honored
77+
# under the tightened policy (manual curation alone isn't enough; needs exactMatch).
78+
# The BacDive transform falls back to the isolation_source:plant placeholder.
79+
assert mappings.get("plant") is None
6980

7081

7182
def test_loader_rejects_low_trust_lexical_close_matches(mappings):

0 commit comments

Comments
 (0)