Commit 7bc3fd7
Address Codex adversarial review #558 — three high-priority findings
Codex flagged three ways the recent subclass-plumbing work would
poison the merged-kg with semantically wrong relationships. All three
fixes ship together because they are interdependent (the loader
trust policy interacts with the placeholder fallback emit, and the
narrowMatch filter interacts with the get_parents() index).
Finding 1 [HIGH] — manual closeMatch rows promoted to canonical nodes
=====================================================================
Files: kg_microbe/utils/isolation_source_mapping_utils.py
       mappings/validate_isolation_source_mappings.py
The loader's _row_is_trusted() accepted any row tagged
``semapv:ManualMappingCuration`` regardless of predicate. That admitted
41 manually-curated ``skos:closeMatch`` rows, including:
* Catheter → NCIT:C50344 (Catheter Device) — device, not source
* Child → PATO:0001190 (juvenile) — quality, not source
* Humid → NCIT:C88206 (Humidity) — quality, not source
* Psychrophilic-<10°C → METPO:1000614 — phenotype class, not source
* Boreal → ENVO:01000174 (forest biome) — biome name mismatch
Tightened trust policy: substitution into the BacDive graph requires
``skos:exactMatch`` regardless of curator. closeMatch rows fall back
to placeholder isolation_source:* nodes. Two acceptable trust paths
within exactMatch: high-confidence auto-match OR manual curation.
Net effect: 207 → 158 trusted mappings; 49 closeMatch rows correctly
drop instead of poisoning the graph.
The standalone validator's _row_is_trusted() is updated to match
(test_validator_rules_match_loader enforces the parity).
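The tightened policy can be sketched roughly as below. This is a minimal illustration, not the repository's _row_is_trusted(): the function name, the SSSOM-style column names (predicate_id, mapping_justification, confidence), and the 0.9 cutoff for "high-confidence" are all assumptions.

```python
# Hypothetical sketch of the tightened trust check; not the repository's code.
EXACT_MATCH = "skos:exactMatch"
MANUAL_CURATION = "semapv:ManualMappingCuration"
# Assumed cutoff for a "high-confidence" auto-match; the real value is not
# stated in this commit message.
AUTO_MATCH_MIN_CONFIDENCE = 0.9


def row_is_trusted(row: dict) -> bool:
    """Trust a row only if it is exactMatch AND curated or high-confidence."""
    # Gate 1: the predicate must be exactMatch, regardless of who curated it.
    if row.get("predicate_id") != EXACT_MATCH:
        return False
    # Trust path A: manual curation.
    if row.get("mapping_justification") == MANUAL_CURATION:
        return True
    # Trust path B: auto-match with a high enough confidence score.
    try:
        confidence = float(row.get("confidence") or 0.0)
    except ValueError:
        return False
    return confidence >= AUTO_MATCH_MIN_CONFIDENCE


# A manually curated closeMatch row (like Catheter) is now rejected.
catheter = {
    "subject_label": "Catheter",
    "predicate_id": "skos:closeMatch",
    "object_id": "NCIT:C50344",
    "mapping_justification": MANUAL_CURATION,
}
# A manually curated exactMatch row (like Mammals) is still honored.
mammals = {
    "subject_label": "Mammals",
    "predicate_id": "skos:exactMatch",
    "mapping_justification": MANUAL_CURATION,
}
print(row_is_trusted(catheter), row_is_trusted(mammals))
```

Checking the predicate first keeps the rule easy to mirror in the standalone validator, since both trust paths sit behind the same gate.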
Finding 2 [HIGH] — bad MIM narrowMatch rows generate false subclass edges
==========================================================================
File: scripts/consolidate_chemical_mappings.py
MIM's auto_classify_ingredient_type pipeline produced 5 narrowMatch
rows where the chemistry on both sides is unrelated:
* MIM:Kh2po4 → CHEBI:32583 (KH2PO4 vs calcium sulfate dihydrate)
* MIM:Mncl2_X_2_H2o → CHEBI:30200 (MnCl2 vs kaempferol glycoside)
* MIM:Mncl2_X_4_H2o → CHEBI:30200
* MIM:Mncl2_anhydrous → CHEBI:30200
* MIM:D-Maltose_Monohydrate → CHEBI:233428 (maltose vs amiloride analog)
Without a filter, get_parents() exposed those rows to MediaDive's
new biolink:subclass_of emit path (commit f3a8199), which would
have made, for example, the maltose ingredient a subclass of an
unrelated amiloride analog in the merged-kg.
Added KNOWN_BAD_NARROWMATCH set in load_mediaingredientmech_sssom()
that drops these specific (subject_id, object_id) pairs at row-load
time. The filter is harmless once obsolete: when MIM upstream removes
the rows, it becomes a no-op for us. Verified: the regenerated unified
file has ``cas:6363-53-7 parents []``, and the parallel KH2PO4 and
MnCl2 hydrate entries likewise have no parents.
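A deny-list filter of this kind might look like the sketch below. The five pairs are the ones listed above; the helper name drop_known_bad_narrowmatch and the list-of-dicts row shape are assumptions for illustration, not the actual body of load_mediaingredientmech_sssom().

```python
# Hypothetical sketch of a row-load deny-list; not the repository's code.
KNOWN_BAD_NARROWMATCH = frozenset({
    ("MIM:Kh2po4", "CHEBI:32583"),
    ("MIM:Mncl2_X_2_H2o", "CHEBI:30200"),
    ("MIM:Mncl2_X_4_H2o", "CHEBI:30200"),
    ("MIM:Mncl2_anhydrous", "CHEBI:30200"),
    ("MIM:D-Maltose_Monohydrate", "CHEBI:233428"),
})


def drop_known_bad_narrowmatch(rows):
    """Return (kept_rows, dropped_count), skipping deny-listed pairs."""
    kept = []
    dropped = 0
    for row in rows:
        pair = (row.get("subject_id"), row.get("object_id"))
        if pair in KNOWN_BAD_NARROWMATCH:
            dropped += 1
            continue
        kept.append(row)
    return kept, dropped


# Illustrative rows: the first is deny-listed, the second uses made-up IDs.
rows = [
    {"subject_id": "MIM:D-Maltose_Monohydrate", "object_id": "CHEBI:233428"},
    {"subject_id": "MIM:Example_Ingredient", "object_id": "CHEBI:0000000"},
]
kept, dropped = drop_known_bad_narrowmatch(rows)
# Running the filter on already-clean rows drops nothing further.
kept_again, dropped_again = drop_known_bad_narrowmatch(kept)
print(len(kept), dropped, dropped_again)
```

Matching on exact (subject_id, object_id) pairs keeps the filter narrow: a corrected upstream mapping for the same subject passes through untouched.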
Finding 3 [MEDIUM] — blanket ENVO subclass_of for all isolation_source placeholders
====================================================================================
File: kg_microbe/transform_utils/bacdive/bacdive.py
The previous commit (959baa6) emitted
``isolation_source:* biolink:subclass_of ENVO:01000254`` for every
unmapped isolation_source placeholder. But the table intentionally
leaves labels like 'Human', 'Leaf-Phyllosphere', and
'host_animal_endotherm_intratissue' unmapped, and those are NOT
environmental materials — they're hosts / anatomy / niches. A blanket
ENVO parent would poison downstream reasoning over source type.
Removed the blanket subclass_of edge. Placeholders stay unparented
until a vetted host/anatomy/environment mapping lands in
mappings/isolation_source_to_ontology.tsv. The mediadive.solution →
CHEBI:60004, kgmicrobe.assay → MICRO:0000903, kgmicrobe.pathway →
GO:0008152 emits all stay (those are correct single-parent types).
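The resulting behavior can be pictured as a per-prefix parent table with no entry for isolation_source. The sketch below is illustrative only: the function name, the tuple edge shape, and the "prefix:local_id" node ID format are assumptions, and the real emit code in bacdive.py may be organized quite differently.

```python
# Hypothetical sketch of per-prefix typed parents; not the repository's code.
SUBCLASS_OF = "biolink:subclass_of"

# Prefixes that keep a single, correct parent type (from the text above).
# isolation_source is deliberately absent: placeholders stay unparented.
TYPED_PARENTS = {
    "mediadive.solution": "CHEBI:60004",
    "kgmicrobe.assay": "MICRO:0000903",
    "kgmicrobe.pathway": "GO:0008152",
}


def parent_edge(node_id: str):
    """Return a (subject, predicate, object) edge, or None if unparented."""
    prefix, sep, _local_id = node_id.partition(":")
    if not sep:
        # Not a CURIE-style ID; emit nothing.
        return None
    parent = TYPED_PARENTS.get(prefix)
    if parent is None:
        return None
    return (node_id, SUBCLASS_OF, parent)


print(parent_edge("isolation_source:Human"))
print(parent_edge("mediadive.solution:1"))
```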
Verified
========
* python mappings/validate_isolation_source_mappings.py → OK
* poetry run pytest tests/test_isolation_source_mapping_utils.py
      tests/test_chemical_mapping_utils.py
      tests/test_consolidate_chemical_mappings.py
      tests/test_metatraits.py → 110 passed
* Consolidator regenerates unified_ingredient_mappings.sssom.tsv.gz
  cleanly: 5 known-bad narrowMatch dropped at MIM load.
* test_loader_honors_manually_curated_fixes updated to match new
  policy (Plant→Viridiplantae was a closeMatch row that no longer
  qualifies; Mammals→Mammalia is exactMatch and still honored).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent f3a8199 commit 7bc3fd7
6 files changed
Lines changed: 93 additions & 40 deletions