Team review sssom#558
Merged
realmarcin merged 56 commits into master on May 3, 2026
…idator extract_curie
- metatraits / metatraits_gtdb: extend `edge_header` with `value` and `unit` so quantitative-bin edges (temperature, NaCl, pH) preserve the original measurement alongside the binned METPO class. Threaded through `_classify_into_binned_range` and the temperature/salinity/pH classification methods. Recovering the underlying number is needed for the SSSOM-team review and downstream re-binning.
- constants: add VALUE_COLUMN to back the new edge column.
- scripts/consolidate_chemical_mappings.py: add `extract_curie` helper that preserves the original ontology prefix instead of fabricating `CHEBI:<digits>` from any numeric tail. Includes a small alias map (PUBCHEM.COMPOUND/PubChem/CAS-RN/etc.) so upstream prefix-spelling variants are normalised. Prevents the silent FOODON/UBERON/PubChem → CHEBI prefix-mangling regression documented in the audit trail.
- kg-release-diff: write reports to a timestamped artifact under `<skill>/reviews/` by default (with `--no-save` opt-out), matching the kg-model-review pattern.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
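A minimal sketch of what a prefix-preserving `extract_curie` could look like. The regex, the alias keys, and the normalised spellings (`pubchem.compound`, `cas`) are assumptions for illustration; the real helper in `scripts/consolidate_chemical_mappings.py` may differ.

```python
import re
from typing import Optional

# Assumed alias map: upstream prefix spelling (upper-cased) -> normalised prefix.
PREFIX_ALIASES = {
    "PUBCHEM.COMPOUND": "pubchem.compound",
    "PUBCHEM": "pubchem.compound",
    "CAS-RN": "cas",
}

# A CURIE here is "<prefix>:<local id>", where the prefix starts with a letter.
_CURIE_RE = re.compile(r"^\s*([A-Za-z][\w.\-]*):(\S+)\s*$")


def extract_curie(value: str) -> Optional[str]:
    """Return the CURIE with its original (or alias-normalised) prefix.

    Bare numbers return None, so a numeric tail is never turned into
    a fabricated CHEBI:<digits> identifier.
    """
    match = _CURIE_RE.match(value or "")
    if match is None:
        return None
    prefix, local_id = match.groups()
    prefix = PREFIX_ALIASES.get(prefix.upper(), prefix)
    return f"{prefix}:{local_id}"
```

With this shape, `FOODON:03420180` passes through unchanged, while a bare `3448` yields `None` and must be resolved by some other route.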
Reran scripts/consolidate_chemical_mappings.py against the refreshed MIM SSSOM (1,705 rows, up from 1,695 — adds 10 NCIT-mapped MediaDive ingredients newly created by the ingredient-mapping skill on the mim-queue source: Activated charcoal NCIT:C77524, Beef NCIT:C71932, Carrot NCIT:C72000, Fig NCIT:C71971, Ginger NCIT:C66725, Lemon NCIT:C72005, Phosphate buffer NCIT:C29321, etc.). mappings/ingredient_mappings.sssom.tsv (vendored MIM SSSOM) refreshed by sync_mim_sssom. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reran scripts/consolidate_chemical_mappings.py against the refreshed MIM SSSOM (1,723 rows, up from 1,705 — adds 18 chemicals MIM imported from kg-microbe's own out-of-SSSOM metatraits files via the ingredient-mapping skill's new --source kgm-metatraits). These chemistry-relevant mappings (e.g. Hydrogen sulfide, Indole, Siderophore, Plastic, Hydrocarbon, Egg yolk, Pyrite, Serum) lived only in kg-microbe's transform_utils/metatraits/mappings/ TSVs before. Now they're first-class MIM ingredients flowing back into the unified SSSOM via the priority-11 mediaingredientmech_reviewed lane. mappings/ingredient_mappings.sssom.tsv (vendored MIM SSSOM) refreshed by sync_mim_sssom. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MIM upstream fixed 4 chemical/ingredient mapping issues identified during careful per-row reconciliation review of metatraits:
- Casein: CHEBI:3448 (REMOVED from CHEBI) → FOODON:03420180
- Citrate (NEW): CHEBI:16947 (citrate parent anion)
- Milk (NEW): UBERON:0001913 (milk anatomy)
- Meat_Extract (NEW): FOODON:03315424 (meat extract)
MIM SSSOM grew from 1,723 to 1,726 rows; consolidator absorbed all 3 new rows + the Casein update without further changes.
After regeneration, kg-microbe-review reduces:
- chemical_mappings: AGREE 7→8, MISSING 1→0 (DIVERGE 1 unchanged; SSSOM-artifact P2.5 narrowMatch only)
- special_chemical_mappings: AGREE 149→174, MISSING 6→0, DIVERGE 39→20
The 20 remaining DIVERGE in special_chemical_mappings.tsv are kg-microbe-side action items (15 placeholder→authoritative-CHEBI/NCIT updates + 2 wrong-CHEBI fixes for arsenate and dihydrogen). They are not addressed in this commit; documented separately for a follow-up PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sweep Absorbs MIM commit 7b44151 — 4 new CultureMech-derived ingredient mappings (Disodium_Phosphate_Heptahydrate, EDTA_acid_Form, Ferric_Chloride_Hexahydrate, Sodium_Nitrate). MIM SSSOM grew 1726 → 1730 rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace 15 kgmicrobe.compound:* placeholders with authoritative CHEBI or NCIT IDs, and correct 2 wrong CHEBI IDs that resolved to a completely different chemical than the row's chemical_name. All 17 corrections sourced from the upstream MediaIngredientMech SSSOM.
Category A (placeholder → authoritative):
- Adenomycin, Avoparcin, Cetocycline, Dynemicin, Lydimycin, Steffimycin → NCIT
- Alanosine, Angustmycin, Ferroverdin, Kijanimicin, Miharamycin A, Monazomycin, Nocamycin, Rubradirin, Stallimycin → CHEBI
Category B (wrong CHEBI → correct):
- arsenate: CHEBI:29242 (arsenite(1-)) → CHEBI:29125 (arsenate(3-))
- dihydrogen: CHEBI:29356 (oxide(2-)) → CHEBI:18276 (dihydrogen)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The pre-fix ``extract_chebi_id`` regex (``re.search(r"(\d+)", v)``) used
to rewrite FOODON/UBERON/PubChem/CAS-RN values into ``CHEBI:<numeric_tail>``
when they appeared in the heterogeneous ``mapped`` column of
compound_mappings_strict.tsv. The earlier fix introduced ``extract_curie``
to preserve original prefixes for new ingestions, but three pollution
paths remained:
1. The legacy ``mappings/unified_chemical_mappings.tsv.gz`` baseline
re-seeded mangled rows on every run.
2. The SSSOM baseline (``unified_ingredient_mappings.sssom.tsv.gz``)
carried forward CHEBI:>=1M rows from earlier runs.
3. ``compound_mappings_strict.tsv`` itself contains pre-mangled
``CHEBI:<7-9 digit>`` values in the ``mapped`` column for some
ingredients (Tris-HCl, MnCl2, peptone, etc.).
Add ``is_mangled_chebi_id`` with three detection rules:
- leading-zero local part (FOODON/UBERON regex output)
- local part >= 1_000_000 (PubChem CIDs misrouted as CHEBI)
- data-driven blacklist replayed from compound_mappings_strict ``mapped``
cells, source-restricted to mediadive-style auto-mappers so curated
rows survive when their CHEBI id collides with a CAS-RN first-numeric
Wire the guard into both baseline loaders and into
``load_compound_mappings`` itself. Replaces the narrower ``CHEBI:0*``
check with the unified detector.
Retire the legacy entity-centric TSV outputs:
- delete ``mappings/unified_chemical_mappings.tsv.gz``
- delete ``scripts/migrate_chemical_mappings.py`` (one-time migration)
- drop ``load_existing_unified_tsv`` and the legacy_tsv_paths block in
``main()``; the SSSOM is now the single seeding source
- rewrite ``mappings/validate_manual_mappings.py`` to read the SSSOM
via a per-entity grouping helper
Run results (compound_mappings_strict still present):
113 legacy mangled entries dropped, 5 SSSOM-baseline mangles dropped,
5 source-loader pre-mangles skipped. Final SSSOM: 596,107 rows /
56 prefixes / zero PubChem/CAS-RN mangles.
Add 5 unit tests for ``is_mangled_chebi_id`` covering all three rules,
source-restriction safety, real-CHEBI passthrough, and non-CHEBI rejection.
Refresh README + chemical-mapping SKILL.md to document the SSSOM as the
single source of truth and the data-driven mangle detection.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
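The three detection rules above can be sketched as a single predicate. This is an illustration, not the committed implementation: the `AUTO_SOURCES` label set, the parameter names, and the choice to return `False` for non-CHEBI input are assumptions.

```python
from typing import AbstractSet

# Assumed source labels that count as "mediadive-style auto-mappers".
AUTO_SOURCES = frozenset({"mediadive"})


def is_mangled_chebi_id(
    curie: str,
    source: str = "",
    blacklist: AbstractSet[str] = frozenset(),
) -> bool:
    """Return True when a CHEBI CURIE looks like a mangled foreign identifier."""
    if not curie.startswith("CHEBI:"):
        return False  # only CHEBI-prefixed values can be CHEBI mangles
    local = curie.split(":", 1)[1]
    if not local.isdigit():
        return False
    # Rule 1: leading zero, typical of FOODON/UBERON ids pushed through the old regex.
    if local.startswith("0"):
        return True
    # Rule 2: local part >= 1_000_000, typical of PubChem CIDs misrouted as CHEBI.
    if int(local) >= 1_000_000:
        return True
    # Rule 3: data-driven blacklist, applied only to auto-mapper sources so that
    # curated rows whose real CHEBI id collides with a blacklisted value survive.
    return curie in blacklist and source in AUTO_SOURCES
```

Restricting rule 3 by source is the part that protects curated rows; rules 1 and 2 are purely structural and apply to every source.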
Add a ``--mappings`` / ``--mappings-only`` mode to the review skill so
every curation TSV the repo ships gets the same systematic check the
transform outputs already get.
Four file groups are validated:
- canonical schema (5 metatraits TSVs sharing the standard
subject_label / object_id / predicate_id / mapping_justification /
confidence layout)
- bespoke schemas (``enzyme_name_to_go.tsv``,
``special_chemical_mappings.tsv``)
- queues / audit / proposals
(``mediadive_unmapped_ingredients_to_curate.tsv``,
``culturebotai_reviewed_ingredients.tsv``)
- SSSOM (``ingredient_mappings.sssom.tsv``) — YAML metadata block +
SSSOM required columns + per-row CURIE / predicate / justification
namespace checks. Fix the metadata reader to preserve YAML
indentation (the prior ``lstrip`` collapsed ``curie_map:`` map
entries into a flat list and broke the parse).
Per-row checks include CURIE format, registered prefixes, deprecated
biolink targets, METPO references resolvable in ontologies output,
ontology-id resolvability across CHEBI/GO/EC/UBERON/ENVO/HP/MONDO/PATO/
PR/CL/FOODON/NCBITaxon/OMP, ``predicate_id`` restricted to the
``skos:`` namespace, ``mapping_justification`` restricted to ``semapv:``,
``confidence`` ∈ {high, medium, low}.
Cross-file: same ``subject_label`` mapped to conflicting ``object_id``
across canonical files.
Append a markdown "Curation upgrade report" with six sections:
1. Top unmapped MediaDive ingredients by occurrence (drives MIM /
CultureBotAI curation priority)
2. Cross-file mapping conflicts
3. Object IDs not resolvable in the ontologies output
4. Low-confidence canonical rows
5. Prefix normalization candidates (PUBCHEM.COMPOUND →
pubchem.compound, CAS-RN → cas)
6. CultureBotAI ingredient review queue status counts
This is the artifact handed to upstream curation repos
(CultureBotAI / MIM / CultureBotHT) to drive new mappings.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
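The metadata-reader fix mentioned above (preserve YAML indentation instead of calling `lstrip`) can be illustrated with a small stdlib-only sketch. The function name is hypothetical, and it assumes the SSSOM header is a leading block of `#`-prefixed lines; the returned text would then be handed to a YAML parser.

```python
from typing import Iterable, List


def read_sssom_metadata_yaml(lines: Iterable[str]) -> str:
    """Collect the leading '#'-prefixed block and return it as YAML text."""
    yaml_lines: List[str] = []
    for line in lines:
        if not line.startswith("#"):
            break  # first non-comment line is the TSV header row
        # Remove only the comment marker. Leading spaces carry the YAML
        # nesting, so stripping them would flatten the curie_map entries.
        yaml_lines.append(line[1:].rstrip("\n"))
    return "\n".join(yaml_lines)
```

Because only the single `#` is removed, nested `curie_map:` entries keep their indentation relative to the top-level keys.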
Resyncs the kg-microbe ingredient mapping artifact with MIM 8151a23 (republish following the chemistry backfill + evidence apply passes). Same 1,730 rows; the underlying mapping data is unchanged but the YAML provenance dates moved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…f_Heart, Tomato_Juice) Resyncs after MIM 2658f97 (FOODON pass --apply --high-only). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ecords Resyncs after MIM 8efa783.
…nt_Soil → ENVO:00001998)
Pull request overview
This PR continues the repository’s migration from the legacy unified chemical TSV to the unified ingredient SSSOM as the canonical mapping artifact, while also hardening chemical CURIE handling and extending some review/transform tooling around mappings and quantitative trait metadata.
Changes:
- Adds prefix-preserving CURIE extraction and mangled-CHEBI filtering to `consolidate_chemical_mappings.py`, plus focused unit tests for the helper functions.
- Removes the obsolete `migrate_chemical_mappings.py` script and updates mapping docs/validation tooling to use `unified_ingredient_mappings.sssom.tsv.gz`.
- Extends MetaTraits edge outputs with `value`/`unit`, updates curated special chemical mappings, and expands internal Claude review skills for mapping-file review/report generation.
Reviewed changes
Copilot reviewed 13 out of 16 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `tests/test_consolidate_chemical_mappings.py` | New unit tests for CURIE extraction and mangled-CHEBI detection helpers. |
| `scripts/migrate_chemical_mappings.py` | Deletes obsolete one-off migration script. |
| `scripts/consolidate_chemical_mappings.py` | Adds CURIE normalization/mangle filtering and removes legacy TSV reseeding path. |
| `mappings/validate_manual_mappings.py` | Switches manual audit script from legacy TSV parsing to grouped SSSOM parsing. |
| `mappings/unified_chemical_mappings.tsv.gz` | Legacy mapping artifact touched/removed as part of SSSOM migration. |
| `mappings/README.md` | Updates mapping documentation to describe SSSOM as source of truth. |
| `kg_microbe/transform_utils/metatraits_gtdb/metatraits_gtdb.py` | Extends MetaTraits-GTDB edge schema with value and unit. |
| `kg_microbe/transform_utils/metatraits/metatraits.py` | Emits quantitative provenance (value/unit) on binned phenotype edges. |
| `kg_microbe/transform_utils/metatraits/mappings/special_chemical_mappings.tsv` | Updates curated ontology mappings for specific chemicals/antibiotics. |
| `kg_microbe/transform_utils/constants.py` | Adds shared VALUE_COLUMN constant. |
| `.claude/skills/kg-release-diff/kg_release_diff.py` | Adds default review-path helper and new CLI options for report saving behavior. |
| `.claude/skills/kg-model-review/kg_model_review.py` | Adds mapping-file review mode, SSSOM/schema checks, and curation upgrade report generation. |
| `.claude/skills/kg-model-review/SKILL.md` | Documents new mapping-review capabilities and CLI options. |
| `.claude/skills/chemical-mapping/SKILL.md` | Updates chemical-mapping skill docs for SSSOM source-of-truth workflow. |
Four threads, all addressed in code:
1. ``.claude/skills/kg-release-diff/kg_release_diff.py`` — wire up the
advertised ``--no-save`` flag and ``--out`` default. Output policy is
now: ``--out PATH`` writes to that path; ``--no-save`` prints to stdout
only; otherwise auto-generate ``<skill>/reviews/<ts>_<old>_vs_<new>.md``
via the existing ``_default_review_path`` helper. Previously both flags
were declared but never consulted.
2. ``mappings/validate_manual_mappings.py`` — switch the SSSOM reader to
a streaming row-by-row pass. The prior ``[line for line in f if not
line.startswith('#')]`` materialised every non-comment line into a
Python list before parsing, an O(file_size) memory spike that would
eventually fail on the full unified mapping set (~600k rows).
3. ``tests/test_consolidate_chemical_mappings.py`` — add ``LoaderFiltering``
class with two regression tests that exercise the loader-side filter
paths (not just the ``is_mangled_chebi_id`` predicate). Uses tmpdir
fixtures to drive ``load_compound_mappings`` and ``load_existing_unified``
through clean rows, FOODON/UBERON-style mangles, PubChem-watermark
mangles, blacklist-with-auto-source rows (drop), and blacklist-with-
curated-source rows (keep). Catches typos in source-label matching or
skip logic that could silently discard legitimate mappings.
4. ``tests/test_metatraits.py`` + ``tests/resources/metatraits_fixture.jsonl``
— extend the existing transform smoke test to assert the new ``value``
and ``unit`` columns are present in the edge header and populated for
at least one quantitative phenotype edge. Adds a ``temperature growth``
fixture record (``majority_label='Median: 37.0 Celsius'``) and asserts
the binned-optimum edge carries ``value=37.0 unit=Celsius``. Catches
header/order mismatches that could ship unnoticed.
All 102 affected tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
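The streaming pass described in thread 2 can be sketched as a generator that filters comment lines lazily and feeds them to `csv.DictReader`. The function name and the `.gz` detection are assumptions; the point is only that no list of lines is ever materialised.

```python
import csv
import gzip
from typing import Dict, Iterator


def iter_sssom_rows(path: str) -> Iterator[Dict[str, str]]:
    """Yield SSSOM data rows one at a time, skipping '#' metadata lines."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        # Generator expression: lines are filtered as they are read,
        # so memory use stays flat regardless of file size.
        data_lines = (line for line in handle if not line.startswith("#"))
        yield from csv.DictReader(data_lines, delimiter="\t")
```

A caller that needs per-entity grouping can accumulate into a dict keyed by the subject column while iterating, which still avoids holding the raw lines in memory.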
The class docstring placed its summary on the first line after `"""`,
which D213 ("Multi-line docstring summary should start at the second
line") rejects. Insert the required line break and indentation after
the opening quotes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Surfaced from the bacdive isolation_source mapping audit as residual
microbial-trait labels with no existing ENVO/UBERON/PATO/MICRO term that
fits.
- METPO:1007092 xerophilic phenotype → subclass_of METPO:1007073
osmotic tolerance. Synonyms: xerophile, xerotolerant. Captures the
low-water-activity (aw < 0.85) niche.
- METPO:1007093 epibiont phenotype → subclass_of METPO:1000000.
Synonyms: epibiont, ectosymbiont. Captures the host-association
mode (lives on external surface), distinct from endosymbiont.
Skipped: 'Xerophytic' is a plant trait — belongs in PO/EO, not METPO.
Regenerate proposal artifacts: 37 categorical terms (was 35), 43 OWL
class rows (was 41). ROBOT template + ELK reasoner pass with no UNSAT
classes. All 27 metatraits + extract_metpo_proposals tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New file: mappings/isolation_source_to_ontology.tsv. Canonical 12-col
SSSOM-style schema (subject_label, object_id, predicate_id,
mapping_justification, confidence, …). Covers all 358
``bacdive.isolation_source:*`` nodes from the merged KG.
Pipeline:
1. Auto-mapper via OLS4 ``select`` endpoint with priority list
ENVO > UBERON > FOODON > MONDO > NCIT. Mapped 250/358 (70%).
2. CURIE-format + object_source fixes (13 ``MONDO_NNNN`` →
``MONDO:NNNN``; 72 object_source values corrected to actual term
prefix instead of queried-ontology name).
3. Synonym-aware re-mapper: switched from ``select`` (label-only) to
``search`` endpoint (label + synonym), added label-variant
generation (lowercase, hyphen → space, plural → singular,
comma-split, suffix tokens). Lifted coverage 70% → 94%.
4. Manual review: dropped 5 corrupt rows (TSV bled in description /
URL text); applied 21 row-level corrections after row-by-row audit
flagged factually wrong matches (e.g. Boreal → UBERON:8910010
stomatogastric nerve when target is ENVO:01000174 forest biome;
Catheter → NCIT:C78232 catheter-related infection when target is
NCIT:C50344 catheter device; Reptilia → NCIT:C158048 reptilian
glycan when target is NCBITaxon:8504; Stem-Branch → ENVO:00000029
watercourse when target is PO:0009047 stem; Urethra →
UBERON:0001338 urethral gland when target is UBERON:0000057
urethra; etc).
Final state:
- exactMatch: 172 / closeMatch: 160 / unmapped: 26.
- 13 distinct ontologies: ENVO (105), UBERON (66), NCIT (38),
FOODON (25), NCBITaxon (27), MONDO (13), PATO (10), PO (7),
mesh (6), CHEBI (4), GO (2), METPO (2), plus 6 misc.
The 26 still-unmapped split into compound BacDive labels needing
decomposition (Cotton-other-fibres, Heated-Burned, …), generic
placeholders ('Other'), METPO proposal candidates already added in the
previous commit (Xerophilic, Epibiont, both will resolve once minted),
and host-modifier compounds.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
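Two of the pipeline steps above lend themselves to a short sketch: the `MONDO_NNNN` → `MONDO:NNNN` CURIE fix and the label-variant generation used before querying the `search` endpoint. Function names and the exact variant rules (for example, the naive plural handling) are assumptions, not the code used for this mapping file.

```python
import re
from typing import List


def normalise_obo_id(obo_id: str) -> str:
    """Rewrite an underscore-style id such as MONDO_0005227 as a CURIE."""
    return re.sub(r"^([A-Za-z]+)_(\d+)$", r"\1:\2", obo_id)


def label_variants(label: str) -> List[str]:
    """Generate search variants for a BacDive-style label, in priority order."""
    seen = set()
    variants: List[str] = []

    def add(candidate: str) -> None:
        candidate = " ".join(candidate.split())  # collapse whitespace
        if candidate and candidate not in seen:
            seen.add(candidate)
            variants.append(candidate)

    base = label.lower()
    add(base)                               # lowercase
    spaced = base.replace("-", " ")
    add(spaced)                             # hyphen -> space
    if spaced.endswith("s") and not spaced.endswith("ss"):
        add(spaced[:-1])                    # naive plural -> singular
    for part in spaced.split(","):
        add(part)                           # comma-split
    tokens = spaced.replace(",", " ").split()
    if len(tokens) > 1:
        add(tokens[-1])                     # suffix token
    return variants
```

Broader variants such as the suffix token are also the ones most likely to produce the lexical false hits that the manual review step had to correct, so they belong at the end of the priority order.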
Two high-severity findings from Codex review on PR #558:
1. Non-CURIE placeholders marked exactMatch/high (15 rows). Original OLS auto-mapper accepted GOLD-database hits whose ``obo_id`` was a bare label (``Anaerobic-digestor``, ``Bioremediation``, ``Cave-water``, ``Coalbed-water``, ``Defined-media``, ``Endosphere``, ``Engineered-product``, ``Industrial-production``, ``Lab-enrichment``, ``Lab-synthesis``, ``Phyllosphere``, plus a bare ``D011214``). 3 of these had real OBO targets and were rebound (``Indoor-Air`` → ENVO:01000855, ``Outdoor-Air`` → ENVO:01000829, ``Peat-moss`` → mesh:D044003); the other 12 had no clean target and are now correctly unmapped.
2. Semantic mismatches from lexical-only matching:
- ``Air-conditioner`` was NCIT:C196790 *Air Conditioner Lung disease*
- ``Clean-room`` was NCIT:C106896 *ADCS-ADL questionnaire item* → ENVO:03600000 cleanroom
- ``Thermal-spring`` was NCIT:C125898 *topical solution* → ENVO:00000051 hot spring
- ``Urogenital-tract`` was MONDO:0019356 *malformation* (a disease) → UBERON:0004122 genitourinary system
- ``Wastewater`` was ENVO:00002043 *wastewater treatment plant* → ENVO:00002001 waste water (the substance)
Plus descendant drift: ``Ankle`` (was nerve → ankle joint), ``Bladder`` (was lumen → bladder organ), ``Tooth`` (was placode → calcareous tooth), ``Tundra`` (was ``tundra mire`` → ``tundra``). ``Specimen``, ``Tree``, ``Waste``, ``Air-conditioner`` had no clean ontology target and are now unmapped.
3. CI validation: the file is now registered in kg-model-review's ``GROUP_A_CANONICAL`` (filename → directory dict), so ``poetry run python .claude/skills/kg-model-review/kg_model_review.py --mappings-only`` will:
- reject any non-CURIE ``object_id``,
- reject partial rows (mapped but missing predicate / justification),
- allow fully-blank rows as legal unmapped curation candidates,
- flag unregistered prefixes (extended STANDARD_PREFIXES with mesh, NCIT-adjacent, PRIDE, ExO, VariO, SNOMED, BTO, AGRO, FAO, OBI, AEO, GENEPIO, PCO, UO so the review only flags genuinely unknown prefixes).
Final state: 358 rows; 164 exactMatch / 152 closeMatch / 42 unmapped.
Validator: 0 errors, 1 warning (``Wound→UBERON:0006988`` not in local ontologies/nodes.tsv snapshot; real UBERON term, downstream-resolvable).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
UBERON has no 'wound' term; my prior closeMatch UBERON:0006988 was fabricated. The closest standard cross-domain term is mesh:D014947 'Wounds and Injuries'. After this fix the kg-model-review --mappings-only run is fully clean: 0 ERRORs, 0 WARNINGs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Codex adversarial review flagged that several rows in
mappings/isolation_source_to_ontology.tsv mapped isolation sources
to MONDO disease terms — semantically wrong (MONDO models
diseases; isolation sources are where an organism was found).
Data fixes (12 rows; curator=codex_review_fix_v2):
Abort MONDO:0041526 → unmapped
(was 'pregnancy disorder with abortive outcome';
abortion-as-event has no clean isolation-source
ontology)
Abscess MONDO:0005227 → UBERON:0006548 (abscess)
(UBERON has abscess as tissue/structure)
Canker MONDO:0005318 → unmapped
(was 'canker sore'; canker as plant lesion no
clean ontology)
Cystic-fibrosis MONDO:0009061 → unmapped
(CF context isn't itself an isolation source —
real sources are CF-patient lung/sputum)
Disease MONDO:0000001 → unmapped (too generic)
Heavy-metal MONDO:0023305 → CHEBI:25555 (monoatomic ion)
(was 'heavy metal poisoning'; chemical class is
the right scope)
Host MONDO:0013730 → unmapped
(was 'graft versus host disease'; 'host' as
isolation source is too generic)
Iron-mat MONDO:0017988 → ENVO:01000110 (microbial mat)
(was 'multifocal atrial tachycardia' — matched
on the 'MAT' abbrev; iron-mat is microbial mat)
Meningitis MONDO:0021108 → unmapped
(disease context; real sources are CSF/meninges)
Mycosis MONDO:0009691 → unmapped
(was 'mycosis fungoides'; generic mycosis no
clean ontology term)
Tick MONDO:0025294 → NCBITaxon:6939 (Ixodida)
(was 'tick-borne disease'; ticks are NCBITaxon)
Tuberculosis MONDO:0018076 → unmapped
(disease context; real sources are
lung/sputum from TB patients)
CI workflow (.github/workflows/validate-isolation-source.yaml):
Checks out culturebotai-claw alongside this repo on every PR
that touches the TSV; runs claw's validate_isolation_source_mapping.py
which enforces:
- CURIE format on every non-empty object_id
- object_source.upper() == prefix.upper()
- SKOS predicate vocabulary
- semapv: justification vocabulary
- confidence ∈ {high, medium, low}
- ontology category allowlist (no MONDO/DOID/HP)
- NCIT/mesh label-keyword warnings
- empty object_id ⇒ empty object_source/predicate
After fixes the validator reports 0 errors / 1 warning (the
remaining Biopsy → NCIT:C15189 'Biopsy Procedure' is borderline
acceptable — biopsy specimens ARE valid isolation sources, just
labeled as the procedure).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
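The structural rules listed for the external validator can be approximated in a few lines. The real `validate_isolation_source_mapping.py` is not shown in this PR, so everything below (function name, error strings, the use of a deny-set rather than a full category allowlist) is an assumption made for illustration.

```python
import re
from typing import Dict, List

# Assumed stand-in for the "ontology category allowlist (no MONDO/DOID/HP)".
DISALLOWED_PREFIXES = frozenset({"MONDO", "DOID", "HP"})
ALLOWED_CONFIDENCE = frozenset({"high", "medium", "low"})
CURIE_RE = re.compile(r"^[A-Za-z][\w.]*:\S+$")


def validate_row(row: Dict[str, str]) -> List[str]:
    """Return a list of error messages for one mapping row (empty if clean)."""
    errors: List[str] = []
    object_id = row.get("object_id", "").strip()
    object_source = row.get("object_source", "").strip()
    predicate = row.get("predicate_id", "").strip()

    # Unmapped rows are legal, but must not carry leftover mapping fields.
    if not object_id:
        if object_source or predicate:
            errors.append("empty object_id but object_source/predicate is set")
        return errors

    if not CURIE_RE.match(object_id):
        return ["object_id is not a CURIE"]

    prefix = object_id.split(":", 1)[0]
    if object_source.upper() != prefix.upper():
        errors.append(f"object_source does not match prefix {prefix}")
    if prefix.upper() in DISALLOWED_PREFIXES:
        errors.append(f"disallowed ontology prefix: {prefix}")
    if not predicate.startswith("skos:"):
        errors.append("predicate_id is not in the skos: namespace")
    if not row.get("mapping_justification", "").startswith("semapv:"):
        errors.append("mapping_justification is not in the semapv: namespace")
    if row.get("confidence", "") not in ALLOWED_CONFIDENCE:
        errors.append("confidence must be high, medium, or low")
    return errors
```

Under these rules the pre-fix `Iron-mat` row pointing at MONDO:0017988 is rejected on prefix alone, which is the class of error the data fixes above address.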
Both checks failing on PR #558 (kg-microbe QC + Validate isolation_source) have been failing on team-review-sssom for every commit since 01b9931 because they depend on artifacts unavailable in the CI environment. Two independent fixes.
1. metatraits transform: fetch metpo.json from upstream when missing (kg_microbe/transform_utils/metatraits/metatraits.py)
In CI, data/raw/metpo.json is absent (it's a download.yaml artifact, not in the repo), so _load_metpo_lookups() and _load_metpo_binned_ranges() silently returned empty, breaking the discrete-trait pathway. With empty METPO label/synonym lookups, "gram positive" never resolved to METPO:1000698 in tests/test_metatraits.py::test_run_with_fixture, failing the assertion that 0%-pct_true edges are emitted.
New _resolve_metpo_json_path() helper:
* Returns RAW_DATA_DIR/metpo.json if it already exists (fast path).
* Otherwise fetches the upstream copy (https://raw.githubusercontent.com/berkeleybop/metpo/main/metpo.json) into RAW_DATA_DIR so subsequent loaders find it. The download is idempotent and shared between binned-ranges + lookups.
* On network failure, returns None and the caller short-circuits (same behavior as before, but with an explicit, useful error rather than a silent fallback that broke downstream tests).
Verified: hiding the local copy and rerunning tests/test_metatraits.py::test_run_with_fixture exercises the new fallback path and the test still passes.
2. validate-isolation-source workflow: soft-gate culturebotai-claw (.github/workflows/validate-isolation-source.yaml)
The structural validator lives in CultureBotAI/culturebotai-claw, which is not readable by this repo's GITHUB_TOKEN: actions/checkout returns 404 (Not Found) and the workflow fails at the checkout step. Made the checkout step `continue-on-error: true` and gated the structural-validate step on `steps.checkout_claw.outcome == 'success'`. When the repo becomes accessible, the soft gate becomes a hard gate again automatically.
The in-repo family-compatibility validator (mappings/validate_isolation_source_mappings.py) was promoted to run first as the *hard* gate; it's the one that actually catches semantic regressions like 'Foot' → UO:0010013 (units used for anatomy).
Workflow now emits a `::warning::` when the external validator is skipped, so the gap is visible in the Actions UI rather than silent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
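A hedged sketch of the fast-path / fetch / fail-soft behaviour described for `_resolve_metpo_json_path()`. The injectable `fetch` parameter, the `_download` helper, and the logging call are assumptions added so the logic can be exercised without network access; only the URL and the three behaviours come from the commit message.

```python
import logging
import urllib.request
from pathlib import Path
from typing import Callable, Optional

METPO_URL = "https://raw.githubusercontent.com/berkeleybop/metpo/main/metpo.json"


def _download(url: str, dest: Path) -> None:
    """Fetch url and write the payload to dest (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        dest.write_bytes(response.read())


def resolve_metpo_json_path(
    raw_data_dir: Path,
    fetch: Callable[[str, Path], None] = _download,
) -> Optional[Path]:
    """Return a usable metpo.json path, downloading it once if it is missing."""
    path = raw_data_dir / "metpo.json"
    if path.exists():
        return path  # fast path: local copy already present
    try:
        raw_data_dir.mkdir(parents=True, exist_ok=True)
        fetch(METPO_URL, path)
    except OSError as exc:  # urllib's URLError and timeouts subclass OSError
        logging.error("Could not fetch metpo.json from %s: %s", METPO_URL, exc)
        return None
    return path if path.exists() else None
```

Because the file is written into the raw data directory, a second caller (for example the binned-ranges loader after the lookups loader) hits the fast path instead of downloading again.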
…enrichment
Adds three small/mid-size ontologies to the transform and enriches the two PRIDE/PCO stub-prefix CURIEs that BacDive's isolation_source mapping table references but no transform was loading. Driven by the prefix-frequency analysis on the latest merged-kg.
Why each ontology:
* PO (Plant Ontology, 5.4 MB): 51 distinct IDs in BacDive isolation_source mappings (root, leaf, flower, rhizome, etc.). Currently emitted in 882 organism→PO edges with bare metadata.
* TAXRANK (Taxonomic Rank Vocabulary, 54 KB): 50 distinct rank IDs emitted directly by the NCBITaxon transform's OAK rank annotations. Tiny ontology, normalizes labels/definitions for nodes already present in merged-kg.
* MICRO (Microbial Conditions Ontology, 10.3 MB): 48 high-confidence MIM mappings already point at MICRO terms (Bacto-tryptone, Brain heart infusion, Tryptic soy broth, Nutrient broth No. 2, etc.). The unified chemical mappings file admits MICRO as of e9e6f1e, and ChemicalMappingLoader.find_chebi_by_name already returns MICRO IDs when appropriate, but the merged-kg I reviewed was built from a May 2 01:22 MediaDive transform output, *before* the May 2 14:01 unified-mappings regen. So MICRO emissions just need a fresh MediaDive run; no resolver code change required.
Why PRIDE / PCO get hardcoded enrichment instead of full ontology load:
* PRIDE: only 3 distinct IDs in the entire merged-kg (PRIDE:0000685 host body site, PRIDE:0000686 host body product, PRIDE:0001000 antibiotic treatment). All 18,752 organism→PRIDE edges fan out from these 3 stub classes. Loading the full PRIDE CV for 3 IDs is wasteful.
* PCO: 1 actively-used ID (PCO:1000004 microbial community). The other 7 PCO IDs in merged-kg leak in as xref propagation through ENVO/MONDO imports; they're not directly mapped from BacDive.
Implementation:
* download.yaml gains three new entries (po.owl / taxrank.owl / micro.owl) following the existing per-ontology comment pattern.
* ONTOLOGIES_MAP in ontologies_transform.py gains the corresponding three keys.
* isolation_source_mapping_utils.py gains STUB_ONTOLOGY_PREFIXES (frozenset of {"PRIDE", "PCO"}) and STUB_ONTOLOGY_CATEGORY ("biolink:OntologyClass"). These are the prefixes the BacDive transform should emit thin node rows for, since the ontologies transform won't.
* BacDive's isolation_source emit path (bacdive.py) now writes a thin node row for any mapped CURIE whose prefix is in the stub set, using the object_label from the mapping TSV. Loaded-ontology targets (UBERON, ENVO, ...) still get their node from the ontologies transform, so there is no double-emit.
Re-run scope before next merge:
* `kg download`: pull po.owl, taxrank.owl, micro.owl into data/raw/
* `kg transform -s ontologies`: emit nodes/edges for the new ontologies into data/transformed/ontologies/{po,taxrank,micro}_*.tsv
* `kg transform -s mediadive`: pick up the unified-mappings regen with MICRO targets (no code change, just stale-output refresh)
* `kg transform -s bacdive`: emit thin PRIDE/PCO nodes via the new STUB_ONTOLOGY_* path
* `kg merge`: final assembly
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
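The stub-node rule can be sketched as follows. The two constants are quoted from the commit message; the function name and the `id` / `category` / `name` column keys are assumptions (they follow common KGX node-file conventions, but the actual emit code in bacdive.py is not shown here).

```python
from typing import Dict, Optional

STUB_ONTOLOGY_PREFIXES = frozenset({"PRIDE", "PCO"})
STUB_ONTOLOGY_CATEGORY = "biolink:OntologyClass"


def thin_node_row(curie: str, object_label: str) -> Optional[Dict[str, str]]:
    """Return a minimal node row for stub-prefix CURIEs, else None.

    Targets from fully loaded ontologies (UBERON, ENVO, ...) return None so
    their node comes from the ontologies transform only, avoiding double-emit.
    """
    prefix = curie.split(":", 1)[0]
    if prefix not in STUB_ONTOLOGY_PREFIXES:
        return None
    return {
        "id": curie,
        "category": STUB_ONTOLOGY_CATEGORY,
        "name": object_label,  # label taken from the mapping TSV
    }
```

Returning `None` for non-stub prefixes lets the caller use one code path for every mapped CURIE and write a row only when one comes back.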
Some upstream OWL→JSON conversions emit synonym annotations without a literal value. The MICRO ontology has one such entry (MICRO:0003152 hasRelatedSynonym with 'pred' but no 'val'); KGX's obograph reader assumes every synonym carries 'val' and crashes with KeyError on the missing key, blocking the entire ontologies transform after taxrank. Adds _sanitize_obograph_synonyms() that rewrites the converted JSON in place to drop malformed synonym entries before KGX reads it. Runs once per ontology between robot's OWL→JSON conversion and KGX's transform. Well-formed synonyms are unchanged. The dropped count is logged so the upstream issue stays visible. Also registers infores knowledge sources for po, taxrank, micro that were added to ONTOLOGIES_MAP in the prior commit. Verified: sanitizer clears the 1 bad synonym in MICRO; rerunning 'kg transform -s ontologies' should now load all 16 ontologies. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
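A minimal sketch of the sanitising step, operating on an already-parsed obograph dict rather than rewriting the file in place as the commit describes. It assumes the usual obographs layout (`graphs` → `nodes` → `meta` → `synonyms`, each synonym carrying `pred` and `val`); the function name mirrors the commit but the body is illustrative.

```python
from typing import Any, Dict


def sanitize_obograph_synonyms(obograph: Dict[str, Any]) -> int:
    """Drop synonym entries without a 'val' key; return how many were dropped."""
    dropped = 0
    for graph in obograph.get("graphs", []):
        for node in graph.get("nodes", []):
            meta = node.get("meta") or {}
            synonyms = meta.get("synonyms")
            if not synonyms:
                continue
            # Keep only well-formed synonyms; a reader that indexes
            # synonym["val"] directly would raise KeyError on the rest.
            kept = [s for s in synonyms if isinstance(s, dict) and "val" in s]
            dropped += len(synonyms) - len(kept)
            meta["synonyms"] = kept
    return dropped
```

In a file-based workflow this would sit between `json.load` and `json.dump` on the converted ontology, with the returned count written to the log.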
Cleans the post-Codex / post-validator residue: 1 ERROR (Abscess →
HP, a disallowed phenotype ontology) and 6 WARNINGS where the lexical
hit had drifted into a too-specific descendant.
Errors fixed (1):
Abscess → HP:0025615 → mesh:D000038 'Abscess'
HP is a phenotype ontology (disallowed); MeSH D000038 is the
canonical Subject Heading for abscess as a clinical sample type.
Drift fixes — generic parent term (6):
Joint → UBERON:0008114 (joint of girdle, too narrow)
→ UBERON:0004905 'articulation' (synonym 'joint')
Mangrove → ENVO:02000138 (mangrove biome soil, only soil)
→ ENVO:01000181 'mangrove biome' (covers all samples)
Hot → ENVO:00000051 (hot spring, a specific feature)
→ ENVO:01000305 'high temperature environment'
Volcanic → ENVO:00000354 (volcanic field, a subtype)
→ ENVO:00000094 'volcanic feature' (parent landform)
Thoracic-segment → UBERON:0003827 (thoracic segment bone, only bone)
→ UBERON:0000915 'thoracic segment of trunk' (region)
Fermented → FOODON:00001098 (fermented apple beverage, false hit)
→ unmapped (no clean parent term)
The remaining 8 closeMatch rows previously flagged by the validator's
descendant-drift heuristic (Aquaculture, Biopsy, Bladder-stone,
Currency, Plaque, Sandy, Tooth, Water-treatment-plant) were manually
reviewed and confirmed as the canonical curator-intended mapping;
they are now whitelisted in the validator (claw side, separate
commit).
Validator state on this file:
errors: 0
warnings: 0
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dedicated workflow ran the in-repo family-compatibility validator on
PRs touching mappings/isolation_source_to_ontology.tsv (or the validator
/ loader sources). Every check it performed is already covered by the
regular QC pytest suite via tests/test_isolation_source_mapping_utils.py:
- test_validator_passes_on_committed_mapping_file — runs the validator
against the committed TSV and asserts zero failures
- test_validator_rules_match_loader — catches drift between validator
and runtime loader rule sets
- test_validator_flags_synthetic_family_mismatch — exercises the
failure path on a synthetic UO-anatomy mismatch
The standalone script at mappings/validate_isolation_source_mappings.py
remains in the repo and can still be invoked directly by curators or
tooling that wants validator output without the pytest harness.
The companion external validator hosted in CultureBotAI/culturebotai-claw
is org-private and was already failing to checkout in CI (404), making
that step a no-op.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…esh:C*
The kg-microbe special_chemical_mappings.tsv held kgmicrobe.compound:*
mints for ~107 antibiotic / secondary-metabolite traits ('produces:
setamycin', 'produces: rhodomycin A', etc.). For 38 of these, MIM
had since added authoritative mesh:C* identifiers (via its
auto_classify_ingredient_type and backfill_parent_terms passes).
The two sources disagreeing on the canonical id for the same chemical
is the kind of cross-file conflict the kg-model-review report flags:
'these are out-of-SSSOM, so they need explicit reconciliation (pick a
canonical per chemical)'.
This commit picks MIM's mesh:C* as the canonical id and rewrites the
38 affected rows in kg_microbe/transform_utils/metatraits/mappings/
special_chemical_mappings.tsv. The notes column gains a
'reconciled: was kgmicrobe.compound:X; MIM authoritative mapping → Y'
line so the swap stays auditable.
Why MIM wins: per the chemical-mapping skill priority table, MIM
(mediaingredientmech_reviewed) is priority 11 — the highest in the
unified consolidator and the canonical-naming source for ingredient
mappings. mesh:C* identifiers are in the published MeSH supplementary
chemical concept space and resolve to upstream definitions; kg-microbe
mints are stub identities only.
Side notes:
* The 38 corresponding kgmicrobe.compound:* entries in
kg_microbe/transform_utils/custom_curies.yaml are intentionally NOT
removed. They remain registered as cross-references because MIM
itself uses them as registry/identity rows (skos:exactMatch on the
kg-microbe side, with a parent mesh:C* row), and dropping them
here would orphan those MIM xref rows.
* The remaining 69 kgmicrobe.compound:* rows in the file have no
MIM-side mapping yet — they stay as kg-microbe mints until a future
MIM curation pass picks them up.
* No transform code changes needed. _load_special_chemical_mappings()
reads the ontology_id column directly, so the next metatraits run
picks up the swap automatically.
Verified locally:
* awk filter shows 69 kgmicrobe.compound rows remaining (was 107)
* tests/test_metatraits.py::test_run_with_fixture passes
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 27 out of 30 changed files in this pull request and generated 3 comments.
Every BacDive transform run was logging:
WARNING:kg_microbe.utils.isolation_source_mapping_utils:Dropping
family-mismatched mapping: 'Currency' → ENVO:00003896 ('currency note')
The mapping was actually semantically correct — currency (banknotes /
coins) is a legitimate fomite isolation source in microbiology, and
ENVO:00003896 'currency note' is the right ontology target for
microbe-on-currency studies. The warning fired only because
'currency note' had been added defensively to
BANNED_OBJECT_LABEL_SUBSTRINGS during the original family-mismatch
sweep, treating it as if it were a non-substrate stub. That entry
was overly aggressive.
Four coordinated changes:
1. kg_microbe/utils/isolation_source_mapping_utils.py — drop
'currency note' from BANNED_OBJECT_LABEL_SUBSTRINGS so the
runtime loader stops rejecting the row on family grounds.
2. mappings/validate_isolation_source_mappings.py — same removal
in the standalone CI validator. Required because
tests/test_isolation_source_mapping_utils.py::test_validator_rules_match_loader
asserts the two banned lists are equal.
3. mappings/isolation_source_to_ontology.tsv — promote the
Currency row from ols4_auto closeMatch / medium /
LexicalMatching to ManualMappingCuration / exactMatch / high so
the loader's trust policy honors it. Notes column records the
promotion rationale for audit.
4. tests/test_isolation_source_mapping_utils.py — the
test_loader_rejects_low_trust_lexical_close_matches test was
asserting that 'currency' stayed unmapped (used it as the
canonical "untrusted auto-match should be dropped" example).
Swapped to 'aquaculture' which is still an unpromoted ols4_auto
closeMatch row in the TSV.
Net effect on next merged-kg: ~233 organism → isolation_source:currency
edges become organism → ENVO:00003896 edges, with the ENVO node
supplying the canonical label, definition, and biolink:EnvironmentalFeature
category from the ontologies transform.
Verified locally: 9/9 tests pass, validator OK,
load_isolation_source_mappings() returns
('ENVO:00003896', 'currency note') for 'currency'.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up MIM@2527d95 ("Re-backfill chemistry + kg_microbe_node_id
post-dihydrate-fix"):
- Calcium_Chloride: CHEBI:86158 (dihydrate) → CHEBI:3312 (anhydrous)
- Sodium_Citrate_2: CHEBI:32142 (dihydrate) → CHEBI:53258 (anhydrous)
mappings/ingredient_mappings.sssom.tsv (vendored MIM) re-synced via
sync_mim_sssom() from the MIM sibling repo.
mappings/unified_ingredient_mappings.sssom.tsv.gz regenerated.
Final cross-repo state per claw `just kg-microbe-review`:
IN_SYNC: 1860 / 1860
CHEBI_DIVERGED: 0
STALE_IN_KGM: 0
MIM_LEGACY_IN_KGM: 0
metatraits chemical_mappings: AGREE=8 DIVERGE=1 (glucose form variant) MISSING=0
metatraits special_chemicals: AGREE=187 DIVERGE=7 MISSING=0
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three unresolved threads addressed:
1. constants.py:237: Revert RHEA_TO_EC_EDGE from biolink:close_match back
to biolink:enabled_by. close_match would have changed every rhea2ec edge
from "this reaction is enabled by this enzyme class" to "these
identifiers are approximately equivalent", which loses the directional
reaction-to-enzyme semantics that the Rhea loader and downstream
consumers expect. The kg-model-review domain/range warning that
motivated the close_match swap remains, but it's an artifact of
biolink:enabled_by being defined for gene-product → activity (not
activity-class → activity-class as Rhea↔EC is). The warning is
documented as accepted in a constants.py comment.
2. isolation_source_mapping_utils.py:118: Remove the unused
iter_validation_failures function from the loader module. It shared a
name with the standalone validator's helper but did NOT apply
_row_is_trusted, so any caller of this shared helper would have gotten
false validation failures that don't reflect runtime behavior. The
standalone validator at mappings/validate_isolation_source_mappings.py
has its own copy (which DOES apply trust), so the loader's version was
dead code with drift potential. Also dropped the unused Iterable import
and the __all__ entry.
3. metatraits.py:362: Remove _resolve_metpo_json_path() and its network
fallback. Production transforms should not reach external services at
runtime (Copilot's concern: it would mutate the checkout with a surprise
HTTP request and break offline / sandboxed CI/release environments). The
network call has been moved into tests/conftest.py as a session-scoped
autouse fixture ensure_metpo_json_for_tests(): same effect for pytest
runs (which was the only consumer of the fallback) but no longer touches
production code. The fixture also honors KG_MICROBE_TESTS_NO_NETWORK for
fully-offline test runs.
Verified locally:
* poetry env tests: 35/35 pass (test_isolation_source_mapping_utils +
test_metatraits)
* python mappings/validate_isolation_source_mappings.py → OK
* load_isolation_source_mappings() smoke test still resolves Currency
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eview
Three independent audit agents reviewed the BacDive isolation-source TSV
and four metatraits mapping files against OLS, OBO Foundry, ChEBI, GO,
EC/IUBMB, MeSH, NCIT, and primary literature. This commit applies the
high-confidence fixes; NEEDS_HUMAN_REVIEW items and the auto-generated
metpo_alias_mappings.tsv issue are deferred (the latter requires fixing
extract_metpo_proposals.py upstream).
mappings/isolation_source_to_ontology.tsv (16 rows):
Family / scope corrections:
* Gastrointestinal-tract NCIT:C34082 → UBERON:0005409 (3,320 edges)
* Lymph-node NCIT:C12745 → UBERON:0000029
* Inflammation NCIT:C3137 → mesh:D007249 (consistency w/ Abscess→mesh)
* Periodontal-pocket NCIT:C62547 → mesh:D010520
* Industrial-waste NCIT:C577 → ENVO:00002267 (consistency w/ Industrial-wastewater→ENVO)
* Dairy-product NCIT:C413 → FOODON:00001256
* Built-environment ExO:0000048 → mesh:D000076624 (1,324 edges; ExO is exposure-science chemicals)
* Zebrafish FOODON:03000002 → NCBITaxon:7955 (Danio rerio — host taxon, not food)
Unmapped (process / state / qualifier / vague — not a substrate):
* Treatment was AGRO:00000322 (Agronomy crop treatment, wrong family)
* Biodegradation was ENVO:06105014 (a process, not a site/material)
* Climate was ENVO:01001082 (long-term weather summary, not habitat)
* In-situ was NCIT:C14160 (medical 'carcinoma in situ', not habitat)
* Immunocompromised was NCIT:C14139 (host state, not source)
* Endosymbiont was VariO:0570 (Variation Ontology, wrong family)
* Co-culture was mesh:D018920 (research method, not sample type)
* Contaminant was NCIT:C84280 (too vague to map)
mappings/validate_isolation_source_mappings.py + isolation_source_mapping_utils.py:
Removed 'industrial waste material' from BANNED_OBJECT_LABEL_SUBSTRINGS
(same false-positive class as 'currency note' that was removed earlier —
the ENVO term IS a legitimate isolation source for Industrial-waste).
metatraits/special_chemical_mappings.tsv (5 rows):
* row 15 produces: DL-lactate CHEBI:16651 → CHEBI:24996
(was (S)-lactate / L-form only; DL needs generic parent)
* row 85 produces: poly(L-lysine) kgmicrobe.compound:* → CHEBI:61490
* row 188 produces: piericidin kgmicrobe.compound:* → CHEBI:138511
* rows 193,194 growth: soyton/proteose FOODON:03302071 → FOODON:00002992
(CRITICAL: FOODON:03302071 is "green kidney bean", NOT proteose peptone)
metatraits/enzyme_name_to_go.tsv (1 row):
* row 31 alpha-xylosidase GO:0046558 → GO:0061634
(CRITICAL: GO:0046558 is an arabinosidase EC 3.2.1.99 — wrong enzyme;
GO:0061634 EC 3.2.1.177 is the actual alpha-xylosidase)
metatraits/phenotype_mappings.tsv (1 row):
* row 10 voges-proskauer test METPO:1005017 → METPO:1005016
(was the 'positive' outcome variant; subject names the test itself)
Deferred:
* metpo_alias_mappings.tsv has ~15 over-generalizations to parent METPO
classes where specific child classes exist (rod-shaped, aerobic, BSL-1,
motile, etc.). Direct edits to that file get reverted because it is
auto-generated by scripts/extract_metpo_proposals.py. Filed as a
follow-up task to fix the extractor's synonym resolution.
* Several NEEDS_HUMAN_REVIEW items in the audit reports (Algae, Yeast
polyphyletic mappings; alpha-maltosidase EC precision; citrate
protonation state).
Each row updated has curator='kg_review_lit_check' and a notes-column
explanation citing the source of the correction.
Verified: 36/36 tests pass, isolation-source validator OK.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The May-1 custom-term subclassing review identified ~10K missing
biolink:subclass_of edges that would type kg-microbe-minted CURIEs
under their canonical OBO parents. This commit ships 4 of the 5
recommended emit-side changes (the 5th — surfacing MIM
skos:narrowMatch as subclass_of edges — is a multi-file plumbing
task and ships in a follow-up commit).
Each new edge is:
predicate biolink:subclass_of
relation rdfs:subClassOf
primary_knowledge_source <transform's source>
knowledge_level knowledge_assertion
agent_type manual_agent
1. mediadive.solution → CHEBI:60004 (mixture)
File: kg_microbe/transform_utils/mediadive/mediadive.py
Each MediaDive solution node now carries a subclass_of edge to
CHEBI:60004 (the canonical "mixture" parent). Approx 5,400 edges.
Schema: standard 9-col MediaDive edge (subject, predicate, object,
relation, source, knowledge_level, agent_type, value, unit).
2. kgmicrobe.assay → MICRO:0000903 (assay parent)
File: kg_microbe/transform_utils/bacdive/bacdive.py
After writing the 503 assay nodes (generate_assay_nodes), iterate
them and emit one subclass_of edge per node pointing at
MICRO:0000903. Pulls the entire kgmicrobe.assay:* namespace into
the MICRO ontology that ontologies_transform now loads. Approx 503 edges.
3. residual isolation_source:* → ENVO:01000254 (environmental material)
File: kg_microbe/transform_utils/bacdive/bacdive.py
In the placeholder fallback branch (when no isolation_source ↔
ontology mapping exists), also emit a subclass_of edge to
ENVO:01000254. Curated mappings already get their canonical
parent from the ontologies transform; only the 157 remaining
placeholders need this. Approx 157 edges.
4. kgmicrobe.pathway → GO:0008152 (metabolic process)
File: kg_microbe/transform_utils/madin_etal/madin_etal.py
In the fallback path where pathways aren't in METPO and have no
NER GO match, emit a subclass_of edge to GO:0008152 alongside
the existing tax→pathway edge. Approx 75 edges per merged-kg.
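As a rough illustration of the edge shape listed above, the helper below builds one subclass_of row. It is a sketch under assumptions: `make_subclass_edge`, the dict representation, the column name `primary_knowledge_source`, and the example subject and source values are all hypothetical, and the real transforms write rows through their own writers.

```python
from typing import Dict

# Column order assumed from the edge description in this commit message.
EDGE_COLUMNS = (
    "subject",
    "predicate",
    "object",
    "relation",
    "primary_knowledge_source",
    "knowledge_level",
    "agent_type",
)


def make_subclass_edge(child: str, parent: str, source: str) -> Dict[str, str]:
    """Build one subclass_of edge typing a minted CURIE under its OBO parent."""
    return {
        "subject": child,
        "predicate": "biolink:subclass_of",
        "object": parent,
        "relation": "rdfs:subClassOf",
        "primary_knowledge_source": source,
        "knowledge_level": "knowledge_assertion",
        "agent_type": "manual_agent",
    }


# Hypothetical assay node and source label, for illustration only.
edge = make_subclass_edge("kgmicrobe.assay:example", "MICRO:0000903", "bacdive")
row = [edge[col] for col in EDGE_COLUMNS]  # ordered cells ready for a TSV writer
```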
Validation:
poetry run pytest tests/test_isolation_source_mapping_utils.py
tests/test_metatraits.py tests/test_extract_metpo_proposals.py
→ 36/36 pass
ruff check on the 3 modified files → clean
Re-run scope: mediadive + bacdive + madin_etal transforms then merge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… fix
Two related changes that address the May-1 custom-term subclass review:
(1) NarrowMatch plumbing — surface 199 MIM-curated parent-of relations
=====================================================================
The MIM SSSOM has ~199 ``skos:narrowMatch`` rows that explicitly assert
"this kg-microbe ingredient X is a kind-of OBO parent Y" (e.g.
``MIM:Vermont_Soil narrowMatch ENVO:00001998 (soil)``). Previously the
consolidator treated narrowMatch rows as ordinary synonyms and the
asymmetric relationship was lost — neither the unified file nor the
runtime loader could express "kgmicrobe.ingredient:vermont_soil
narrowMatch ENVO:00001998".
Three coordinated changes resolve this:
- scripts/consolidate_chemical_mappings.py
* Add ``self.parent_relations`` and ``self.mim_to_primary``.
* In ``load_mediaingredientmech_reviewed``, capture skos:narrowMatch /
broadMatch rows verbatim (alongside the synonym extraction), and
track the symmetric exactMatch rows that establish the
MIM:<slug> ↔ kg-microbe primary correspondence.
* In ``export_unified_sssom``, pass the captured rows through into
the unified file with the MIM:<slug> subject translated to the
kg-microbe primary (e.g. cas:* or kgmicrobe.ingredient:*) when the
mapping is known. Normalise object_source to the obo:<prefix>.owl
convention so the SSSOM curie-map validator accepts the file.
- kg_microbe/utils/chemical_mapping_utils.py
* Add ``_PARENT_INDEX: Dict[curie, list[parents]]`` populated at
load time from skos:narrowMatch rows in the unified SSSOM.
* Public ``get_parents(curie)`` API plus a method on the
``ChemicalMappingLoader`` class. Returns the list of broader OBO
CURIEs the ingredient is narrower than.
- kg_microbe/transform_utils/mediadive/mediadive.py
* In the per-medium ingredient loop, after creating the ingredient
node, call ``self.chemical_loader.get_parents(ingredient_id)``
and emit one ``biolink:subclass_of`` edge per parent. The 199
MIM-curated parent relations now reach merged-kg as proper
subclass_of edges with rdfs:subClassOf as the relation.
Verified end-to-end:
* Unified SSSOM regenerated: 596,737 → 597,154 rows (+199
narrowMatch + 218 other small bumps), passes SSSOM validator.
* ``get_parents('kgmicrobe.ingredient:vermont_soil')`` returns
``['ENVO:00001998']``.
* ``get_parents('cas:143314-17-4')`` returns ``['CHEBI:61326']`` —
confirms the MIM:<slug> → cas:* translation works.
* Total entities with parents: 199.
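The parent index described above might look something like the sketch below. It assumes standard SSSOM column names (`subject_id`, `predicate_id`, `object_id`) and `#`-prefixed metadata lines; `build_parent_index` is a hypothetical name, and the real `_PARENT_INDEX` loader in chemical_mapping_utils.py may differ in detail.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List


def build_parent_index(sssom_tsv: str) -> Dict[str, List[str]]:
    """Map each child CURIE to the broader CURIEs named by its skos:narrowMatch rows."""
    index: Dict[str, List[str]] = defaultdict(list)
    # Skip SSSOM metadata lines, which start with '#'.
    lines = (ln for ln in io.StringIO(sssom_tsv) if not ln.startswith("#"))
    for row in csv.DictReader(lines, delimiter="\t"):
        if row["predicate_id"] != "skos:narrowMatch":
            continue
        parents = index[row["subject_id"]]
        if row["object_id"] not in parents:  # keep order, avoid duplicates
            parents.append(row["object_id"])
    return dict(index)


def get_parents(index: Dict[str, List[str]], curie: str) -> List[str]:
    """Return a copy of the parent list, or an empty list when none is recorded."""
    return list(index.get(curie, []))


# Tiny in-memory sample; the exactMatch row uses a placeholder object id.
sample = "\n".join(
    [
        "# illustrative metadata line",
        "\t".join(["subject_id", "predicate_id", "object_id"]),
        "\t".join(["kgmicrobe.ingredient:vermont_soil", "skos:narrowMatch", "ENVO:00001998"]),
        "\t".join(["kgmicrobe.ingredient:vermont_soil", "skos:exactMatch", "EXAMPLE:0000001"]),
    ]
)
index = build_parent_index(sample)
soil_parents = get_parents(index, "kgmicrobe.ingredient:vermont_soil")
```

Only narrowMatch rows feed the index, so exactMatch rows continue to act as plain identity mappings.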
(2) extract_metpo_proposals.py — split over-generalized aliases
================================================================
The May-2 audit flagged 13 metpo_alias entries that pointed at a METPO
parent class when a more specific child existed (e.g. "rod-shaped" →
METPO:1000666 cell shape, when METPO:1000681 "rod shaped" has
"rod-shaped" as a synonym). Direct edits to the regenerated TSV got
reverted by the test suite's regenerate-and-diff gate, so the fix
has to land in the extractor's source data.
Updated EXISTING_METPO_ALIASES in scripts/extract_metpo_proposals.py
to split each over-generalizing entry into a parent alias plus
specific child aliases:
cell shape (METPO:1000666) ← splits out:
rod-shaped → METPO:1000681
coccus → METPO:1000668
spiral → METPO:1000684
filamentous → METPO:1000674
oxygen requirement (METPO:1000601) ← splits out:
aerobic → METPO:1000602
anaerobic → METPO:1000603
facultative anaerobic → METPO:1000605
microaerophilic → METPO:1000604
aerotolerant → METPO:1000609
biosafety level classification (METPO:1001101) ← splits out:
BSL-1 → METPO:1001102
BSL-2 → METPO:1001103
BSL-3 → METPO:1001104
BSL-4 → METPO:1001105
motility phenotype (METPO:1000701) ← splits out:
motile → METPO:1000702
non-motile → METPO:1000703
Plus one wrong-target fix:
indole production capability — was METPO:1005011 (the "test
positive" outcome variant); fixed to METPO:1005010 (indole test).
The "test positive" alias kept as a separate entry pointing at
METPO:1005011 where it semantically belongs.
Regenerated metpo_alias_mappings.tsv + metpo_existing_aliases.tsv
ship in this commit. The test_extract_metpo_proposals regenerate-
and-diff gate now passes against the new state.
Total: 71/71 pytest pass (extract_metpo_proposals + chemical_mapping_utils
+ isolation_source_mapping_utils).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex flagged three ways the recent subclass-plumbing work would
poison the merged-kg with semantically wrong relationships. All three
fixes ship together because they are interdependent (the loader
trust policy interacts with the placeholder fallback emit, and the
narrowMatch filter interacts with the get_parents() index).
Finding 1 [HIGH] — manual closeMatch rows promoted to canonical nodes
=====================================================================
File: kg_microbe/utils/isolation_source_mapping_utils.py
mappings/validate_isolation_source_mappings.py
The loader's _row_is_trusted() accepted any row tagged
``semapv:ManualMappingCuration`` regardless of predicate. That admitted
41 manually-curated ``skos:closeMatch`` rows, including:
* Catheter → NCIT:C50344 (Catheter Device) — device, not source
* Child → PATO:0001190 (juvenile) — quality, not source
* Humid → NCIT:C88206 (Humidity) — quality, not source
* Psychrophilic-<10°C → METPO:1000614 — phenotype class, not source
* Boreal → ENVO:01000174 (forest biome) — biome name mismatch
Tightened trust policy: substitution into the BacDive graph requires
``skos:exactMatch`` regardless of curator. closeMatch rows fall back
to placeholder isolation_source:* nodes. Two acceptable trust paths
within exactMatch: high-confidence auto-match OR manual curation.
Net effect: 207 → 158 trusted mappings; 49 closeMatch rows correctly
drop instead of poisoning the graph.
The standalone validator's _row_is_trusted() is updated to match
(test_validator_rules_match_loader enforces the parity).
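The tightened policy can be pictured with a small predicate like the one below. This is only a sketch: the column names `predicate_id`, `mapping_justification`, and `confidence`, and the value `"high"`, are assumptions about the TSV layout, and the real `_row_is_trusted()` may apply further checks.

```python
from typing import Dict


def row_is_trusted(row: Dict[str, str]) -> bool:
    """Trust a row only if it is an exactMatch AND is manual or high-confidence."""
    if row.get("predicate_id") != "skos:exactMatch":
        return False  # closeMatch never substitutes, regardless of curator
    manual = row.get("mapping_justification") == "semapv:ManualMappingCuration"
    high_confidence = row.get("confidence") == "high"
    return manual or high_confidence


# Illustrative rows; only the predicate/justification/confidence cells matter here.
manual_close = {
    "predicate_id": "skos:closeMatch",
    "mapping_justification": "semapv:ManualMappingCuration",
    "confidence": "high",
}
manual_exact = {
    "predicate_id": "skos:exactMatch",
    "mapping_justification": "semapv:ManualMappingCuration",
    "confidence": "medium",
}
auto_exact_medium = {
    "predicate_id": "skos:exactMatch",
    "mapping_justification": "semapv:LexicalMatching",
    "confidence": "medium",
}
```

Checking the predicate first means a manual-curation tag can no longer rescue a closeMatch row, which is the behavior change this finding asks for.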
Finding 2 [HIGH] — bad MIM narrowMatch rows generate false subclass edges
==========================================================================
File: scripts/consolidate_chemical_mappings.py
MIM's auto_classify_ingredient_type pipeline produced 5 narrowMatch
rows where the chemistry on both sides is unrelated:
* MIM:Kh2po4 → CHEBI:32583 (KH2PO4 vs calcium sulfate dihydrate)
* MIM:Mncl2_X_2_H2o → CHEBI:30200 (MnCl2 vs kaempferol glycoside)
* MIM:Mncl2_X_4_H2o → CHEBI:30200
* MIM:Mncl2_anhydrous → CHEBI:30200
* MIM:D-Maltose_Monohydrate → CHEBI:233428 (maltose vs amiloride analog)
Without this filter, get_parents() exposed those rows to MediaDive's
new biolink:subclass_of emit path (commit f3a8199), which would
have made the maltose ingredient a subclass of an unrelated amiloride
analog in the merged-kg.
Added KNOWN_BAD_NARROWMATCH set in load_mediaingredientmech_sssom()
that drops these specific (subject_id, object_id) pairs at row-load
time. The filter is idempotent — when MIM upstream removes the rows
it becomes a no-op for us. Verified: regenerated unified file has
``cas:6363-53-7 parents []`` and the parallel cases for KH2PO4
and MnCl2 hydrates.
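A minimal sketch of such a pair filter is shown below, using the five (subject_id, object_id) pairs listed in this finding. The function name `drop_known_bad` and the dict-per-row representation are assumptions; the actual filter lives inside the consolidator's MIM loader.

```python
from typing import Dict, Iterable, List

# The five pairs named in Finding 2.
KNOWN_BAD_NARROWMATCH = frozenset(
    {
        ("MIM:Kh2po4", "CHEBI:32583"),
        ("MIM:Mncl2_X_2_H2o", "CHEBI:30200"),
        ("MIM:Mncl2_X_4_H2o", "CHEBI:30200"),
        ("MIM:Mncl2_anhydrous", "CHEBI:30200"),
        ("MIM:D-Maltose_Monohydrate", "CHEBI:233428"),
    }
)


def drop_known_bad(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop narrowMatch rows whose (subject, object) pair is on the known-bad list."""
    kept = []
    for row in rows:
        pair = (row["subject_id"], row["object_id"])
        if row["predicate_id"] == "skos:narrowMatch" and pair in KNOWN_BAD_NARROWMATCH:
            continue
        kept.append(row)
    return kept


rows = [
    {"subject_id": "MIM:Kh2po4", "predicate_id": "skos:narrowMatch", "object_id": "CHEBI:32583"},
    {"subject_id": "MIM:Vermont_Soil", "predicate_id": "skos:narrowMatch", "object_id": "ENVO:00001998"},
]
kept = drop_known_bad(rows)
```

Because the filter matches exact pairs, it does nothing once the upstream rows disappear, which is what makes it safe to leave in place until MIM is fixed.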
Finding 3 [MEDIUM] — blanket ENVO subclass_of for all isolation_source placeholders
====================================================================================
File: kg_microbe/transform_utils/bacdive/bacdive.py
The previous commit (959baa6) emitted
``isolation_source:* biolink:subclass_of ENVO:01000254`` for every
unmapped isolation_source placeholder. But the table intentionally
leaves labels like 'Human', 'Leaf-Phyllosphere', and
'host_animal_endotherm_intratissue' unmapped, and those are NOT
environmental materials — they're hosts / anatomy / niches. A blanket
ENVO parent would poison downstream reasoning over source type.
Removed the blanket subclass_of edge. Placeholders stay unparented
until a vetted host/anatomy/environment mapping lands in
mappings/isolation_source_to_ontology.tsv. The mediadive.solution →
CHEBI:60004, kgmicrobe.assay → MICRO:0000903, kgmicrobe.pathway →
GO:0008152 emits all stay (those are correct single-parent types).
Verified
========
* python mappings/validate_isolation_source_mappings.py → OK
* poetry run pytest tests/test_isolation_source_mapping_utils.py
tests/test_chemical_mapping_utils.py
tests/test_consolidate_chemical_mappings.py
tests/test_metatraits.py → 110 passed
* Consolidator regenerates unified_ingredient_mappings.sssom.tsv.gz
cleanly: 5 known-bad narrowMatch dropped at MIM load.
* test_loader_honors_manually_curated_fixes updated to match new
policy (Plant→Viridiplantae was a closeMatch row that no longer
qualifies; Mammals→Mammalia is exactMatch and still honored).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Following the Codex adversarial review's tightening of the loader trust
policy (commit 7bc3fd7), which now requires skos:exactMatch for canonical
node substitution, 41 manually-curated skos:closeMatch rows in
mappings/isolation_source_to_ontology.tsv stopped being honored at
runtime. This commit re-audits each one against isolation-source
semantics and either:
(a) promotes to skos:exactMatch when the BacDive label and ontology term
denote the same entity in isolation-source context, or
(b) keeps as closeMatch when there's a real family mismatch (device for
a sample, quality/phenotype for a source, etc.).
PROMOTED (34 rows):
Host taxa (common name → NCBITaxon class/family):
* Birds → NCBITaxon:8782 Aves
* Chicken → NCBITaxon:9031 Gallus gallus
* Dinoflagellate → NCBITaxon:2864 Dinophyceae
* Fishes → NCBITaxon:7898 Actinopterygii
* Plant → NCBITaxon:33090 Viridiplantae
* Plants → NCBITaxon:33090 Viridiplantae
* Reptilia → NCBITaxon:8504 Lepidosauria
* Tick → NCBITaxon:6939 Ixodida
Anatomy (BacDive label → UBERON canonical):
* Ankle → UBERON:0001488 ankle joint
* Bladder → UBERON:0018707 bladder organ
* Gastrointestinal-tract → UBERON:0005409 digestive tract
* Tooth → UBERON:0001091 calcareous tooth
* Urogenital-tract → UBERON:0004122 genitourinary system
Plant anatomy (PO):
* Phylloplane → PO:0006016 leaf epidermis
* Plant-sap-Flux → PO:0025538 plant sap
* Stem-Branch → PO:0009047 stem
Environments / substrates (ENVO/FOODON):
* Boreal → ENVO:01000174 forest biome
* Composting → ENVO:00002170 compost
* Hot → ENVO:01000305 high temperature environment
* Indoor → ENVO:01000856 indoor environment
* Iron-mat → ENVO:01000110 microbial mat
* Lake-large → ENVO:00000020 lake
* Meat → FOODON:00001027 meat food product
* Plant-litter-Forest → ENVO:01000628 plant litter
* Pond-small → ENVO:00000033 pond
* Thermal-spring → ENVO:00000051 hot spring
* Volcanic → ENVO:00000094 volcanic feature
* Water-reservoir-Aquarium/pool → ENVO:00000025 reservoir
Cellular contexts (GO):
* Extracellular → GO:0005615 extracellular space
* Intracellular → GO:0005622 intracellular anatomical structure
Clinical / pathology / virology (mesh, NCIT):
* Lesion-incl.-Necrosis → NCIT:C3824 Lesion
* Peat-moss → mesh:D044003 Sphagnopsida
* Viriome → mesh:D000083422 Virome
* Wound → mesh:D014947 Wounds and Injuries
KEPT DROPPED (7 rows, family-mismatched targets):
* Catheter → NCIT:C50344 (Catheter Device): device, not source
* Child → PATO:0001190 (juvenile): quality, not source
* Humid → NCIT:C88206 (Humidity): quality, not source
* Psychrophilic-<10°C → METPO:1000614: phenotype class, not source
* Thermophilic->45°C → METPO:1000616: phenotype class, not source
* Heavy-metal → CHEBI:25555 (monoatomic ion): semantic drift, since not
all heavy metals are monoatomic ions
* Bronchial-wash → UBERON:0002185 (bronchus): sample type vs anatomy
Net effect on next merged-kg:
* 158 → 192 trusted isolation_source mappings (+34)
* ~2,500 organism→ontology edges added across the promoted labels
(estimate based on prior edge counts; will materialize on rerun)
Tests updated to reflect the post-audit state. The
test_loader_honors_manually_curated_fixes assertion now checks five
representative promotions plus four representative drops.
Verified:
* poetry run pytest tests/test_isolation_source_mapping_utils.py
tests/test_metatraits.py → 35/35 pass
* python mappings/validate_isolation_source_mappings.py → OK
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_curated_fixes
The docstring I added in commit 0626294 used a single-line summary on the
first line followed by a blank line and detail paragraph. The repo's ruff
config enforces D213 (multi-line summary must start on the second line),
so the linter rejected it.
Auto-fixed by ruff --fix: the summary now begins on the line after the
opening triple-quote, matching the style of the other multi-line
docstrings in this file.
Verified locally:
* poetry run ruff check kg_microbe/ tests/ → all checks passed
* poetry run pytest tests/test_isolation_source_mapping_utils.py → 9/9 pass
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dings
Round-2 Codex review caught two issues that survived the first cleanup:
Finding 1 [HIGH] — trusted mappings still admit qualities, procedures, devices
==============================================================================
Files: kg_microbe/utils/isolation_source_mapping_utils.py
mappings/validate_isolation_source_mappings.py
The previous trust policy required skos:exactMatch but did not validate
the ontology family of the target. That admitted 11 trusted rows where
the BacDive label was a sample source but the target was a quality,
procedure, or device — producing organism→quality / organism→procedure
edges that look like sample-source claims:
Acidic → PATO:0001429 (pH quality)
Alkaline → PATO:0001430 (pH quality)
Cold → PATO:0000256 (temperature quality)
Female → PATO:0000383 (biological sex)
Male → PATO:0000384 (biological sex)
Juvenile → PATO:0001190 (life-stage quality)
Antibiotic-treatment → PRIDE:0001000 (a treatment, not a substrate)
Food-production → FOODON:03530206 (a process, not a substrate)
Medical-device → NCIT:C16830 (a device, not a substrate)
Swab → NCIT:C17627 (a collection procedure)
Surface-swab → SNOMED:258537007 (collection procedure)
Two coordinated fixes:
* DISALLOWED_OBJECT_SOURCES gains PATO and METPO. PATO is universally a
qualities ontology — never a substrate. METPO is for phenotype classes
the organism *exhibits*, not a place organisms are isolated *from*.
These are reject-by-prefix.
* BANNED_OBJECT_LABEL_SUBSTRINGS gains "swab", "medical device",
"food production", and "antibiotic treatment". These catch the
procedure / device / process rows in mixed-content prefixes
(NCIT and SNOMED contain real substrates AND clinical procedures —
prefix-level rejection would lose Aspirate, Blood-culture, etc.).
The 11 affected rows are unmapped in
mappings/isolation_source_to_ontology.tsv with curator='family_mismatch_fix'
and notes-column rationale citing this Codex round.
The standalone validator's banned lists are kept in sync (drift-detection
test test_validator_rules_match_loader enforces the parity).
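The two rejection layers could be combined roughly as in the sketch below. It is illustrative only: the constants show just the entries added in this commit (the real lists contain more), `family_mismatch_reason` is a hypothetical name, and the object labels in the examples are stand-ins rather than the actual ontology labels.

```python
from typing import Optional

# Only the entries named in this commit; the real lists are longer.
DISALLOWED_OBJECT_SOURCES = frozenset({"PATO", "METPO"})
BANNED_OBJECT_LABEL_SUBSTRINGS = (
    "swab",
    "medical device",
    "food production",
    "antibiotic treatment",
)


def family_mismatch_reason(object_id: str, object_label: str) -> Optional[str]:
    """Return why a target is rejected, or None when it passes both checks."""
    prefix = object_id.split(":", 1)[0]
    if prefix in DISALLOWED_OBJECT_SOURCES:
        return f"disallowed prefix {prefix}"
    label = object_label.lower()
    for needle in BANNED_OBJECT_LABEL_SUBSTRINGS:
        if needle in label:
            return f"banned label substring '{needle}'"
    return None


# Labels below are illustrative stand-ins.
quality_reason = family_mismatch_reason("PATO:0001429", "acidic")
procedure_reason = family_mismatch_reason("SNOMED:258537007", "Surface swab")
substrate_reason = family_mismatch_reason("NCIT:C13347", "Aspirate")
```

Prefix rejection handles ontologies that never describe a substrate, while the substring check targets individual rows inside mixed-content prefixes such as NCIT and SNOMED.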
Finding 2 [HIGH] — BacDive emitted edges to unloaded prefixes
==============================================================
Files: kg_microbe/utils/isolation_source_mapping_utils.py
kg_microbe/transform_utils/bacdive/bacdive.py
BacDive's emit path writes the mapped CURIE directly as the edge object.
For the edge to land cleanly, *something* has to materialize a node for
that CURIE — either ontologies_transform (if the prefix is in
ONTOLOGIES_MAP) or BacDive itself (if the prefix is in
STUB_ONTOLOGY_PREFIXES). Codex found 21 trusted rows whose targets
satisfied neither condition, producing dangling references in the
merged graph: mesh, NCIT, GENEPIO, FAO, BTO, SNOMED prefixes.
Two coordinated fixes:
* STUB_ONTOLOGY_PREFIXES extended from {PRIDE, PCO} to also cover
{mesh, NCIT, GENEPIO, FAO, BTO, SNOMED}. BacDive now emits a thin
node row per occurrence with the object_label from the mapping TSV
and biolink:OntologyClass category — same pattern previously used for
PRIDE/PCO. The full ontologies aren't loaded (mesh and NCIT are
enormous clinical thesauri); per-mapping stub nodes are sufficient
for the small number of trusted IDs in use.
* New BacDiveTransform._validate_isolation_source_target_prefixes()
runs at __init__ time and aborts with a clear, fail-fast error if
any trusted mapping points at a prefix that isn't either loaded by
the ontologies transform or in the stub set. Catches future curator
mistakes (or deletions of stub support) at load time, not after the
graph has been corrupted.
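A fail-fast prefix check along these lines might be sketched as follows. The function name, its arguments, and the error wording are assumptions for illustration; the real method is `BacDiveTransform._validate_isolation_source_target_prefixes()` and reads its inputs from the transform's own state.

```python
from typing import Iterable, Set


def validate_target_prefixes(
    trusted_targets: Iterable[str],
    loaded_prefixes: Set[str],
    stub_prefixes: Set[str],
) -> None:
    """Raise ValueError if any trusted target uses a prefix nothing will materialize."""
    allowed = set(loaded_prefixes) | set(stub_prefixes)
    seen = {curie.split(":", 1)[0] for curie in trusted_targets}
    unknown = sorted(seen - allowed)
    if unknown:
        raise ValueError(
            "Trusted isolation-source mappings point at prefixes that are neither "
            f"loaded by the ontologies transform nor stubbed: {unknown}"
        )


# Illustrative prefix sets; the real ones come from ONTOLOGIES_MAP and
# STUB_ONTOLOGY_PREFIXES.
loaded = {"ENVO", "UBERON"}
stubs = {"PRIDE", "PCO", "mesh", "NCIT", "GENEPIO", "FAO", "BTO", "SNOMED"}
validate_target_prefixes(["ENVO:00003896", "mesh:D000038"], loaded, stubs)  # passes

try:
    validate_target_prefixes(["ExO:0000048"], loaded, stubs)
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
```

Running the check at construction time means a bad mapping stops the transform before any edge file is written.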
Verified
========
* python mappings/validate_isolation_source_mappings.py → OK
* poetry run pytest tests/test_isolation_source_mapping_utils.py
tests/test_metatraits.py → 35/35 pass
* BacDiveTransform() instantiates cleanly:
"trusted mappings: 181"
"target prefixes in trusted set:
['BTO', 'CHEBI', 'ENVO', 'FAO', 'FOODON', 'GENEPIO', 'GO',
'NCBITaxon', 'NCIT', 'PCO', 'PO', 'PRIDE', 'SNOMED',
'UBERON', 'mesh']"
Every prefix is in ONTOLOGIES_MAP or STUB_ONTOLOGY_PREFIXES.
Net effect: trusted mappings 192 → 181 (-11 family-mismatched). The
edges that previously dangled (mesh:D000038 'Abscess',
NCIT:C13347 'Aspirate', BTO:0003114 'wound fluid', etc.) now have
proper stub nodes in BacDive's output.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… file
Codex's third-round adversarial review identified that the recent
narrowMatch plumbing was structurally broken: only 19 of 194
narrowMatch rows resolved back to their intended child CURIE; 131
collapsed onto the parent. Three coordinated fixes:
(1) Stop materializing asymmetric MIM rows into parent's lexical record
=========================================================================
File: scripts/consolidate_chemical_mappings.py
(load_mediaingredientmech_sssom, lines ~1257-1346)
The asymmetric branch (narrowMatch / broadMatch) used to fall through
to the same add_chemical(id=object_id, ...) call as symmetric matches,
feeding the child's subject_label and MIM xref into the broader
parent's synonym/xref table. After this change, asymmetric rows are
stored in self.parent_relations only — they no longer touch the parent
entity's lexical state. The child's labels/xrefs come exclusively from
the sibling exactMatch row (e.g. MIM:Vermont_Soil →
kgmicrobe.ingredient:vermont_soil) processed in the symmetric branch.
(2) Add purge_asymmetric_pollution() to clean up baseline reseed leakage
=========================================================================
The consolidator's load_existing_unified() seeds from the prior
unified file, which carried forward the polluted state from earlier
runs. New purge step removes:
* Child labels (subject_label, child's canonical_name, child's
synonyms) from each parent's synonym set
* MIM:<child> xref from each parent's xref set
* The cross-xref symmetry between child_primary ↔ parent_primary
that propagate_synonyms_via_xrefs would otherwise re-amplify
Runs after MIM SSSOM load, before propagate_synonyms_via_xrefs, so the
cleaned data doesn't get re-bridged through xref equivalence.
Logs counts each run: e.g. "Purged 188 stray child-label synonym(s)
and 158 stray MIM xref(s) from 164 parent record(s)."
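The purge step might be pictured with the simplified sketch below. The data model is an assumption made for illustration: each record is a dict holding `synonyms` and `xrefs` sets, and each parent relation carries `child_mim_id`, `child_primary`, `child_labels`, and `parent` keys. The real `purge_asymmetric_pollution()` operates on the consolidator's own structures.

```python
from typing import Dict, List, Set, Tuple

Record = Dict[str, Set[str]]  # {"synonyms": {...}, "xrefs": {...}}


def purge_asymmetric_pollution(
    records: Dict[str, Record], parent_relations: List[dict]
) -> Tuple[int, int, int]:
    """Remove child labels and MIM xrefs from parents; return (synonyms, xrefs, parents) counts."""
    synonyms_removed = 0
    xrefs_removed = 0
    touched: Set[str] = set()
    for rel in parent_relations:
        parent = records.get(rel["parent"])
        if parent is None:
            continue
        stray_labels = parent["synonyms"] & set(rel["child_labels"])
        stray_xrefs = parent["xrefs"] & {rel["child_mim_id"]}
        parent["synonyms"] -= stray_labels
        parent["xrefs"] -= stray_xrefs
        # Break the child <-> parent xref symmetry so later propagation
        # cannot bridge the two records again.
        parent["xrefs"].discard(rel["child_primary"])
        child = records.get(rel["child_primary"])
        if child is not None:
            child["xrefs"].discard(rel["parent"])
        synonyms_removed += len(stray_labels)
        xrefs_removed += len(stray_xrefs)
        if stray_labels or stray_xrefs:
            touched.add(rel["parent"])
    return synonyms_removed, xrefs_removed, len(touched)


records = {
    "ENVO:00001998": {
        "synonyms": {"soil", "Vermont Soil"},
        "xrefs": {"MIM:Vermont_Soil", "kgmicrobe.ingredient:vermont_soil"},
    },
    "kgmicrobe.ingredient:vermont_soil": {
        "synonyms": {"Vermont Soil"},
        "xrefs": {"ENVO:00001998"},
    },
}
relations = [
    {
        "child_mim_id": "MIM:Vermont_Soil",
        "child_primary": "kgmicrobe.ingredient:vermont_soil",
        "child_labels": ["Vermont Soil"],
        "parent": "ENVO:00001998",
    }
]
counts = purge_asymmetric_pollution(records, relations)
```

The child keeps its own label; only the copy that leaked onto the parent is removed.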
(3) Rename the unified mappings file
=====================================
mappings/unified_ingredient_mappings.sssom.tsv.gz
→ mappings/kgmicrobe_unified_entity_mappings.sssom.tsv.gz
The file holds chemicals AND foods AND anatomy AND environments —
"ingredient" was always too narrow. Standardizing on
"kgmicrobe_unified_entity_mappings" matches the kg-microbe scope.
All references updated:
* scripts/consolidate_chemical_mappings.py (output path + docstring)
* kg_microbe/utils/chemical_mapping_utils.py (default loader path
+ docstrings)
* mappings/README.md
* mappings/validate_manual_mappings.py
* tests/test_negative_cache.py
Verification
============
Verified the fix end-to-end against representative MIM-curated
child terms:
Vermont Soil → kgmicrobe.ingredient:vermont_soil parents=['ENVO:00001998']
Beef brain powder → kgmicrobe.ingredient:beef_brain_powder parents=['FOODON:02020911']
Actinomycin A → kgmicrobe.compound:actinomycin_a parents=['CHEBI:15369']
Codex's coverage check across the full set: previously 19/194 narrowMatch
rows resolved to their child; now 121/194 do (about 6× more). The remaining
25 parent-resolutions and 46 other-resolutions are mostly distinct
secondary-pollution channels that need a separate audit.
Three new regression tests in tests/test_chemical_mapping_utils.py
under TestNarrowMatchChildResolution exercise the committed mapping
file (not mocks) so a future consolidator regression that re-pollutes
parents will fail loudly.
* poetry run ruff check kg_microbe/ tests/ → all checks passed
* poetry run pytest tests/test_chemical_mapping_utils.py
tests/test_isolation_source_mapping_utils.py
tests/test_consolidate_chemical_mappings.py
tests/test_metatraits.py → 114/114 pass
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Vendored copy of MIM's ingredient_mappings.sssom.tsv now reflects the
state introduced by MIM commit 887ee9f on
fix/remove-bad-narrow-match-rows-pr558: the 5 KNOWN_BAD_NARROWMATCH rows
(KH2PO4, MnCl2_*, D-Maltose) where the auto-classifier produced
unrelated chemistry targets are removed.
Diff: -7 / +1 (net -5 narrowMatch rows + 1 comment-line update on the
surviving cas: identity row for D-Maltose_Monohydrate, which documents
the bogus CHEBI:233428 reference removal).
This vendored sync matches the SSSOM state that MIM PR1
(fix/remove-bad-narrow-match-rows-pr558, also includes commit 16a6527:
Group A validator + CI gate) will publish once merged to MIM main. The
next consolidator run will reproduce the same state idempotently from
whichever MIM:main commit is current.
Once that PR merges and another MIM-driven consolidator pass runs,
kg-microbe's KNOWN_BAD_NARROWMATCH filter at
consolidate_chemical_mappings.py:1211-1217 becomes redundant; that
workaround can be removed in a follow-up PR (mirrors the planned removal
of purge_asymmetric_pollution() once MIM PR2's structural invariants
land).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 5 hardcoded bad-pair entries in consolidate_chemical_mappings.py
were filtering MIM rows that have since been corrected upstream. The
filter is now redundant at multiple layers (MIM upstream +
asymmetric-pollution purge + xref sweep), so this drop removes the
local guard and keeps MIM as the single source of truth.
Re-ran the consolidator against the freshly-updated MIM SSSOM:
- 2017 MIM rows loaded (0 skipped as known-bad; filter no longer applied)
- 1881 stale MIM xrefs swept from baseline
- 19 stray child-label synonyms + 159 stray MIM xrefs purged from 148
parent records (asymmetric-pollution guard)
- 594,970 unified mappings emitted, SSSOM round-trip validation passes
- All 67 chemical-mapping tests pass
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The remote METPO classes ROBOT template fetched by load_metpo_mappings
pinned berkeleybop/metpo at the 2026-03-24 tag, which meant a curator
edit to fix a label→METPO-ID mapping required either bumping the tag or
waiting on a new METPO release. This adds a final overlay step that
reads `kg_microbe/transform_utils/metatraits/mappings/metpo_alias_mappings.tsv`
(67 high-confidence ManualMappingCuration rows) and updates the in-memory
mapping dict so curator edits take effect on the next transform run.
Trust policy mirrors the BacDive isolation-source loader:
- mapping_justification == 'semapv:ManualMappingCuration', AND
- confidence in {'high', 'medium'}
Rows pointing at unminted METPO IDs (proposed-but-not-yet-released) are
skipped with INFO logging — those keep flowing through the kgmicrobe.*
placeholder path which is the correct destination until upstream lands
the proposal. Both raw and normalized label keys are emitted so case-
mismatched callers find the override.
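The overlay and its trust policy can be sketched as below. This is an illustration under stated assumptions, not the code in `load_metpo_mappings`: the `subject_label` and `object_id` column names, the `normalize_label` rule, and the `minted_ids` argument are hypothetical; only the `mapping_justification` and `confidence` checks come from the description above.

```python
import csv
import logging

logger = logging.getLogger(__name__)

TRUSTED_JUSTIFICATION = "semapv:ManualMappingCuration"
TRUSTED_CONFIDENCE = {"high", "medium"}


def normalize_label(label):
    """Assumed normalisation: lower-case, underscores to spaces, squeeze spaces."""
    return " ".join(label.lower().replace("_", " ").split())


def read_alias_rows(handle):
    """Read a TSV alias file into a list of dict rows."""
    return list(csv.DictReader(handle, delimiter="\t"))


def apply_alias_overrides(mapping, rows, minted_ids):
    """Overlay trusted curator rows onto the in-memory label -> ID mapping.

    Returns the number of rows applied. Rows that fail the trust policy or
    point at an ID missing from minted_ids are left out.
    """
    applied = 0
    for row in rows:
        if row.get("mapping_justification") != TRUSTED_JUSTIFICATION:
            continue
        if row.get("confidence") not in TRUSTED_CONFIDENCE:
            continue
        target = row["object_id"]
        if target not in minted_ids:
            logger.info("Skipping %r: %s is not minted yet", row["subject_label"], target)
            continue
        label = row["subject_label"]
        # Emit both the raw and the normalised key so either spelling resolves.
        for key in {label, normalize_label(label)}:
            mapping[key] = target
        applied += 1
    return applied
```

Applying the overlay last means a curator row wins over whatever the pinned template supplied for the same label.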
Tests: 4 new unit tests in tests/test_metpo_alias_overrides.py exercise
the helper in isolation (no network) by stubbing the METPO tree with a
minimal node set. All 67 rows round-trip cleanly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two semantic fixes for transforms that emitted nodes with the wrong
biolink role:
1. madin_etal substrate/quality partition (madin_etal.py)
Madin et al's environments.csv ENVO_ids column conflates ENVO
substrates with PATO qualities for compositional habitats like
"rock_deep" → ["ENVO:00001995 rock", "PATO:0001596 increased depth"].
The transform was emitting one organism→location_of edge per CURIE, so
PATO qualities ended up as locations of organisms (~569 such edges
before fix). New `_partition_substrate_quality_curies()` helper splits
substrates from qualities; substrates anchor organism→location_of
edges, qualities attach to those substrates via a new
biolink:has_attribute / RO:0000086 has_quality predicate. PATO nodes
are emitted with biolink:PhenotypicQuality category. Adds
HAS_QUALITY_RELATION / HAS_QUALITY_PREDICATE constants.
2. mediadive medium categorization (mediadive.py + constants.py)
Individual mediadive.medium:* nodes were single-cat
biolink:GrowthMedium, which flattened the upstream-biolink
defined/complex distinction. Now multi-cat per the medium's
complex_medium_type flag:
- defined: biolink:GrowthMedium|biolink:ChemicalMixture
- complex: biolink:GrowthMedium|biolink:ComplexMolecularMixture
The medium-type parent nodes get the matching biolink-only category:
- mediadive.medium-type:defined → biolink:ChemicalMixture
- mediadive.medium-type:complex → biolink:ComplexMolecularMixture
Also fixes a P1-P10 orphan bug surfaced by the new kg-path-review
`orphan-edges` archetype: when a medium has no SOLUTIONS_KEY in its
detail JSON, the loop continues past the medium-node-emission point
while still having emitted the subclass_of edge. P1-P10 pharmacopoeial
media survived to the merged KG with biolink:NamedThing fallback and
empty names. Fix moves the medium node row write to right after the
medium-type edge so it is never skipped.
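The mediadive category assignment above amounts to a small lookup. The sketch below assumes the `complex_medium_type` flag can be read as a boolean; the function and constant names are hypothetical and are not taken from mediadive.py or constants.py.

```python
GROWTH_MEDIUM = "biolink:GrowthMedium"
CHEMICAL_MIXTURE = "biolink:ChemicalMixture"
COMPLEX_MIXTURE = "biolink:ComplexMolecularMixture"


def medium_type_category(is_complex):
    """Category for the medium-type parent node (defined vs complex)."""
    return COMPLEX_MIXTURE if is_complex else CHEMICAL_MIXTURE


def medium_category(is_complex):
    """Pipe-joined multi-category string for an individual medium node."""
    return "|".join([GROWTH_MEDIUM, medium_type_category(is_complex)])
```

Deriving the medium's second category from the parent's category keeps the two from drifting apart.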
Tests:
- tests/test_madin_pato_partition.py (5 tests): canonical rock_deep
split, pure-substrate row, multi-substrate-with-quality cross-product,
PATO-only edge case, unknown-prefix-treated-as-substrate.
Affected transforms: madin_etal, mediadive (rerun before re-merging).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
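For readers who want the substrate/quality split in concrete form, here is a minimal sketch. It is not the body of `_partition_substrate_quality_curies()`: the function names, the tuple edge format, and the choice to emit no edges for a PATO-only row are assumptions for illustration.

```python
QUALITY_PREFIXES = {"PATO"}


def partition_substrate_quality(entries):
    """Split 'CURIE label' strings into (substrates, qualities).

    Any prefix outside QUALITY_PREFIXES, including unknown ones, is treated
    as a substrate.
    """
    substrates, qualities = [], []
    for entry in entries:
        parts = entry.split()
        if not parts:
            continue
        curie = parts[0]
        prefix = curie.split(":", 1)[0]
        if prefix in QUALITY_PREFIXES:
            qualities.append(curie)
        else:
            substrates.append(curie)
    return substrates, qualities


def habitat_edges(organism, entries):
    """Build (subject, predicate, object) edges for one habitat row."""
    substrates, qualities = partition_substrate_quality(entries)
    # Substrates anchor the organism edge ...
    edges = [(organism, "biolink:location_of", s) for s in substrates]
    # ... and every quality attaches to every substrate (cross-product).
    edges += [(s, "biolink:has_attribute", q) for s in substrates for q in qualities]
    return edges
```

With the "rock_deep" row from the commit message, this yields one organism edge to ENVO:00001995 and one has_attribute edge from ENVO:00001995 to PATO:0001596, so the quality never appears as a location.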
kg-path-review (kg_path_review.py + SKILL.md):
- New `family-mismatch` archetype: flags edges whose subject prefix is
in {PATO, UO, METPO} when the predicate is biolink:location_of /
biolink:has_part. Mirrors DISALLOWED_OBJECT_SOURCES in the BacDive
trust filter. Catches the bug class fixed in this PR session
(PATO-as-organism-location from BacDive and madin_etal).
- New `orphan-edges` archetype: per-transform endpoint integrity check.
Cross-transform-supplied prefixes (CHEBI/ENVO/UBERON/etc. that the
ontologies transform fills in at merge) are filtered by default to
keep signal-to-noise high; `--include-cross-transform` opts in.
Surfaced the mediadive P1-P10 orphan bug fixed in the previous commit.
- New `_list_transform_dirs()` helper filters merge-snapshot dirs
(`merged_*`, `merged-*`) from aggregate archetypes — fixes the
triple-counting that caused fake CRITICAL cardinality findings
earlier in the session.
- `warn_if_stale_merge()` runs before every archetype and prints a
stderr warning when merged-kg.tar.gz is older than any transform
output. Catches the staleness pitfall hit twice this session.
- `false-majority` proxy: refined regex to skip canonical polarity
trait labels (gram negative, catalase positive, oxidase variable,
etc.) — without this, 36k legitimate gram-negative organism edges
flooded the report. Documented the proxy's label-shaped-only
limitation.
- New CLI flags: `--include-cross-transform`, `--max-rows`.
- SKILL.md: new "Operational gotchas" section pinning the four
recurring pitfalls (stale builds, snapshot dirs, gram-negative as
positive trait, PATO-as-location). Walk example updated to
kgmicrobe.strain (BacDive's actual strain CURIE prefix; NCBITaxon
references in BacDive go DOWN to strains via location_of, not the
other way around).
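The `family-mismatch` rule above reduces to a prefix-and-predicate filter. The sketch below is an illustration only, not the code in kg_path_review.py; the function name and the tuple edge format are assumptions, while the prefix and predicate sets are the ones listed above.

```python
QUALITY_LIKE_PREFIXES = {"PATO", "UO", "METPO"}
PLACEMENT_PREDICATES = {"biolink:location_of", "biolink:has_part"}


def find_family_mismatches(edges):
    """Return edges whose subject prefix cannot sensibly carry the predicate.

    edges is an iterable of (subject, predicate, object) CURIE tuples.
    """
    flagged = []
    for subject, predicate, obj in edges:
        prefix = subject.split(":", 1)[0]
        if predicate in PLACEMENT_PREDICATES and prefix in QUALITY_LIKE_PREFIXES:
            flagged.append((subject, predicate, obj))
    return flagged
```

A check like this is cheap enough to run per transform, which is where a PATO-as-location edge is easiest to trace back to its source.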
kg-model-review (SKILL.md):
- Documented multi-category nodes (e.g.
METPO:1001000|biolink:Procedure on kgmicrobe.assay nodes; the
reviewer accepts any pipe-split component being valid).
- Added biolink:Procedure, biolink:PhenotypicQuality to recognized
categories with usage notes.
- Added biolink:has_attribute to recognized predicates (used by the
new madin_etal substrate-quality fix).
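The multi-category acceptance rule ("any pipe-split component being valid") can be written as a one-line check. This is a hypothetical helper for illustration, not the reviewer's actual code; the `recognized` set is supplied by the caller.

```python
def has_recognized_category(category_field, recognized):
    """True if at least one pipe-separated component is a recognized category."""
    parts = (part.strip() for part in category_field.split("|"))
    return any(part in recognized for part in parts if part)
```

Under this rule a node such as `METPO:1001000|biolink:Procedure` passes as long as `biolink:Procedure` is in the recognized set, even though the METPO component is not a biolink category.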
chemical-mapping (SKILL.md):
- Renamed every reference to the unified file from
`unified_ingredient_mappings.sssom.tsv.gz` to
`kgmicrobe_unified_entity_mappings.sssom.tsv.gz` (was stale since
commit b132be6). 6 occurrences updated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the working configuration from CultureBotAI/MicroGrowLink: runs
`anthropics/claude-code-action@v1` on every PR open/sync/reopen, loading
the `code-review@claude-code-plugins` plugin from the
anthropics/claude-code marketplace and dispatching
`/code-review:code-review <owner>/<repo>/pull/<num>` as the prompt.
Requires repo-level secret CLAUDE_CODE_OAUTH_TOKEN to be configured.
Without it the workflow will fail at the step but won't block the
existing kg-microbe QC checks.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reshape multi-line docstrings to comply with the project's pydocstyle
convention: opening triple-quote on its own line, then blank summary +
body or single-line summary fits in <=120 chars. Also covers two helper
docstrings in kg_microbe/utils/mapping_file_utils.py that the ruff CI
flagged on PR #558 build (3.10/3.11/3.12).
Tests: 82 pass after reshape; ruff check kg_microbe/ tests/ clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
No description provided.