
Team review sssom#558

Merged
realmarcin merged 56 commits into master from team-review-sssom
May 3, 2026

@realmarcin
Collaborator

No description provided.

realmarcin and others added 15 commits April 29, 2026 00:41
…idator extract_curie

- metatraits / metatraits_gtdb: extend `edge_header` with `value` and
  `unit` so quantitative-bin edges (temperature, NaCl, pH) preserve the
  original measurement alongside the binned METPO class. Threaded
  through `_classify_into_binned_range` and the temperature/salinity/pH
  classification methods. Recovering the underlying number is needed
  for the SSSOM-team review and downstream re-binning.
- constants: add VALUE_COLUMN to back the new edge column.
- scripts/consolidate_chemical_mappings.py: add `extract_curie` helper
  that preserves the original ontology prefix instead of fabricating
  `CHEBI:<digits>` from any numeric tail. Includes a small alias map
  (PUBCHEM.COMPOUND/PubChem/CAS-RN/etc.) so upstream prefix-spelling
  variants are normalised. Prevents the silent FOODON/UBERON/PubChem →
  CHEBI prefix-mangling regression documented in the audit trail.
- kg-release-diff: write reports to a timestamped artifact under
  `<skill>/reviews/` by default (with `--no-save` opt-out), matching
  the kg-model-review pattern.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
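The prefix-preserving behaviour described for `extract_curie` can be sketched roughly as below. This is a minimal illustration, not the script's actual code: the alias entries, the regex, and the choice to return `None` for values without a prefix are all assumptions.

```python
import re
from typing import Optional

# Hypothetical alias map; the real script's entries may differ.
PREFIX_ALIASES = {
    "PUBCHEM.COMPOUND": "pubchem.compound",
    "PUBCHEM": "pubchem.compound",
    "CAS-RN": "cas",
}

# prefix ":" local id, e.g. "FOODON:03420180" or "CAS-RN:50-00-0"
_CURIE_RE = re.compile(r"^\s*([A-Za-z][\w.\-]*):(\S+)\s*$")


def extract_curie(value: str) -> Optional[str]:
    """Return a normalised CURIE, keeping the original ontology prefix."""
    match = _CURIE_RE.match(value or "")
    if match is None:
        # No prefix present: refuse to guess instead of inventing "CHEBI:<digits>".
        return None
    prefix, local_id = match.groups()
    # Only spelling variants are rewritten; unknown prefixes pass through untouched.
    prefix = PREFIX_ALIASES.get(prefix.upper(), prefix)
    return f"{prefix}:{local_id}"
```

The key design point is that a bare numeric tail is never promoted to a CHEBI identifier; a value either carries its own prefix or is rejected.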
Reran scripts/consolidate_chemical_mappings.py against the refreshed
MIM SSSOM (1,705 rows, up from 1,695 — adds 10 NCIT-mapped MediaDive
ingredients newly created by the ingredient-mapping skill on the
mim-queue source: Activated charcoal NCIT:C77524, Beef NCIT:C71932,
Carrot NCIT:C72000, Fig NCIT:C71971, Ginger NCIT:C66725, Lemon
NCIT:C72005, Phosphate buffer NCIT:C29321, etc.).

mappings/ingredient_mappings.sssom.tsv (vendored MIM SSSOM) refreshed
by sync_mim_sssom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reran scripts/consolidate_chemical_mappings.py against the refreshed
MIM SSSOM (1,723 rows, up from 1,705 — adds 18 chemicals MIM
imported from kg-microbe's own out-of-SSSOM metatraits files via
the ingredient-mapping skill's new --source kgm-metatraits).

These chemistry-relevant mappings (e.g. Hydrogen sulfide, Indole,
Siderophore, Plastic, Hydrocarbon, Egg yolk, Pyrite, Serum) lived
only in kg-microbe's transform_utils/metatraits/mappings/ TSVs
before. Now they're first-class MIM ingredients flowing back into
the unified SSSOM via the priority-11 mediaingredientmech_reviewed
lane.

mappings/ingredient_mappings.sssom.tsv (vendored MIM SSSOM)
refreshed by sync_mim_sssom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MIM upstream fixed 4 chemical/ingredient mapping issues identified
during careful per-row reconciliation review of metatraits:

- Casein: CHEBI:3448 (REMOVED from CHEBI) → FOODON:03420180
- Citrate (NEW): CHEBI:16947 (citrate parent anion)
- Milk (NEW): UBERON:0001913 (milk anatomy)
- Meat_Extract (NEW): FOODON:03315424 (meat extract)

MIM SSSOM grew from 1,723 → 1,726 rows; consolidator absorbed all 3
new rows + the Casein update without further changes.

After regeneration, kg-microbe-review reduces:
- chemical_mappings: AGREE 7→8, MISSING 1→0 (DIVERGE 1 unchanged —
  SSSOM-artifact P2.5 narrowMatch only)
- special_chemical_mappings: AGREE 149→174, MISSING 6→0,
  DIVERGE 39→20

The 20 remaining DIVERGE in special_chemical_mappings.tsv are
kg-microbe-side action items (15 placeholder→authoritative-CHEBI/NCIT
updates + 2 wrong-CHEBI fixes for arsenate and dihydrogen) — not
addressed in this commit; documented separately for a follow-up PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sweep

Absorbs MIM commit 7b44151 — 4 new CultureMech-derived ingredient
mappings (Disodium_Phosphate_Heptahydrate, EDTA_acid_Form,
Ferric_Chloride_Hexahydrate, Sodium_Nitrate). MIM SSSOM grew
1726 → 1730 rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace 15 kgmicrobe.compound:* placeholders with authoritative CHEBI
or NCIT IDs, and correct 2 wrong CHEBI IDs that resolved to a
completely different chemical than the row's chemical_name. All 17
corrections sourced from the upstream MediaIngredientMech SSSOM.

Category A (placeholder → authoritative):
  Adenomycin, Avoparcin, Cetocycline, Dynemicin, Lydimycin, Steffimycin → NCIT
  Alanosine, Angustmycin, Ferroverdin, Kijanimicin, Miharamycin A,
  Monazomycin, Nocamycin, Rubradirin, Stallimycin → CHEBI

Category B (wrong CHEBI → correct):
  arsenate:    CHEBI:29242 (arsenite(1-)) → CHEBI:29125 (arsenate(3-))
  dihydrogen:  CHEBI:29356 (oxide(2-))    → CHEBI:18276 (dihydrogen)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The pre-fix ``extract_chebi_id`` regex (``re.search(r"(\d+)", v)``) used
to rewrite FOODON/UBERON/PubChem/CAS-RN values into ``CHEBI:<numeric_tail>``
when they appeared in the heterogeneous ``mapped`` column of
compound_mappings_strict.tsv. The earlier fix introduced ``extract_curie``
to preserve original prefixes for new ingestions, but three pollution
paths remained:

  1. The legacy ``mappings/unified_chemical_mappings.tsv.gz`` baseline
     re-seeded mangled rows on every run.
  2. The SSSOM baseline (``unified_ingredient_mappings.sssom.tsv.gz``)
     carried forward CHEBI:>=1M rows from earlier runs.
  3. ``compound_mappings_strict.tsv`` itself contains pre-mangled
     ``CHEBI:<7-9 digit>`` values in the ``mapped`` column for some
     ingredients (Tris-HCl, MnCl2, peptone, etc.).

Add ``is_mangled_chebi_id`` with three detection rules:

  - leading-zero local part (FOODON/UBERON regex output)
  - local part >= 1_000_000 (PubChem CIDs misrouted as CHEBI)
  - data-driven blacklist replayed from compound_mappings_strict ``mapped``
    cells, source-restricted to mediadive-style auto-mappers so curated
    rows survive when their CHEBI id collides with a CAS-RN first-numeric

Wire the guard into both baseline loaders and into
``load_compound_mappings`` itself. Replaces the narrower ``CHEBI:0*``
check with the unified detector.

Retire the legacy entity-centric TSV outputs:

  - delete ``mappings/unified_chemical_mappings.tsv.gz``
  - delete ``scripts/migrate_chemical_mappings.py`` (one-time migration)
  - drop ``load_existing_unified_tsv`` and the legacy_tsv_paths block in
    ``main()``; the SSSOM is now the single seeding source
  - rewrite ``mappings/validate_manual_mappings.py`` to read the SSSOM
    via a per-entity grouping helper

Run results (compound_mappings_strict still present):
  113 legacy mangled entries dropped, 5 SSSOM-baseline mangles dropped,
  5 source-loader pre-mangles skipped. Final SSSOM: 596,107 rows /
  56 prefixes / zero PubChem/CAS-RN mangles.

Add 5 unit tests for ``is_mangled_chebi_id`` covering all three rules,
source-restriction safety, real-CHEBI passthrough, and non-CHEBI rejection.

Refresh README + chemical-mapping SKILL.md to document the SSSOM as the
single source of truth and the data-driven mangle detection.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
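A rough sketch of the three detection rules in ``is_mangled_chebi_id`` follows. The signature, the source label ``mediadive_auto``, and the decision to return ``False`` for non-CHEBI input are assumptions made for illustration; only the rules themselves come from the commit message.

```python
from typing import AbstractSet, Optional

# Hypothetical label for the auto-mapper sources the blacklist applies to.
AUTO_MAPPER_SOURCES = frozenset({"mediadive_auto"})
PUBCHEM_WATERMARK = 1_000_000


def is_mangled_chebi_id(
    curie: str,
    source: Optional[str] = None,
    blacklist: AbstractSet[str] = frozenset(),
) -> bool:
    """Return True when a CHEBI CURIE looks like a rewritten non-CHEBI id."""
    prefix, _, local_id = curie.partition(":")
    if prefix != "CHEBI" or not local_id.isdigit():
        return False  # non-CHEBI values are out of scope for this detector
    if local_id.startswith("0"):
        return True  # rule 1: leading zero, typical of FOODON/UBERON local ids
    if int(local_id) >= PUBCHEM_WATERMARK:
        return True  # rule 2: too large for CHEBI, likely a PubChem CID
    # rule 3: blacklist hit, but only for auto-mapper sources so curated rows survive
    return curie in blacklist and source in AUTO_MAPPER_SOURCES
```

Restricting rule 3 by source is what lets a curated row keep a CHEBI id that happens to collide with a blacklisted numeric value.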
Add a ``--mappings`` / ``--mappings-only`` mode to the review skill so
every curation TSV the repo ships gets the same systematic check the
transform outputs already get.

Four file groups are validated:

  - canonical schema (5 metatraits TSVs sharing the standard
    subject_label / object_id / predicate_id / mapping_justification /
    confidence layout)
  - bespoke schemas (``enzyme_name_to_go.tsv``,
    ``special_chemical_mappings.tsv``)
  - queues / audit / proposals
    (``mediadive_unmapped_ingredients_to_curate.tsv``,
     ``culturebotai_reviewed_ingredients.tsv``)
  - SSSOM (``ingredient_mappings.sssom.tsv``) — YAML metadata block +
    SSSOM required columns + per-row CURIE / predicate / justification
    namespace checks. Fix the metadata reader to preserve YAML
    indentation (the prior ``lstrip`` collapsed ``curie_map:`` map
    entries into a flat list and broke the parse).
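The indentation bug in the metadata reader can be illustrated with a small sketch. The function name is hypothetical, and it assumes each metadata line carries ``#`` followed by at most one separator space; the returned lines would then be joined and handed to a YAML parser.

```python
from typing import Iterable, List


def sssom_metadata_lines(lines: Iterable[str]) -> List[str]:
    """Collect the leading '#'-commented YAML block, keeping nested indentation."""
    collected: List[str] = []
    for line in lines:
        if not line.startswith("#"):
            break  # the metadata block ends at the first non-comment line
        body = line[1:].rstrip("\n")
        # Remove only the single separator space after '#'. Calling lstrip()
        # here would put "  CHEBI: ..." at the same level as "curie_map:".
        if body.startswith(" "):
            body = body[1:]
        collected.append(body)
    return collected
```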

Per-row checks include CURIE format, registered prefixes, deprecated
biolink targets, METPO references resolvable in ontologies output,
ontology-id resolvability across CHEBI/GO/EC/UBERON/ENVO/HP/MONDO/PATO/
PR/CL/FOODON/NCBITaxon/OMP, ``predicate_id`` restricted to the
``skos:`` namespace, ``mapping_justification`` restricted to ``semapv:``,
``confidence`` ∈ {high, medium, low}.
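A subset of these per-row checks might look like the sketch below. The function name, the CURIE regex, and the message strings are illustrative assumptions; prefix registration and ontology resolvability are left out because they depend on repository data.

```python
import re
from typing import Dict, List

CURIE_RE = re.compile(r"^[A-Za-z][\w.\-]*:\S+$")
CONFIDENCE_LEVELS = {"high", "medium", "low"}


def check_row(row: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems for one mapped row."""
    problems: List[str] = []
    if not CURIE_RE.match(row.get("object_id", "")):
        problems.append("object_id is not a CURIE")
    if not row.get("predicate_id", "").startswith("skos:"):
        problems.append("predicate_id outside the skos: namespace")
    if not row.get("mapping_justification", "").startswith("semapv:"):
        problems.append("mapping_justification outside the semapv: namespace")
    if row.get("confidence", "") not in CONFIDENCE_LEVELS:
        problems.append("confidence not in {high, medium, low}")
    return problems
```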

Cross-file: same ``subject_label`` mapped to conflicting ``object_id``
across canonical files.
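The cross-file check amounts to grouping targets by label. A minimal sketch, assuming rows arrive as hypothetical ``(filename, subject_label, object_id)`` tuples:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_conflicts(
    rows: Iterable[Tuple[str, str, str]]
) -> Dict[str, Dict[str, List[str]]]:
    """Return labels that map to more than one object_id, with the files involved."""
    targets_by_label = defaultdict(lambda: defaultdict(set))
    for filename, subject_label, object_id in rows:
        if object_id:  # a blank object_id means "unmapped", not a conflict
            targets_by_label[subject_label][object_id].add(filename)
    return {
        label: {oid: sorted(files) for oid, files in targets.items()}
        for label, targets in targets_by_label.items()
        if len(targets) > 1
    }
```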

Append a markdown "Curation upgrade report" with six sections:

  1. Top unmapped MediaDive ingredients by occurrence (drives MIM /
     CultureBotAI curation priority)
  2. Cross-file mapping conflicts
  3. Object IDs not resolvable in the ontologies output
  4. Low-confidence canonical rows
  5. Prefix normalization candidates (PUBCHEM.COMPOUND →
     pubchem.compound, CAS-RN → cas)
  6. CultureBotAI ingredient review queue status counts

This is the artifact handed to upstream curation repos
(CultureBotAI / MIM / CultureBotHT) to drive new mappings.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Resyncs the kg-microbe ingredient mapping artifact with MIM
8151a23 (republish following the chemistry backfill + evidence
apply passes). Same 1,730 rows; the underlying mapping data is
unchanged but the YAML provenance dates moved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…f_Heart, Tomato_Juice)

Resyncs after MIM 2658f97 (FOODON pass --apply --high-only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 2, 2026 01:59
Contributor

Copilot AI left a comment

Pull request overview

This PR continues the repository’s migration from the legacy unified chemical TSV to the unified ingredient SSSOM as the canonical mapping artifact, while also hardening chemical CURIE handling and extending some review/transform tooling around mappings and quantitative trait metadata.

Changes:

  • Adds prefix-preserving CURIE extraction and mangled-CHEBI filtering to consolidate_chemical_mappings.py, plus focused unit tests for the helper functions.
  • Removes the obsolete migrate_chemical_mappings.py script and updates mapping docs/validation tooling to use unified_ingredient_mappings.sssom.tsv.gz.
  • Extends MetaTraits edge outputs with value/unit, updates curated special chemical mappings, and expands internal Claude review skills for mapping-file review/report generation.

Reviewed changes

Copilot reviewed 13 out of 16 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `tests/test_consolidate_chemical_mappings.py` | New unit tests for CURIE extraction and mangled-CHEBI detection helpers. |
| `scripts/migrate_chemical_mappings.py` | Deletes obsolete one-off migration script. |
| `scripts/consolidate_chemical_mappings.py` | Adds CURIE normalization/mangle filtering and removes legacy TSV reseeding path. |
| `mappings/validate_manual_mappings.py` | Switches manual audit script from legacy TSV parsing to grouped SSSOM parsing. |
| `mappings/unified_chemical_mappings.tsv.gz` | Legacy mapping artifact touched/removed as part of SSSOM migration. |
| `mappings/README.md` | Updates mapping documentation to describe SSSOM as source of truth. |
| `kg_microbe/transform_utils/metatraits_gtdb/metatraits_gtdb.py` | Extends MetaTraits-GTDB edge schema with value and unit. |
| `kg_microbe/transform_utils/metatraits/metatraits.py` | Emits quantitative provenance (value/unit) on binned phenotype edges. |
| `kg_microbe/transform_utils/metatraits/mappings/special_chemical_mappings.tsv` | Updates curated ontology mappings for specific chemicals/antibiotics. |
| `kg_microbe/transform_utils/constants.py` | Adds shared VALUE_COLUMN constant. |
| `.claude/skills/kg-release-diff/kg_release_diff.py` | Adds default review-path helper and new CLI options for report saving behavior. |
| `.claude/skills/kg-model-review/kg_model_review.py` | Adds mapping-file review mode, SSSOM/schema checks, and curation upgrade report generation. |
| `.claude/skills/kg-model-review/SKILL.md` | Documents new mapping-review capabilities and CLI options. |
| `.claude/skills/chemical-mapping/SKILL.md` | Updates chemical-mapping skill docs for SSSOM source-of-truth workflow. |

Comment thread .claude/skills/kg-release-diff/kg_release_diff.py
Comment thread mappings/validate_manual_mappings.py Outdated
Comment thread tests/test_consolidate_chemical_mappings.py
Comment thread kg_microbe/transform_utils/metatraits/metatraits.py
realmarcin and others added 12 commits May 1, 2026 19:03
Four threads, all addressed in code:

1. ``.claude/skills/kg-release-diff/kg_release_diff.py`` — wire up the
   advertised ``--no-save`` flag and ``--out`` default. Output policy is
   now: ``--out PATH`` writes to that path; ``--no-save`` prints to stdout
   only; otherwise auto-generate ``<skill>/reviews/<ts>_<old>_vs_<new>.md``
   via the existing ``_default_review_path`` helper. Previously both flags
   were declared but never consulted.

2. ``mappings/validate_manual_mappings.py`` — switch the SSSOM reader to
   a streaming row-by-row pass. The prior ``[line for line in f if not
   line.startswith('#')]`` materialised every non-comment line into a
   Python list before parsing, an O(file_size) memory spike that would
   eventually fail on the full unified mapping set (~600k rows).

3. ``tests/test_consolidate_chemical_mappings.py`` — add ``LoaderFiltering``
   class with two regression tests that exercise the loader-side filter
   paths (not just the ``is_mangled_chebi_id`` predicate). Uses tmpdir
   fixtures to drive ``load_compound_mappings`` and ``load_existing_unified``
   through clean rows, FOODON/UBERON-style mangles, PubChem-watermark
   mangles, blacklist-with-auto-source rows (drop), and blacklist-with-
   curated-source rows (keep). Catches typos in source-label matching or
   skip logic that could silently discard legitimate mappings.

4. ``tests/test_metatraits.py`` + ``tests/resources/metatraits_fixture.jsonl``
   — extend the existing transform smoke test to assert the new ``value``
   and ``unit`` columns are present in the edge header and populated for
   at least one quantitative phenotype edge. Adds a ``temperature growth``
   fixture record (``majority_label='Median: 37.0 Celsius'``) and asserts
   the binned-optimum edge carries ``value=37.0 unit=Celsius``. Catches
   header/order mismatches that could ship unnoticed.
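The output policy from thread 1 can be sketched with ``argparse`` as follows. The positional argument names, the timestamp format, the relative ``reviews`` directory, and the precedence of ``--out`` over ``--no-save`` when both are given are assumptions; the commit only states the three outcomes.

```python
import argparse
from datetime import datetime
from pathlib import Path
from typing import Optional

REVIEWS_DIR = Path("reviews")  # stands in for <skill>/reviews/


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Diff two KG releases.")
    parser.add_argument("old")
    parser.add_argument("new")
    parser.add_argument("--out", type=Path, default=None)
    parser.add_argument("--no-save", action="store_true")
    return parser


def resolve_output_path(
    args: argparse.Namespace, now: Optional[datetime] = None
) -> Optional[Path]:
    """Return the report path, or None when the report goes to stdout only."""
    if args.out is not None:
        return args.out  # explicit path wins
    if args.no_save:
        return None  # stdout only
    stamp = (now or datetime.now()).strftime("%Y%m%dT%H%M%S")
    return REVIEWS_DIR / f"{stamp}_{args.old}_vs_{args.new}.md"
```

Keeping the decision in one function makes it hard to declare a flag without consulting it, which was the original defect.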

All 102 affected tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
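The streaming pass from thread 2 can be approximated with a generator feeding ``csv.DictReader``, so that only one row is held at a time. The function name is hypothetical; in practice ``lines`` would be a handle such as ``gzip.open(path, "rt")``.

```python
import csv
from typing import Dict, Iterable, Iterator


def iter_sssom_rows(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield SSSOM rows one at a time, skipping '#' metadata lines lazily."""
    # A generator expression, unlike a list comprehension, never materialises
    # the whole file; DictReader pulls lines from it on demand.
    data_lines = (line for line in lines if not line.startswith("#"))
    yield from csv.DictReader(data_lines, delimiter="\t")
```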
The class docstring placed its summary on the first line after `"""`,
which D213 ("Multi-line docstring summary should start at the second
line") rejects. Insert the required line break and indentation after
the opening quotes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Surfaced from the bacdive isolation_source mapping audit as residual
microbial-trait labels with no existing ENVO/UBERON/PATO/MICRO term that
fits.

  - METPO:1007092 xerophilic phenotype  → subclass_of METPO:1007073
    osmotic tolerance.  Synonyms: xerophile, xerotolerant.  Captures the
    low-water-activity (aw < 0.85) niche.
  - METPO:1007093 epibiont phenotype    → subclass_of METPO:1000000.
    Synonyms: epibiont, ectosymbiont.  Captures the host-association
    mode (lives on external surface), distinct from endosymbiont.

Skipped: 'Xerophytic' is a plant trait — belongs in PO/EO, not METPO.

Regenerate proposal artifacts: 37 categorical terms (was 35), 43 OWL
class rows (was 41).  ROBOT template + ELK reasoner pass with no UNSAT
classes.  All 27 metatraits + extract_metpo_proposals tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New file: mappings/isolation_source_to_ontology.tsv.  Canonical 12-col
SSSOM-style schema (subject_label, object_id, predicate_id,
mapping_justification, confidence, …).  Covers all 358
``bacdive.isolation_source:*`` nodes from the merged KG.

Pipeline:
  1. Auto-mapper via OLS4 ``select`` endpoint with priority list
     ENVO > UBERON > FOODON > MONDO > NCIT.  Mapped 250/358 (70%).
  2. CURIE-format + object_source fixes (13 ``MONDO_NNNN`` →
     ``MONDO:NNNN``; 72 object_source values corrected to actual term
     prefix instead of queried-ontology name).
  3. Synonym-aware re-mapper: switched from ``select`` (label-only) to
     ``search`` endpoint (label + synonym), added label-variant
     generation (lowercase, hyphen → space, plural → singular,
     comma-split, suffix tokens).  Lifted coverage 70% → 94%.
  4. Manual review: dropped 5 corrupt rows (TSV bled in description /
     URL text); applied 21 row-level corrections after row-by-row audit
     flagged factually wrong matches (e.g. Boreal → UBERON:8910010
     stomatogastric nerve when target is ENVO:01000174 forest biome;
     Catheter → NCIT:C78232 catheter-related infection when target is
     NCIT:C50344 catheter device; Reptilia → NCIT:C158048 reptilian
     glycan when target is NCBITaxon:8504; Stem-Branch → ENVO:00000029
     watercourse when target is PO:0009047 stem; Urethra →
     UBERON:0001338 urethral gland when target is UBERON:0000057
     urethra; etc).
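The label-variant generation in step 3 might look something like the sketch below. The ordering, the naive plural rule, and the reading of "suffix tokens" as "the last whitespace-separated token" are assumptions, not the mapper's actual logic.

```python
from typing import List


def label_variants(label: str) -> List[str]:
    """Generate lookup variants for a label, most specific first."""
    variants: List[str] = []

    def add(candidate: str) -> None:
        candidate = candidate.strip()
        if candidate and candidate not in variants:
            variants.append(candidate)

    lowered = label.lower()
    add(lowered)
    spaced = lowered.replace("-", " ")  # hyphen -> space
    add(spaced)
    if spaced.endswith("s") and not spaced.endswith("ss"):
        add(spaced[:-1])  # naive plural -> singular
    for part in spaced.split(","):
        add(part)  # comma-split
    tokens = spaced.replace(",", " ").split()
    if len(tokens) > 1:
        add(tokens[-1])  # suffix token
    return variants
```

Broad variants such as a lone suffix token raise recall but also invite the lexical mismatches that the manual review in step 4 had to correct.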

Final state:
  - exactMatch: 172 / closeMatch: 160 / unmapped: 26.
  - 13 distinct ontologies: ENVO (105), UBERON (66), NCIT (38),
    FOODON (25), NCBITaxon (27), MONDO (13), PATO (10), PO (7),
    mesh (6), CHEBI (4), GO (2), METPO (2), plus 6 misc.

The 26 still-unmapped split into compound BacDive labels needing
decomposition (Cotton-other-fibres, Heated-Burned, …), generic
placeholders ('Other'), METPO proposal candidates already added in the
previous commit (Xerophilic, Epibiont, both will resolve once minted),
and host-modifier compounds.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two high-severity findings from Codex review on PR #558 (items 1 and 2),
plus a CI validation follow-up (item 3):

1. Non-CURIE placeholders marked exactMatch/high (15 rows).  Original
   OLS auto-mapper accepted GOLD-database hits whose ``obo_id`` was a
   bare label (``Anaerobic-digestor``, ``Bioremediation``, ``Cave-water``,
   ``Coalbed-water``, ``Defined-media``, ``Endosphere``, ``Engineered-product``,
   ``Industrial-production``, ``Lab-enrichment``, ``Lab-synthesis``,
   ``Phyllosphere``, plus a bare ``D011214``).  3 of these had real OBO
   targets and were rebound (``Indoor-Air`` → ENVO:01000855,
   ``Outdoor-Air`` → ENVO:01000829, ``Peat-moss`` → mesh:D044003); the
   other 12 had no clean target and are now correctly unmapped.

2. Semantic mismatches from lexical-only matching:
     - ``Air-conditioner`` was NCIT:C196790 *Air Conditioner Lung disease*
       → unmapped (no clean ontology target)
     - ``Clean-room`` was NCIT:C106896 *ADCS-ADL questionnaire item*
       → ENVO:03600000 cleanroom
     - ``Thermal-spring`` was NCIT:C125898 *topical solution*
       → ENVO:00000051 hot spring
     - ``Urogenital-tract`` was MONDO:0019356 *malformation* (a disease)
       → UBERON:0004122 genitourinary system
     - ``Wastewater`` was ENVO:00002043 *wastewater treatment plant*
       → ENVO:00002001 waste water (the substance)
   Plus descendant drift: ``Ankle`` (was nerve → ankle joint),
   ``Bladder`` (was lumen → bladder organ), ``Tooth`` (was placode →
   calcareous tooth), ``Tundra`` (was ``tundra mire`` → ``tundra``).
   ``Specimen``, ``Tree``, ``Waste``, ``Air-conditioner`` had no clean
   ontology target and are now unmapped.

3. CI validation: the file is now registered in kg-model-review's
   ``GROUP_A_CANONICAL`` (filename → directory dict), so ``poetry run
   python .claude/skills/kg-model-review/kg_model_review.py
   --mappings-only`` will:
     - reject any non-CURIE ``object_id``,
     - reject partial rows (mapped but missing predicate / justification),
     - allow fully-blank rows as legal unmapped curation candidates,
     - flag unregistered prefixes (extended STANDARD_PREFIXES with
       mesh, NCIT-adjacent, PRIDE, ExO, VariO, SNOMED, BTO, AGRO, FAO,
       OBI, AEO, GENEPIO, PCO, UO so the review only flags genuinely
       unknown prefixes).

Final state: 358 rows; 164 exactMatch / 152 closeMatch / 42 unmapped.
Validator: 0 errors, 1 warning (``Wound→UBERON:0006988`` not in local
ontologies/nodes.tsv snapshot — real UBERON term, downstream-resolvable).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
UBERON has no 'wound' term; my prior closeMatch UBERON:0006988 was
fabricated.  The closest standard cross-domain term is
mesh:D014947 'Wounds and Injuries'.

After this fix the kg-model-review --mappings-only run is fully clean:
0 ERRORs, 0 WARNINGs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Codex adversarial review flagged that several rows in
mappings/isolation_source_to_ontology.tsv mapped isolation sources
to MONDO disease terms — semantically wrong (MONDO models
diseases; isolation sources are where an organism was found).

Data fixes (12 rows; curator=codex_review_fix_v2):

  Abort           MONDO:0041526 → unmapped
                  (was 'pregnancy disorder with abortive outcome';
                   abortion-as-event has no clean isolation-source
                   ontology)
  Abscess         MONDO:0005227 → UBERON:0006548 (abscess)
                  (UBERON has abscess as tissue/structure)
  Canker          MONDO:0005318 → unmapped
                  (was 'canker sore'; canker as plant lesion no
                   clean ontology)
  Cystic-fibrosis MONDO:0009061 → unmapped
                  (CF context isn't itself an isolation source —
                   real sources are CF-patient lung/sputum)
  Disease         MONDO:0000001 → unmapped (too generic)
  Heavy-metal     MONDO:0023305 → CHEBI:25555 (monoatomic ion)
                  (was 'heavy metal poisoning'; chemical class is
                   the right scope)
  Host            MONDO:0013730 → unmapped
                  (was 'graft versus host disease'; 'host' as
                   isolation source is too generic)
  Iron-mat        MONDO:0017988 → ENVO:01000110 (microbial mat)
                  (was 'multifocal atrial tachycardia' — matched
                   on the 'MAT' abbrev; iron-mat is microbial mat)
  Meningitis      MONDO:0021108 → unmapped
                  (disease context; real sources are CSF/meninges)
  Mycosis         MONDO:0009691 → unmapped
                  (was 'mycosis fungoides'; generic mycosis no
                   clean ontology term)
  Tick            MONDO:0025294 → NCBITaxon:6939 (Ixodida)
                  (was 'tick-borne disease'; ticks are NCBITaxon)
  Tuberculosis    MONDO:0018076 → unmapped
                  (disease context; real sources are
                   lung/sputum from TB patients)

CI workflow (.github/workflows/validate-isolation-source.yaml):
Checks out culturebotai-claw alongside this repo on every PR
that touches the TSV; runs claw's validate_isolation_source_mapping.py
which enforces:
  - CURIE format on every non-empty object_id
  - object_source.upper() == prefix.upper()
  - SKOS predicate vocabulary
  - semapv: justification vocabulary
  - confidence ∈ {high, medium, low}
  - ontology category allowlist (no MONDO/DOID/HP)
  - NCIT/mesh label-keyword warnings
  - empty object_id ⇒ empty object_source/predicate
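Three of these structural rules are sketched below: the source/prefix match, the category restriction, and the empty-row implication. The validator itself lives in another repository, so the function name and messages are hypothetical, and the allowlist is simplified here to a denylist of the three prefixes named above.

```python
from typing import Dict, List

DISALLOWED_PREFIXES = {"MONDO", "DOID", "HP"}  # disease / phenotype ontologies


def structural_errors(row: Dict[str, str]) -> List[str]:
    """Return structural problems for one isolation-source mapping row."""
    object_id = row.get("object_id", "")
    object_source = row.get("object_source", "")
    predicate = row.get("predicate_id", "")
    if not object_id:
        # Unmapped rows must not carry a source or a predicate.
        if object_source or predicate:
            return ["empty object_id with non-empty object_source/predicate"]
        return []
    prefix, sep, local_id = object_id.partition(":")
    if not (sep and prefix and local_id):
        return ["object_id is not a CURIE"]
    errors: List[str] = []
    if object_source.upper() != prefix.upper():
        errors.append("object_source does not match CURIE prefix")
    if prefix.upper() in DISALLOWED_PREFIXES:
        errors.append("disallowed ontology category")
    return errors
```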

After fixes the validator reports 0 errors / 1 warning (the
remaining Biopsy → NCIT:C15189 'Biopsy Procedure' is borderline
acceptable — biopsy specimens ARE valid isolation sources, just
labeled as the procedure).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
realmarcin and others added 6 commits May 2, 2026 14:08
Both checks failing on PR #558 (kg-microbe QC + Validate isolation_source)
have been failing on team-review-sssom for every commit since 01b9931
because they depend on artifacts unavailable in the CI environment. Two
independent fixes.

1. metatraits transform: fetch metpo.json from upstream when missing
   (kg_microbe/transform_utils/metatraits/metatraits.py)

   In CI, data/raw/metpo.json is absent (it's a download.yaml artifact, not
   in the repo), so _load_metpo_lookups() and _load_metpo_binned_ranges()
   silently returned empty, breaking the discrete-trait pathway. With empty
   METPO label/synonym lookups, "gram positive" never resolved to METPO:1000698
   in tests/test_metatraits.py::test_run_with_fixture, failing the assertion
   that 0%-pct_true edges are emitted.

   New _resolve_metpo_json_path() helper:
   * Returns RAW_DATA_DIR/metpo.json if it already exists (fast path).
   * Otherwise fetches the upstream copy
     (https://raw.githubusercontent.com/berkeleybop/metpo/main/metpo.json)
     into RAW_DATA_DIR so subsequent loaders find it. The download is
     idempotent and shared between binned-ranges + lookups.
   * On network failure, returns None and the caller short-circuits (same
     behavior as before, but with an explicit, useful error rather than a
     silent fallback that broke downstream tests).

   Verified: hiding the local copy and rerunning
   tests/test_metatraits.py::test_run_with_fixture exercises the new
   fallback path and the test still passes.

2. validate-isolation-source workflow: soft-gate culturebotai-claw
   (.github/workflows/validate-isolation-source.yaml)

   The structural validator lives in CultureBotAI/culturebotai-claw, which
   is not readable by this repo's GITHUB_TOKEN — actions/checkout returns
   404 (Not Found) and the workflow fails at the checkout step. Made the
   checkout step `continue-on-error: true` and gated the structural-validate
   step on `steps.checkout_claw.outcome == 'success'`. When the repo
   becomes accessible, the soft gate becomes a hard gate again automatically.

   The in-repo family-compatibility validator
   (mappings/validate_isolation_source_mappings.py) was promoted to run
   first as the *hard* gate — it's the one that actually catches semantic
   regressions like 'Foot' → UO:0010013 (units used for anatomy).

   Workflow now emits a `::warning::` when the external validator is
   skipped, so the gap is visible in the Actions UI rather than silent.
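The fetch-when-missing behaviour of ``_resolve_metpo_json_path()`` from item 1 could be sketched as below. The parameter names, the temporary ``.part`` file, the 30-second timeout, and the use of ``print`` for the error are assumptions; the real helper reads ``RAW_DATA_DIR`` from the package constants.

```python
import shutil
import urllib.error
import urllib.request
from pathlib import Path
from typing import Optional

METPO_JSON_URL = "https://raw.githubusercontent.com/berkeleybop/metpo/main/metpo.json"


def resolve_metpo_json_path(
    raw_data_dir: Path, url: str = METPO_JSON_URL
) -> Optional[Path]:
    """Return a local metpo.json path, downloading it once if it is missing."""
    target = raw_data_dir / "metpo.json"
    if target.exists():
        return target  # fast path: already downloaded
    partial = target.with_suffix(".json.part")
    try:
        raw_data_dir.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(url, timeout=30) as response, open(partial, "wb") as out:
            shutil.copyfileobj(response, out)
    except (urllib.error.URLError, OSError) as exc:
        # Explicit message instead of silently returning empty lookups.
        print(f"metpo.json unavailable ({exc}); METPO lookups will be skipped")
        partial.unlink(missing_ok=True)
        return None
    partial.replace(target)  # publish only a complete download
    return target
```

Writing to a partial file first keeps an interrupted download from being mistaken for a valid copy on the next run.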

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…enrichment

Adds three small/mid-size ontologies to the transform and enriches the
two PRIDE/PCO stub-prefix CURIEs that BacDive's isolation_source
mapping table references but no transform was loading. Driven by the
prefix-frequency analysis on the latest merged-kg.

Why each ontology:

* PO (Plant Ontology, 5.4 MB) — 51 distinct IDs in BacDive
  isolation_source mappings (root, leaf, flower, rhizome, etc.).
  Currently emitted in 882 organism→PO edges with bare metadata.

* TAXRANK (Taxonomic Rank Vocabulary, 54 KB) — 50 distinct rank IDs
  emitted directly by the NCBITaxon transform's OAK rank annotations.
  Tiny ontology, normalizes labels/definitions for nodes already
  present in merged-kg.

* MICRO (Ontology of Prokaryotic Phenotypic and Metabolic Characters,
  10.3 MB) — 48 high-confidence
  MIM mappings already point at MICRO terms (Bacto-tryptone,
  Brain heart infusion, Tryptic soy broth, Nutrient broth No. 2, etc.).
  The unified chemical mappings file admits MICRO as of e9e6f1e, and
  ChemicalMappingLoader.find_chebi_by_name already returns MICRO IDs
  when appropriate — but the merged-kg I reviewed was built from a
  May 2 01:22 MediaDive transform output, *before* the May 2 14:01
  unified-mappings regen. So MICRO emissions just need a fresh
  MediaDive run; no resolver code change required.

Why PRIDE / PCO get hardcoded enrichment instead of full ontology load:

* PRIDE: only 3 distinct IDs in the entire merged-kg (PRIDE:0000685
  host body site, PRIDE:0000686 host body product, PRIDE:0001000
  antibiotic treatment). All 18,752 organism→PRIDE edges fan out
  from these 3 stub classes. Loading the full PRIDE CV for 3 IDs
  is wasteful.

* PCO: 1 actively-used ID (PCO:1000004 microbial community). The
  other 7 PCO IDs in merged-kg leak in as xref propagation through
  ENVO/MONDO imports — they're not directly mapped from BacDive.

Implementation:

* download.yaml gains three new entries (po.owl / taxrank.owl /
  micro.owl) following the existing per-ontology comment pattern.

* ONTOLOGIES_MAP in ontologies_transform.py gains the corresponding
  three keys.

* isolation_source_mapping_utils.py gains STUB_ONTOLOGY_PREFIXES
  (frozenset of {"PRIDE", "PCO"}) and STUB_ONTOLOGY_CATEGORY
  ("biolink:OntologyClass"). These are the prefixes the BacDive
  transform should emit thin node rows for, since the ontologies
  transform won't.

* BacDive's isolation_source emit path (bacdive.py) now writes a
  thin node row for any mapped CURIE whose prefix is in the stub
  set, using the object_label from the mapping TSV. Loaded-ontology
  targets (UBERON, ENVO, ...) still get their node from the
  ontologies transform — no double-emit.
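The stub-prefix path described above can be sketched as a single helper. The two constants are taken from the commit message; the function name and the node columns (``id``, ``category``, ``name``) are assumptions about the node schema.

```python
from typing import Dict, Optional

STUB_ONTOLOGY_PREFIXES = frozenset({"PRIDE", "PCO"})
STUB_ONTOLOGY_CATEGORY = "biolink:OntologyClass"


def thin_node_row(curie: str, object_label: str) -> Optional[Dict[str, str]]:
    """Return a minimal node row for stub-prefix CURIEs, else None."""
    prefix = curie.split(":", 1)[0]
    if prefix not in STUB_ONTOLOGY_PREFIXES:
        # Loaded ontologies (UBERON, ENVO, ...) get their node from the
        # ontologies transform, so emitting here would double up.
        return None
    return {"id": curie, "category": STUB_ONTOLOGY_CATEGORY, "name": object_label}
```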

Re-run scope before next merge:

* `kg download` — pull po.owl, taxrank.owl, micro.owl into data/raw/
* `kg transform -s ontologies` — emit nodes/edges for the new
  ontologies into data/transformed/ontologies/{po,taxrank,micro}_*.tsv
* `kg transform -s mediadive` — pick up the unified-mappings regen
  with MICRO targets (no code change, just stale-output refresh)
* `kg transform -s bacdive` — emit thin PRIDE/PCO nodes via the
  new STUB_ONTOLOGY_* path
* `kg merge` — final assembly

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Some upstream OWL→JSON conversions emit synonym annotations without a
literal value. The MICRO ontology has one such entry (MICRO:0003152
hasRelatedSynonym with 'pred' but no 'val'); KGX's obograph reader
assumes every synonym carries 'val' and crashes with KeyError on the
missing key, blocking the entire ontologies transform after taxrank.

Adds _sanitize_obograph_synonyms() that rewrites the converted JSON in
place to drop malformed synonym entries before KGX reads it. Runs once
per ontology between robot's OWL→JSON conversion and KGX's transform.
Well-formed synonyms are unchanged. The dropped count is logged so the
upstream issue stays visible.
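The core of the sanitizer can be sketched on an already-parsed obograph document. This version assumes the usual ``graphs`` / ``nodes`` / ``meta`` / ``synonyms`` layout and leaves out the file read and in-place rewrite that the real ``_sanitize_obograph_synonyms()`` performs; the function name here is hypothetical.

```python
from typing import Any, Dict


def drop_malformed_synonyms(graph_doc: Dict[str, Any]) -> int:
    """Remove synonym entries that lack 'val'; return how many were dropped."""
    dropped = 0
    for graph in graph_doc.get("graphs", []):
        for node in graph.get("nodes", []):
            meta = node.get("meta") or {}
            synonyms = meta.get("synonyms")
            if not synonyms:
                continue
            kept = [s for s in synonyms if isinstance(s, dict) and "val" in s]
            dropped += len(synonyms) - len(kept)
            meta["synonyms"] = kept  # well-formed entries are left unchanged
    return dropped
```

Returning the count lets the caller log it, which keeps the upstream defect visible.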

Also registers infores knowledge sources for po, taxrank, micro that
were added to ONTOLOGIES_MAP in the prior commit.

Verified: sanitizer clears the 1 bad synonym in MICRO; rerunning
'kg transform -s ontologies' should now load all 16 ontologies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cleans the post-Codex / post-validator residue: 1 ERROR (Abscess →
HP, a disallowed phenotype ontology) and 6 WARNINGS where the lexical
hit had drifted into a too-specific descendant.

Errors fixed (1):
  Abscess → HP:0025615 → mesh:D000038 'Abscess'
    HP is a phenotype ontology (disallowed); MeSH D000038 is the
    canonical Subject Heading for abscess as a clinical sample type.

Drift fixes — generic parent term (6):
  Joint               → UBERON:0008114 (joint of girdle, too narrow)
                        → UBERON:0004905 'articulation' (synonym 'joint')
  Mangrove            → ENVO:02000138  (mangrove biome soil, only soil)
                        → ENVO:01000181 'mangrove biome' (covers all samples)
  Hot                 → ENVO:00000051  (hot spring, a specific feature)
                        → ENVO:01000305 'high temperature environment'
  Volcanic            → ENVO:00000354  (volcanic field, a subtype)
                        → ENVO:00000094 'volcanic feature' (parent landform)
  Thoracic-segment    → UBERON:0003827 (thoracic segment bone, only bone)
                        → UBERON:0000915 'thoracic segment of trunk' (region)
  Fermented           → FOODON:00001098 (fermented apple beverage, false hit)
                        → unmapped (no clean parent term)

The remaining 8 closeMatch rows previously flagged by the validator's
descendant-drift heuristic (Aquaculture, Biopsy, Bladder-stone,
Currency, Plaque, Sandy, Tooth, Water-treatment-plant) were manually
reviewed and confirmed as the canonical curator-intended mapping;
they are now whitelisted in the validator (claw side, separate
commit).

Validator state on this file:
  errors: 0
  warnings: 0

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dedicated workflow ran the in-repo family-compatibility validator on
PRs touching mappings/isolation_source_to_ontology.tsv (or the validator
/ loader sources). Every check it performed is already covered by the
regular QC pytest suite via tests/test_isolation_source_mapping_utils.py:

  - test_validator_passes_on_committed_mapping_file — runs the validator
    against the committed TSV and asserts zero failures
  - test_validator_rules_match_loader — catches drift between validator
    and runtime loader rule sets
  - test_validator_flags_synthetic_family_mismatch — exercises the
    failure path on a synthetic UO-anatomy mismatch

The standalone script at mappings/validate_isolation_source_mappings.py
remains in the repo and can still be invoked directly by curators or
tooling that wants validator output without the pytest harness.

The companion external validator hosted in CultureBotAI/culturebotai-claw
is org-private and its checkout was already failing in CI (404), making
that step a no-op.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…esh:C*

The kg-microbe special_chemical_mappings.tsv held kgmicrobe.compound:*
mints for ~107 antibiotic / secondary-metabolite traits ('produces:
setamycin', 'produces: rhodomycin A', etc.). For 38 of these, MIM
had since added authoritative mesh:C* identifiers (via its
auto_classify_ingredient_type and backfill_parent_terms passes).
The two sources disagreeing on the canonical id for the same chemical
is the kind of cross-file conflict the kg-model-review report flags:
'these are out-of-SSSOM, so they need explicit reconciliation (pick a
canonical per chemical)'.

This commit picks MIM's mesh:C* as the canonical id and rewrites the
38 affected rows in kg_microbe/transform_utils/metatraits/mappings/
special_chemical_mappings.tsv. The notes column gains a
'reconciled: was kgmicrobe.compound:X; MIM authoritative mapping → Y'
line so the swap stays auditable.
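
The per-row rewrite can be sketched as follows. This is an illustration only: `reconcile_row` is a hypothetical helper, the column names `ontology_id` and `notes` are taken from this message, and the identifiers in the usage example are placeholders rather than real rows from the file.

```python
from typing import Dict

MINT_PREFIX = "kgmicrobe.compound:"


def reconcile_row(row: Dict[str, str], mim_canonical: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the row with a kg-microbe mint swapped for MIM's id.

    mim_canonical maps a kgmicrobe.compound:* id to the MIM-preferred id.
    Rows without a MIM-side mapping are returned unchanged.
    """
    old_id = row["ontology_id"]
    new_id = mim_canonical.get(old_id)
    if new_id is None or not old_id.startswith(MINT_PREFIX):
        return dict(row)  # no MIM mapping yet: keep the kg-microbe mint
    note = f"reconciled: was {old_id}; MIM authoritative mapping → {new_id}"
    existing = row.get("notes", "").strip()
    updated = dict(row)
    updated["ontology_id"] = new_id
    # Append rather than overwrite so earlier curator notes survive.
    updated["notes"] = f"{existing}; {note}" if existing else note
    return updated
```

Usage, with placeholder identifiers:

```python
row = {"ontology_id": "kgmicrobe.compound:example", "notes": ""}
print(reconcile_row(row, {"kgmicrobe.compound:example": "mesh:C_PLACEHOLDER"}))
```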

Why MIM wins: per the chemical-mapping skill priority table, MIM
(mediaingredientmech_reviewed) is priority 11 — the highest in the
unified consolidator and the canonical-naming source for ingredient
mappings. mesh:C* identifiers are in the published MeSH supplementary
chemical concept space and resolve to upstream definitions; kg-microbe
mints are stub identities only.

Side notes:

* The 38 corresponding kgmicrobe.compound:* entries in
  kg_microbe/transform_utils/custom_curies.yaml are intentionally NOT
  removed. They remain registered as cross-references because MIM
  itself uses them as registry/identity rows (skos:exactMatch on the
  kg-microbe side, with a parent mesh:C* row), and dropping them
  here would orphan those MIM xref rows.

* The remaining 69 kgmicrobe.compound:* rows in the file have no
  MIM-side mapping yet — they stay as kg-microbe mints until a future
  MIM curation pass picks them up.

* No transform code changes needed. _load_special_chemical_mappings()
  reads the ontology_id column directly, so the next metatraits run
  picks up the swap automatically.

Verified locally:
* awk filter shows 69 kgmicrobe.compound rows remaining (was 107)
* tests/test_metatraits.py::test_run_with_fixture passes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 27 out of 30 changed files in this pull request and generated 3 comments.



Comment thread kg_microbe/transform_utils/constants.py Outdated
Comment thread kg_microbe/utils/isolation_source_mapping_utils.py Outdated
Comment thread kg_microbe/transform_utils/metatraits/metatraits.py Outdated
realmarcin and others added 7 commits May 2, 2026 17:01
Every BacDive transform run was logging:

  WARNING:kg_microbe.utils.isolation_source_mapping_utils:Dropping
  family-mismatched mapping: 'Currency' → ENVO:00003896 ('currency note')

The mapping was actually semantically correct — currency (banknotes /
coins) is a legitimate fomite isolation source in microbiology, and
ENVO:00003896 'currency note' is the right ontology target for
microbe-on-currency studies. The warning fired only because
'currency note' had been added defensively to
BANNED_OBJECT_LABEL_SUBSTRINGS during the original family-mismatch
sweep, treating it as if it were a non-substrate stub. That entry
was overly aggressive.

Four coordinated changes:

1. kg_microbe/utils/isolation_source_mapping_utils.py — drop
   'currency note' from BANNED_OBJECT_LABEL_SUBSTRINGS so the
   runtime loader stops rejecting the row on family grounds.

2. mappings/validate_isolation_source_mappings.py — same removal
   in the standalone CI validator. Required because
   tests/test_isolation_source_mapping_utils.py::test_validator_rules_match_loader
   asserts the two banned lists are equal.

3. mappings/isolation_source_to_ontology.tsv — promote the
   Currency row from ols4_auto closeMatch / medium /
   LexicalMatching to ManualMappingCuration / exactMatch / high so
   the loader's trust policy honors it. Notes column records the
   promotion rationale for audit.

4. tests/test_isolation_source_mapping_utils.py — the
   test_loader_rejects_low_trust_lexical_close_matches test was
   asserting that 'currency' stayed unmapped (used it as the
   canonical "untrusted auto-match should be dropped" example).
   Swapped to 'aquaculture' which is still an unpromoted ols4_auto
   closeMatch row in the TSV.

Net effect on next merged-kg: ~233 organism → isolation_source:currency
edges become organism → ENVO:00003896 edges, with the ENVO node
supplying the canonical label, definition, and biolink:EnvironmentalFeature
category from the ontologies transform.

Verified locally: 9/9 tests pass, validator OK,
load_isolation_source_mappings() returns
('ENVO:00003896', 'currency note') for 'currency'.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up MIM@2527d95 ("Re-backfill chemistry + kg_microbe_node_id
post-dihydrate-fix"):
  - Calcium_Chloride: CHEBI:86158 (dihydrate) → CHEBI:3312 (anhydrous)
  - Sodium_Citrate_2: CHEBI:32142 (dihydrate) → CHEBI:53258 (anhydrous)

mappings/ingredient_mappings.sssom.tsv (vendored MIM) re-synced via
sync_mim_sssom() from the MIM sibling repo.
mappings/unified_ingredient_mappings.sssom.tsv.gz regenerated.

Final cross-repo state per claw `just kg-microbe-review`:
  IN_SYNC:           1860 / 1860
  CHEBI_DIVERGED:       0
  STALE_IN_KGM:         0
  MIM_LEGACY_IN_KGM:    0
  metatraits chemical_mappings:  AGREE=8 DIVERGE=1 (glucose form variant) MISSING=0
  metatraits special_chemicals:  AGREE=187 DIVERGE=7 MISSING=0

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three unresolved threads addressed:

1. constants.py:237 — Revert RHEA_TO_EC_EDGE from biolink:close_match
   back to biolink:enabled_by. close_match would have changed every
   rhea2ec edge from "this reaction is enabled by this enzyme class"
   to "these identifiers are approximately equivalent" — that loses
   the directional reaction-to-enzyme semantics that the Rhea loader
   and downstream consumers expect. The kg-model-review domain/range
   warning that motivated the close_match swap remains, but it's an
   artifact of biolink:enabled_by being defined for gene-product →
   activity (not activity-class → activity-class as Rhea↔EC is). The
   warning is documented as accepted in a constants.py comment.

2. isolation_source_mapping_utils.py:118 — Remove the unused
   iter_validation_failures function from the loader module. It
   shared a name with the standalone validator's helper but did NOT
   apply _row_is_trusted, so any caller of this shared helper would
   have gotten false validation failures that don't reflect runtime
   behavior. The standalone validator at
   mappings/validate_isolation_source_mappings.py has its own copy
   (which DOES apply trust), so the loader's version was dead code
   with drift potential. Also dropped the unused Iterable import and
   the __all__ entry.

3. metatraits.py:362 — Remove _resolve_metpo_json_path() and its
   network fallback. Production transforms should not reach external
   services at runtime (Copilot's concern: it would mutate the
   checkout with a surprise HTTP request and break offline / sandboxed
   CI/release environments). The network call has been moved into
   tests/conftest.py as a session-scoped autouse fixture
   ensure_metpo_json_for_tests() — same effect for pytest runs (which
   was the only consumer of the fallback) but no longer touches
   production code. The fixture also honors KG_MICROBE_TESTS_NO_NETWORK
   for fully-offline test runs.

Verified locally:
* poetry env tests: 35/35 pass (test_isolation_source_mapping_utils + test_metatraits)
* python mappings/validate_isolation_source_mappings.py → OK
* load_isolation_source_mappings() smoke test still resolves Currency

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eview

Three independent audit agents reviewed the BacDive isolation-source TSV
and four metatraits mapping files against OLS, OBO Foundry, ChEBI, GO,
EC/IUBMB, MeSH, NCIT, and primary literature. This commit applies the
high-confidence fixes; NEEDS_HUMAN_REVIEW items and the auto-generated
metpo_alias_mappings.tsv issue are deferred (the latter requires fixing
extract_metpo_proposals.py upstream).

mappings/isolation_source_to_ontology.tsv (16 rows):

  Family / scope corrections:
   * Gastrointestinal-tract  NCIT:C34082 → UBERON:0005409 (3,320 edges)
   * Lymph-node              NCIT:C12745 → UBERON:0000029
   * Inflammation            NCIT:C3137 → mesh:D007249  (consistency w/ Abscess→mesh)
   * Periodontal-pocket      NCIT:C62547 → mesh:D010520
   * Industrial-waste        NCIT:C577   → ENVO:00002267  (consistency w/ Industrial-wastewater→ENVO)
   * Dairy-product           NCIT:C413   → FOODON:00001256
   * Built-environment       ExO:0000048 → mesh:D000076624 (1,324 edges; ExO is exposure-science chemicals)
   * Zebrafish               FOODON:03000002 → NCBITaxon:7955  (Danio rerio — host taxon, not food)

  Unmapped (process / state / qualifier / vague — not a substrate):
   * Treatment        was AGRO:00000322 (Agronomy crop treatment, wrong family)
   * Biodegradation   was ENVO:06105014 (a process, not a site/material)
   * Climate          was ENVO:01001082 (long-term weather summary, not habitat)
   * In-situ          was NCIT:C14160 (medical 'carcinoma in situ', not habitat)
   * Immunocompromised was NCIT:C14139 (host state, not source)
   * Endosymbiont     was VariO:0570 (Variation Ontology, wrong family)
   * Co-culture       was mesh:D018920 (research method, not sample type)
   * Contaminant      was NCIT:C84280 (too vague to map)

mappings/validate_isolation_source_mappings.py + isolation_source_mapping_utils.py:
  Removed 'industrial waste material' from BANNED_OBJECT_LABEL_SUBSTRINGS
  (same false-positive class as 'currency note' that was removed earlier —
  the ENVO term IS a legitimate isolation source for Industrial-waste).

metatraits/special_chemical_mappings.tsv (5 rows):
   * row 15  produces: DL-lactate          CHEBI:16651 → CHEBI:24996
              (was (S)-lactate / L-form only; DL needs generic parent)
   * row 85  produces: poly(L-lysine)      kgmicrobe.compound:* → CHEBI:61490
   * row 188 produces: piericidin          kgmicrobe.compound:* → CHEBI:138511
   * rows 193,194 growth: soyton/proteose  FOODON:03302071 → FOODON:00002992
              (CRITICAL: FOODON:03302071 is "green kidney bean", NOT proteose peptone)

metatraits/enzyme_name_to_go.tsv (1 row):
   * row 31  alpha-xylosidase  GO:0046558 → GO:0061634
              (CRITICAL: GO:0046558 is an arabinosidase EC 3.2.1.99 — wrong enzyme;
               GO:0061634 EC 3.2.1.177 is the actual alpha-xylosidase)

metatraits/phenotype_mappings.tsv (1 row):
   * row 10  voges-proskauer test  METPO:1005017 → METPO:1005016
              (was the 'positive' outcome variant; subject names the test itself)

Deferred:
 * metpo_alias_mappings.tsv has ~15 over-generalizations to parent METPO
   classes where specific child classes exist (rod-shaped, aerobic, BSL-1,
   motile, etc.). Direct edits to that file get reverted because it is
   auto-generated by scripts/extract_metpo_proposals.py. Filed as a
   follow-up task to fix the extractor's synonym resolution.
 * Several NEEDS_HUMAN_REVIEW items in the audit reports (Algae, Yeast
   polyphyletic mappings; alpha-maltosidase EC precision; citrate
   protonation state).

Each row updated has curator='kg_review_lit_check' and a notes-column
explanation citing the source of the correction.

Verified: 36/36 tests pass, isolation-source validator OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The May-1 custom-term subclassing review identified ~10K missing
biolink:subclass_of edges that would type kg-microbe-minted CURIEs
under their canonical OBO parents. This commit ships 4 of the 5
recommended emit-side changes (the 5th — surfacing MIM
skos:narrowMatch as subclass_of edges — is a multi-file plumbing
task and ships in a follow-up commit).

Each new edge is:
  predicate                     biolink:subclass_of
  relation                      rdfs:subClassOf
  primary_knowledge_source      <transform's source>
  knowledge_level               knowledge_assertion
  agent_type                    manual_agent
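
As a rough sketch, an edge with the fields listed above could be built like this. The helper name `subclass_edges`, the `subject` / `object` keys, and the `infores:bacdive` source value in the usage example are assumptions for illustration; the real transforms write rows in their own column order.

```python
from typing import Dict, Iterable, List

SUBCLASS_PREDICATE = "biolink:subclass_of"
SUBCLASS_RELATION = "rdfs:subClassOf"


def subclass_edges(
    child_ids: Iterable[str], parent_id: str, source: str
) -> List[Dict[str, str]]:
    """Build one subclass_of edge per distinct child id, preserving input order."""
    edges: List[Dict[str, str]] = []
    seen = set()
    for child in child_ids:
        if child in seen:
            continue  # a node seen twice still gets a single parent edge
        seen.add(child)
        edges.append(
            {
                "subject": child,
                "predicate": SUBCLASS_PREDICATE,
                "object": parent_id,
                "relation": SUBCLASS_RELATION,
                "primary_knowledge_source": source,
                "knowledge_level": "knowledge_assertion",
                "agent_type": "manual_agent",
            }
        )
    return edges
```

Usage, with made-up assay ids:

```python
for edge in subclass_edges(["kgmicrobe.assay:a", "kgmicrobe.assay:b"], "MICRO:0000903", "infores:bacdive"):
    print(edge["subject"], edge["predicate"], edge["object"])
```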

1. mediadive.solution → CHEBI:60004 (mixture)
   File: kg_microbe/transform_utils/mediadive/mediadive.py
   Each MediaDive solution node now carries a subclass_of edge to
   CHEBI:60004 (the canonical "mixture" parent). Approx 5,400 edges.
   Schema: standard 9-col MediaDive edge (subject, predicate, object,
   relation, source, knowledge_level, agent_type, value, unit).

2. kgmicrobe.assay → MICRO:0000903 (assay parent)
   File: kg_microbe/transform_utils/bacdive/bacdive.py
   After writing the 503 assay nodes (generate_assay_nodes), iterate
   them and emit one subclass_of edge per node pointing at
   MICRO:0000903. Pulls the entire kgmicrobe.assay:* namespace into
   the MICRO ontology that ontologies_transform now loads. Approx 503 edges.

3. residual isolation_source:* → ENVO:01000254 (environmental material)
   File: kg_microbe/transform_utils/bacdive/bacdive.py
   In the placeholder fallback branch (when no isolation_source ↔
   ontology mapping exists), also emit a subclass_of edge to
   ENVO:01000254. Curated mappings already get their canonical
   parent from the ontologies transform; only the 157 remaining
   placeholders need this. Approx 157 edges.

4. kgmicrobe.pathway → GO:0008152 (metabolic process)
   File: kg_microbe/transform_utils/madin_etal/madin_etal.py
   In the fallback path where pathways aren't in METPO and have no
   NER GO match, emit a subclass_of edge to GO:0008152 alongside
   the existing tax→pathway edge. Approx 75 edges per merged-kg.

Validation:
  poetry run pytest tests/test_isolation_source_mapping_utils.py
    tests/test_metatraits.py tests/test_extract_metpo_proposals.py
    → 36/36 pass
  ruff check on the 3 modified files → clean

Re-run scope: mediadive + bacdive + madin_etal transforms then merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… fix

Two related changes that address the May-1 custom-term subclass review:

(1) NarrowMatch plumbing — surface 199 MIM-curated parent-of relations
=====================================================================

The MIM SSSOM has ~199 ``skos:narrowMatch`` rows that explicitly assert
"this kg-microbe ingredient X is a kind-of OBO parent Y" (e.g.
``MIM:Vermont_Soil narrowMatch ENVO:00001998 (soil)``). Previously the
consolidator treated narrowMatch rows as ordinary synonyms and the
asymmetric relationship was lost — neither the unified file nor the
runtime loader could express "kgmicrobe.ingredient:vermont_soil
narrowMatch ENVO:00001998".

Three coordinated changes resolve this:

- scripts/consolidate_chemical_mappings.py
  * Add ``self.parent_relations`` and ``self.mim_to_primary``.
  * In ``load_mediaingredientmech_reviewed``, capture skos:narrowMatch /
    broadMatch rows verbatim (alongside the synonym extraction), and
    track the symmetric exactMatch rows that establish the
    MIM:<slug> ↔ kg-microbe primary correspondence.
  * In ``export_unified_sssom``, pass the captured rows through into
    the unified file with the MIM:<slug> subject translated to the
    kg-microbe primary (e.g. cas:* or kgmicrobe.ingredient:*) when the
    mapping is known. Normalise object_source to the obo:<prefix>.owl
    convention so the SSSOM curie-map validator accepts the file.

- kg_microbe/utils/chemical_mapping_utils.py
  * Add ``_PARENT_INDEX: Dict[curie, list[parents]]`` populated at
    load time from skos:narrowMatch rows in the unified SSSOM.
  * Public ``get_parents(curie)`` API plus a method on the
    ``ChemicalMappingLoader`` class. Returns the list of broader OBO
    CURIEs the ingredient is narrower than.

- kg_microbe/transform_utils/mediadive/mediadive.py
  * In the per-medium ingredient loop, after creating the ingredient
    node, call ``self.chemical_loader.get_parents(ingredient_id)``
    and emit one ``biolink:subclass_of`` edge per parent. The 199
    MIM-curated parent relations now reach merged-kg as proper
    subclass_of edges with rdfs:subClassOf as the relation.

Verified end-to-end:
* Unified SSSOM regenerated: 596,737 → 597,154 rows (+199
  narrowMatch + 218 other small bumps), passes SSSOM validator.
* ``get_parents('kgmicrobe.ingredient:vermont_soil')`` returns
  ``['ENVO:00001998']``.
* ``get_parents('cas:143314-17-4')`` returns ``['CHEBI:61326']`` —
  confirms the MIM:<slug> → cas:* translation works.
* Total entities with parents: 199.
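
A simplified sketch of the parent index, for orientation. It assumes the unified SSSOM uses the standard `subject_id` / `predicate_id` / `object_id` columns and follows the direction described in this message (the subject is the child, the object is the broader OBO parent). `build_parent_index` is a hypothetical name, and the real `get_parents` reads a module-level `_PARENT_INDEX` instead of taking the index as an argument.

```python
import csv
from collections import defaultdict
from typing import Dict, List


def build_parent_index(sssom_tsv: str) -> Dict[str, List[str]]:
    """Index skos:narrowMatch rows as child CURIE -> list of parent CURIEs."""
    index: Dict[str, List[str]] = defaultdict(list)
    # SSSOM files start with '#'-prefixed metadata lines; skip them and blanks.
    data_lines = [
        line for line in sssom_tsv.splitlines() if line and not line.startswith("#")
    ]
    for row in csv.DictReader(data_lines, delimiter="\t"):
        if row["predicate_id"] != "skos:narrowMatch":
            continue
        parents = index[row["subject_id"]]
        if row["object_id"] not in parents:
            parents.append(row["object_id"])
    return dict(index)


def get_parents(index: Dict[str, List[str]], curie: str) -> List[str]:
    """Return a copy of the parent list, or an empty list for unknown CURIEs."""
    return list(index.get(curie, []))
```

Returning a copy keeps callers from mutating the shared index, and returning an empty list for unknown CURIEs lets the emit loop iterate without a None check.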

(2) extract_metpo_proposals.py — split over-generalized aliases
================================================================

The May-2 audit flagged 13 metpo_alias entries that pointed at a METPO
parent class when a more specific child existed (e.g. "rod-shaped" →
METPO:1000666 cell shape, when METPO:1000681 "rod shaped" has
"rod-shaped" as a synonym). Direct edits to the regenerated TSV got
reverted by the test suite's regenerate-and-diff gate, so the fix
has to land in the extractor's source data.

Updated EXISTING_METPO_ALIASES in scripts/extract_metpo_proposals.py
to split each over-generalizing entry into a parent alias plus
specific child aliases:

  cell shape (METPO:1000666) ← splits out:
    rod-shaped → METPO:1000681
    coccus → METPO:1000668
    spiral → METPO:1000684
    filamentous → METPO:1000674

  oxygen requirement (METPO:1000601) ← splits out:
    aerobic → METPO:1000602
    anaerobic → METPO:1000603
    facultative anaerobic → METPO:1000605
    microaerophilic → METPO:1000604
    aerotolerant → METPO:1000609

  biosafety level classification (METPO:1001101) ← splits out:
    BSL-1 → METPO:1001102
    BSL-2 → METPO:1001103
    BSL-3 → METPO:1001104
    BSL-4 → METPO:1001105

  motility phenotype (METPO:1000701) ← splits out:
    motile → METPO:1000702
    non-motile → METPO:1000703

Plus one wrong-target fix:
  indole production capability — was METPO:1005011 (the "test
  positive" outcome variant); fixed to METPO:1005010 (indole test).
  The "test positive" alias kept as a separate entry pointing at
  METPO:1005011 where it semantically belongs.

Regenerated metpo_alias_mappings.tsv + metpo_existing_aliases.tsv
ship in this commit. The test_extract_metpo_proposals regenerate-
and-diff gate now passes against the new state.

Total: 71/71 pytest pass (extract_metpo_proposals + chemical_mapping_utils
+ isolation_source_mapping_utils).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex flagged three ways the recent subclass-plumbing work would
poison the merged-kg with semantically wrong relationships. All three
fixes ship together because they are interdependent (the loader
trust policy interacts with the placeholder fallback emit, and the
narrowMatch filter interacts with the get_parents() index).

Finding 1 [HIGH] — manual closeMatch rows promoted to canonical nodes
=====================================================================
File: kg_microbe/utils/isolation_source_mapping_utils.py
       mappings/validate_isolation_source_mappings.py

The loader's _row_is_trusted() accepted any row tagged
``semapv:ManualMappingCuration`` regardless of predicate. That admitted
41 manually-curated ``skos:closeMatch`` rows, including:
   * Catheter → NCIT:C50344 (Catheter Device)  — device, not source
   * Child → PATO:0001190 (juvenile)            — quality, not source
   * Humid → NCIT:C88206 (Humidity)             — quality, not source
   * Psychrophilic-<10°C → METPO:1000614        — phenotype class, not source
   * Boreal → ENVO:01000174 (forest biome)      — biome name mismatch

Tightened trust policy: substitution into the BacDive graph requires
``skos:exactMatch`` regardless of curator. closeMatch rows fall back
to placeholder isolation_source:* nodes. Two acceptable trust paths
within exactMatch: high-confidence auto-match OR manual curation.
Net effect: 207 → 158 trusted mappings; 49 closeMatch rows correctly
drop instead of poisoning the graph.

The standalone validator's _row_is_trusted() is updated to match
(test_validator_rules_match_loader enforces the parity).
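
The tightened policy can be summarised in a few lines. This is a sketch under assumptions: the column names `predicate_id`, `mapping_justification`, and `confidence`, and the literal `"high"` confidence tier, are inferred from the commit messages in this PR; the actual `_row_is_trusted()` may check more than this.

```python
from typing import Dict

MANUAL_CURATION = "semapv:ManualMappingCuration"


def row_is_trusted(row: Dict[str, str]) -> bool:
    """Trust a mapping row only if it is an exactMatch on one of two trust paths."""
    # Gate 1: only skos:exactMatch may substitute a canonical node,
    # regardless of who curated the row.
    if row.get("predicate_id") != "skos:exactMatch":
        return False
    # Gate 2: within exactMatch, accept manual curation OR a
    # high-confidence automatic match.
    is_manual = row.get("mapping_justification") == MANUAL_CURATION
    is_high_confidence = row.get("confidence") == "high"
    return is_manual or is_high_confidence
```

Rows that fail either gate are not deleted from the TSV; they simply fall back to placeholder isolation_source:* nodes at load time.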

Finding 2 [HIGH] — bad MIM narrowMatch rows generate false subclass edges
==========================================================================
File: scripts/consolidate_chemical_mappings.py

MIM's auto_classify_ingredient_type pipeline produced 5 narrowMatch
rows where the chemistry on both sides is unrelated:
   * MIM:Kh2po4 → CHEBI:32583 (KH2PO4 vs calcium sulfate dihydrate)
   * MIM:Mncl2_X_2_H2o → CHEBI:30200 (MnCl2 vs kaempferol glycoside)
   * MIM:Mncl2_X_4_H2o → CHEBI:30200
   * MIM:Mncl2_anhydrous → CHEBI:30200
   * MIM:D-Maltose_Monohydrate → CHEBI:233428 (maltose vs amiloride analog)

Without a filter, get_parents() would expose those rows to MediaDive's
new biolink:subclass_of emit path (commit f3a8199), making the maltose
ingredient a subclass of an unrelated amiloride analog in the
merged-kg.

Added KNOWN_BAD_NARROWMATCH set in load_mediaingredientmech_sssom()
that drops these specific (subject_id, object_id) pairs at row-load
time. The filter is safe to leave in place: once MIM upstream removes
the rows, it becomes a no-op for us. Verified: regenerated unified file has
``cas:6363-53-7 parents []`` and the parallel cases for KH2PO4
and MnCl2 hydrates.
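
A sketch of the pair filter, using the five (subject_id, object_id) pairs listed above. The function name `filter_known_bad` and the dict-per-row representation are assumptions for illustration; the consolidator applies the check inline while loading MIM rows.

```python
from typing import Dict, Iterable, List, Tuple

KNOWN_BAD_NARROWMATCH = frozenset(
    {
        ("MIM:Kh2po4", "CHEBI:32583"),
        ("MIM:Mncl2_X_2_H2o", "CHEBI:30200"),
        ("MIM:Mncl2_X_4_H2o", "CHEBI:30200"),
        ("MIM:Mncl2_anhydrous", "CHEBI:30200"),
        ("MIM:D-Maltose_Monohydrate", "CHEBI:233428"),
    }
)


def filter_known_bad(rows: Iterable[Dict[str, str]]) -> Tuple[List[Dict[str, str]], int]:
    """Drop known-bad narrowMatch pairs; return (kept rows, dropped count)."""
    kept: List[Dict[str, str]] = []
    dropped = 0
    for row in rows:
        pair = (row["subject_id"], row["object_id"])
        if row["predicate_id"] == "skos:narrowMatch" and pair in KNOWN_BAD_NARROWMATCH:
            dropped += 1
            continue
        kept.append(row)
    return kept, dropped
```

Matching on the exact pair, not on the subject alone, means a corrected narrowMatch for the same ingredient passes through as soon as MIM publishes it.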

Finding 3 [MEDIUM] — blanket ENVO subclass_of for all isolation_source placeholders
====================================================================================
File: kg_microbe/transform_utils/bacdive/bacdive.py

The previous commit (959baa6) emitted
``isolation_source:* biolink:subclass_of ENVO:01000254`` for every
unmapped isolation_source placeholder. But the table intentionally
leaves labels like 'Human', 'Leaf-Phyllosphere', and
'host_animal_endotherm_intratissue' unmapped, and those are NOT
environmental materials — they're hosts / anatomy / niches. A blanket
ENVO parent would poison downstream reasoning over source type.

Removed the blanket subclass_of edge. Placeholders stay unparented
until a vetted host/anatomy/environment mapping lands in
mappings/isolation_source_to_ontology.tsv. The mediadive.solution →
CHEBI:60004, kgmicrobe.assay → MICRO:0000903, kgmicrobe.pathway →
GO:0008152 emits all stay (those are correct single-parent types).

Verified
========
* python mappings/validate_isolation_source_mappings.py → OK
* poetry run pytest tests/test_isolation_source_mapping_utils.py
  tests/test_chemical_mapping_utils.py
  tests/test_consolidate_chemical_mappings.py
  tests/test_metatraits.py → 110 passed
* Consolidator regenerates unified_ingredient_mappings.sssom.tsv.gz
  cleanly: 5 known-bad narrowMatch dropped at MIM load.
* test_loader_honors_manually_curated_fixes updated to match new
  policy (Plant→Viridiplantae was a closeMatch row that no longer
  qualifies; Mammals→Mammalia is exactMatch and still honored).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
realmarcin and others added 11 commits May 2, 2026 18:56
Following the Codex adversarial review's tightening of the loader trust
policy (commit 7bc3fd7) — which now requires skos:exactMatch for
canonical node substitution — 41 manually-curated skos:closeMatch rows
in mappings/isolation_source_to_ontology.tsv stopped being honored at
runtime. This commit re-audits each one against isolation-source
semantics and either:
  (a) promotes to skos:exactMatch when the BacDive label and ontology
      term denote the same entity in isolation-source context, or
  (b) keeps as closeMatch when there's a real family mismatch
      (device for a sample, quality/phenotype for a source, etc.).

PROMOTED (34 rows):

  Host taxa (common name → NCBITaxon class/family):
    Birds → NCBITaxon:8782 Aves
    Chicken → NCBITaxon:9031 Gallus gallus
    Dinoflagellate → NCBITaxon:2864 Dinophyceae
    Fishes → NCBITaxon:7898 Actinopterygii
    Plant → NCBITaxon:33090 Viridiplantae
    Plants → NCBITaxon:33090 Viridiplantae
    Reptilia → NCBITaxon:8504 Lepidosauria
    Tick → NCBITaxon:6939 Ixodida

  Anatomy (BacDive label → UBERON canonical):
    Ankle → UBERON:0001488 ankle joint
    Bladder → UBERON:0018707 bladder organ
    Gastrointestinal-tract → UBERON:0005409 digestive tract
    Tooth → UBERON:0001091 calcareous tooth
    Urogenital-tract → UBERON:0004122 genitourinary system

  Plant anatomy (PO):
    Phylloplane → PO:0006016 leaf epidermis
    Plant-sap-Flux → PO:0025538 plant sap
    Stem-Branch → PO:0009047 stem

  Environments / substrates (ENVO/FOODON):
    Boreal → ENVO:01000174 forest biome
    Composting → ENVO:00002170 compost
    Hot → ENVO:01000305 high temperature environment
    Indoor → ENVO:01000856 indoor environment
    Iron-mat → ENVO:01000110 microbial mat
    Lake-large → ENVO:00000020 lake
    Meat → FOODON:00001027 meat food product
    Plant-litter-Forest → ENVO:01000628 plant litter
    Pond-small → ENVO:00000033 pond
    Thermal-spring → ENVO:00000051 hot spring
    Volcanic → ENVO:00000094 volcanic feature
    Water-reservoir-Aquarium/pool → ENVO:00000025 reservoir

  Cellular contexts (GO):
    Extracellular → GO:0005615 extracellular space
    Intracellular → GO:0005622 intracellular anatomical structure

  Clinical / pathology / virology (mesh, NCIT):
    Lesion-incl.-Necrosis → NCIT:C3824 Lesion
    Peat-moss → mesh:D044003 Sphagnopsida
    Viriome → mesh:D000083422 Virome
    Wound → mesh:D014947 Wounds and Injuries

KEPT DROPPED (7 rows — family-mismatched targets):

  * Catheter → NCIT:C50344 (Catheter Device): device, not source
  * Child → PATO:0001190 (juvenile): quality, not source
  * Humid → NCIT:C88206 (Humidity): quality, not source
  * Psychrophilic-<10°C → METPO:1000614: phenotype class, not source
  * Thermophilic->45°C → METPO:1000616: phenotype class, not source
  * Heavy-metal → CHEBI:25555 (monoatomic ion): semantic drift —
    not all heavy metals are monoatomic ions
  * Bronchial-wash → UBERON:0002185 (bronchus): sample type vs anatomy

Net effect on next merged-kg:
  158 → 192 trusted isolation_source mappings (+34)
  ~2,500 organism→ontology edges added across the promoted labels
  (estimate based on prior edge counts; will materialize on rerun)

Tests updated to reflect the post-audit state. The
test_loader_honors_manually_curated_fixes assertion now checks five
representative promotions plus four representative drops.

Verified:
  poetry run pytest tests/test_isolation_source_mapping_utils.py
    tests/test_metatraits.py → 35/35 pass
  python mappings/validate_isolation_source_mappings.py → OK

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_curated_fixes

The docstring I added in commit 0626294 used a single-line summary on the
first line followed by a blank line and detail paragraph. The repo's ruff
config enforces D213 (multi-line summary must start on the second line),
so the linter rejected it.

Auto-fixed by ruff --fix: the summary now begins on the line after the
opening triple-quote, matching the style of the other multi-line
docstrings in this file.

Verified locally:
* poetry run ruff check kg_microbe/ tests/ → all checks passed
* poetry run pytest tests/test_isolation_source_mapping_utils.py → 9/9 pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dings

Round-2 Codex review caught two issues that survived the first cleanup:

Finding 1 [HIGH] — trusted mappings still admit qualities, procedures, devices
==============================================================================
Files: kg_microbe/utils/isolation_source_mapping_utils.py
       mappings/validate_isolation_source_mappings.py

The previous trust policy required skos:exactMatch but did not validate
the ontology family of the target. That admitted 11 trusted rows where
the BacDive label was a sample source but the target was a quality,
procedure, or device — producing organism→quality / organism→procedure
edges that look like sample-source claims:

  Acidic              → PATO:0001429   (pH quality)
  Alkaline            → PATO:0001430   (pH quality)
  Cold                → PATO:0000256   (temperature quality)
  Female              → PATO:0000383   (biological sex)
  Male                → PATO:0000384   (biological sex)
  Juvenile            → PATO:0001190   (life-stage quality)
  Antibiotic-treatment → PRIDE:0001000 (a treatment, not a substrate)
  Food-production     → FOODON:03530206 (a process, not a substrate)
  Medical-device      → NCIT:C16830    (a device, not a substrate)
  Swab                → NCIT:C17627    (a collection procedure)
  Surface-swab        → SNOMED:258537007 (collection procedure)

Two coordinated fixes:

* DISALLOWED_OBJECT_SOURCES gains PATO and METPO. PATO is universally a
  qualities ontology — never a substrate. METPO is for phenotype classes
  the organism *exhibits*, not a place organisms are isolated *from*.
  These are reject-by-prefix.

* BANNED_OBJECT_LABEL_SUBSTRINGS gains "swab", "medical device",
  "food production", and "antibiotic treatment". These catch the
  procedure / device / process rows in mixed-content prefixes
  (NCIT and SNOMED contain real substrates AND clinical procedures —
  prefix-level rejection would lose Aspirate, Blood-culture, etc.).

The 11 affected rows are unmapped in
mappings/isolation_source_to_ontology.tsv with curator='family_mismatch_fix'
and notes-column rationale citing this Codex round.

The standalone validator's banned lists are kept in sync (drift-detection
test test_validator_rules_match_loader enforces the parity).
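
The two rejection layers can be sketched as below. Only the entries named in this PR are shown (the real lists are longer), and the helper name `family_mismatch_reason` and the case-insensitive substring match are assumptions for illustration.

```python
from typing import Optional

# Reject-by-prefix: ontologies that never describe a substrate.
DISALLOWED_OBJECT_SOURCES = frozenset({"HP", "PATO", "METPO"})

# Reject-by-label: procedure / device / process terms inside
# mixed-content prefixes such as NCIT and SNOMED.
BANNED_OBJECT_LABEL_SUBSTRINGS = (
    "swab",
    "medical device",
    "food production",
    "antibiotic treatment",
)


def family_mismatch_reason(object_id: str, object_label: str) -> Optional[str]:
    """Return why a mapping target is rejected, or None if it is acceptable."""
    prefix = object_id.split(":", 1)[0]
    if prefix in DISALLOWED_OBJECT_SOURCES:
        return f"disallowed prefix {prefix}"
    label = object_label.lower()
    for banned in BANNED_OBJECT_LABEL_SUBSTRINGS:
        if banned in label:
            return f"banned label substring {banned!r}"
    return None
```

The prefix check runs first because it is the cheaper and stricter test; the label check exists so that NCIT and SNOMED substrates are still accepted.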

Finding 2 [HIGH] — BacDive emitted edges to unloaded prefixes
==============================================================
Files: kg_microbe/utils/isolation_source_mapping_utils.py
       kg_microbe/transform_utils/bacdive/bacdive.py

BacDive's emit path writes the mapped CURIE directly as the edge subject.
For the edge to land cleanly, *something* has to materialize a node for
that CURIE — either ontologies_transform (if the prefix is in
ONTOLOGIES_MAP) or BacDive itself (if the prefix is in
STUB_ONTOLOGY_PREFIXES). Codex found 21 trusted rows whose targets
satisfied neither condition, producing dangling references in the
merged graph: mesh, NCIT, GENEPIO, FAO, BTO, SNOMED prefixes.

Two coordinated fixes:

* STUB_ONTOLOGY_PREFIXES extended from {PRIDE, PCO} to also cover
  {mesh, NCIT, GENEPIO, FAO, BTO, SNOMED}. BacDive now emits a thin
  node row per occurrence with the object_label from the mapping TSV
  and biolink:OntologyClass category — same pattern previously used for
  PRIDE/PCO. The full ontologies aren't loaded (mesh and NCIT are
  enormous clinical thesauri); per-mapping stub nodes are sufficient
  for the small number of trusted IDs in use.

* New BacDiveTransform._validate_isolation_source_target_prefixes()
  runs at __init__ time and aborts with a clear, fail-fast error if
  any trusted mapping points at a prefix that isn't either loaded by
  the ontologies transform or in the stub set. Catches future curator
  mistakes (or deletions of stub support) at load time, not after the
  graph has been corrupted.
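
The fail-fast check could look roughly like the sketch below. The stub set is the one listed in this commit; `LOADED_ONTOLOGY_PREFIXES` is a hypothetical stand-in for whatever ONTOLOGIES_MAP covers, and the function name and signature are illustrative rather than the real method.

```python
# Stub set as listed in this commit.
STUB_ONTOLOGY_PREFIXES = {
    "PRIDE", "PCO", "mesh", "NCIT", "GENEPIO", "FAO", "BTO", "SNOMED",
}
# Assumption: stand-in for the prefixes that ONTOLOGIES_MAP actually loads.
LOADED_ONTOLOGY_PREFIXES = {
    "CHEBI", "ENVO", "FOODON", "GO", "NCBITaxon", "PO", "UBERON",
}


def validate_target_prefixes(
    object_ids,
    loaded=LOADED_ONTOLOGY_PREFIXES,
    stubs=STUB_ONTOLOGY_PREFIXES,
):
    """Raise ValueError if a trusted target uses a prefix nobody materializes."""
    used = {object_id.split(":", 1)[0] for object_id in object_ids}
    dangling = sorted(used - set(loaded) - set(stubs))
    if dangling:
        raise ValueError(
            f"Trusted mappings point at prefixes with no node source: {dangling}"
        )
    return sorted(used)
```

Running a check of this kind in the constructor means a bad mapping row stops the transform before any edge is written.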

Verified
========
* python mappings/validate_isolation_source_mappings.py → OK
* poetry run pytest tests/test_isolation_source_mapping_utils.py
  tests/test_metatraits.py → 35/35 pass
* BacDiveTransform() instantiates cleanly:
    "trusted mappings: 181"
    "target prefixes in trusted set:
       ['BTO', 'CHEBI', 'ENVO', 'FAO', 'FOODON', 'GENEPIO', 'GO',
        'NCBITaxon', 'NCIT', 'PCO', 'PO', 'PRIDE', 'SNOMED',
        'UBERON', 'mesh']"
  Every prefix is in ONTOLOGIES_MAP or STUB_ONTOLOGY_PREFIXES.

Net effect: trusted mappings 192 → 181 (-11 family-mismatched). The
edges that previously dangled (mesh:D000038 'Abscess',
NCIT:C13347 'Aspirate', BTO:0003114 'wound fluid', etc.) now have
proper stub nodes in BacDive's output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… file

Codex's third-round adversarial review identified that the recent
narrowMatch plumbing was structurally broken: only 19 of 194
narrowMatch rows resolved back to their intended child CURIE; 131
collapsed onto the parent. Three coordinated fixes:

(1) Stop materializing asymmetric MIM rows into parent's lexical record
=========================================================================
File: scripts/consolidate_chemical_mappings.py
       (load_mediaingredientmech_sssom, lines ~1257-1346)

The asymmetric branch (narrowMatch / broadMatch) used to fall through
to the same add_chemical(id=object_id, ...) call as symmetric matches,
feeding the child's subject_label and MIM xref into the broader
parent's synonym/xref table. After this change, asymmetric rows are
stored in self.parent_relations only — they no longer touch the parent
entity's lexical state. The child's labels/xrefs come exclusively from
the sibling exactMatch row (e.g. MIM:Vermont_Soil →
kgmicrobe.ingredient:vermont_soil) processed in the symmetric branch.
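
A simplified sketch of the routing described above. `ConsolidatorSketch`, its record layout, the row keys, and the keying of `parent_relations` by child subject ID are all assumptions; only `add_chemical` and `parent_relations` are names taken from this commit.

```python
from collections import defaultdict

ASYMMETRIC_PREDICATES = {"skos:narrowMatch", "skos:broadMatch"}


class ConsolidatorSketch:
    """Hypothetical stand-in for the consolidator; not the real class."""

    def __init__(self):
        self.records = {}  # entity id -> {"synonyms": set, "xrefs": set}
        self.parent_relations = defaultdict(set)  # child subject id -> parent ids

    def add_chemical(self, id, label, xref):
        record = self.records.setdefault(id, {"synonyms": set(), "xrefs": set()})
        record["synonyms"].add(label)
        record["xrefs"].add(xref)

    def load_row(self, row):
        if row["predicate_id"] in ASYMMETRIC_PREDICATES:
            # Record the hierarchy only; the parent's lexical state is untouched.
            self.parent_relations[row["subject_id"]].add(row["object_id"])
            return
        # Symmetric matches still feed labels and xrefs into the target record.
        self.add_chemical(
            id=row["object_id"],
            label=row["subject_label"],
            xref=row["subject_id"],
        )
```

Under these assumptions, loading the exactMatch and narrowMatch rows for MIM:Vermont_Soil leaves ENVO:00001998 without a 'Vermont Soil' synonym while still recording it as the parent.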

(2) Add purge_asymmetric_pollution() to clean up baseline reseed leakage
=========================================================================
The consolidator's load_existing_unified() seeds from the prior
unified file, which carried forward the polluted state from earlier
runs. New purge step removes:
  * Child labels (subject_label, child's canonical_name, child's
    synonyms) from each parent's synonym set
  * MIM:<child> xref from each parent's xref set
  * The cross-xref symmetry between child_primary ↔ parent_primary
    that propagate_synonyms_via_xrefs would otherwise re-amplify

Runs after MIM SSSOM load, before propagate_synonyms_via_xrefs, so the
cleaned data doesn't get re-bridged through xref equivalence.

Logs counts each run: e.g. "Purged 188 stray child-label synonym(s)
and 158 stray MIM xref(s) from 164 parent record(s)."
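
The first two removal rules can be sketched as below. The data layout (plain dicts of sets) and the signature are assumptions, and the third rule (breaking the child_primary/parent_primary cross-xref) is left out for brevity; this is not the consolidator's actual implementation.

```python
def purge_asymmetric_pollution(records, parent_relations, child_labels):
    """Strip child labels and child MIM xrefs from parent records.

    records:          parent id -> {"synonyms": set, "xrefs": set}
    parent_relations: child MIM id -> set of parent ids
    child_labels:     child MIM id -> set of labels owned by the child
    Returns (synonyms_removed, xrefs_removed, parents_touched).
    """
    synonyms_removed = 0
    xrefs_removed = 0
    touched = set()
    for child_id, parent_ids in parent_relations.items():
        labels = child_labels.get(child_id, set())
        for parent_id in parent_ids:
            record = records.get(parent_id)
            if record is None:
                continue
            stray = record["synonyms"] & labels
            if stray:
                record["synonyms"] -= stray
                synonyms_removed += len(stray)
                touched.add(parent_id)
            if child_id in record["xrefs"]:
                record["xrefs"].discard(child_id)
                xrefs_removed += 1
                touched.add(parent_id)
    return synonyms_removed, xrefs_removed, len(touched)
```

The three returned counts correspond to the three numbers in the log line quoted above.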

(3) Rename the unified mappings file
=====================================
mappings/unified_ingredient_mappings.sssom.tsv.gz
  → mappings/kgmicrobe_unified_entity_mappings.sssom.tsv.gz

The file holds chemicals AND foods AND anatomy AND environments —
"ingredient" was always too narrow. Standardizing on
"kgmicrobe_unified_entity_mappings" matches the kg-microbe scope.
All references updated:
  * scripts/consolidate_chemical_mappings.py (output path + docstring)
  * kg_microbe/utils/chemical_mapping_utils.py (default loader path
    + docstrings)
  * mappings/README.md
  * mappings/validate_manual_mappings.py
  * tests/test_negative_cache.py

Verification
============
Verified the fix end-to-end against representative MIM-curated
child terms:

  Vermont Soil      → kgmicrobe.ingredient:vermont_soil       parents=['ENVO:00001998']
  Beef brain powder → kgmicrobe.ingredient:beef_brain_powder  parents=['FOODON:02020911']
  Actinomycin A     → kgmicrobe.compound:actinomycin_a        parents=['CHEBI:15369']

Codex's coverage check across the full set: was 19/194 narrowMatch
rows resolving to their child; now 121/194 (+~6×). Remaining 25
parent-resolutions and 46 other-resolutions are mostly distinct
secondary-pollution channels that need separate audit.

Three new regression tests in tests/test_chemical_mapping_utils.py
under TestNarrowMatchChildResolution exercise the committed mapping
file (not mocks) so a future consolidator regression that re-pollutes
parents will fail loudly.

* poetry run ruff check kg_microbe/ tests/ → all checks passed
* poetry run pytest tests/test_chemical_mapping_utils.py
  tests/test_isolation_source_mapping_utils.py
  tests/test_consolidate_chemical_mappings.py
  tests/test_metatraits.py → 114/114 pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Vendored copy of MIM's ingredient_mappings.sssom.tsv now reflects the
state introduced by MIM commit 887ee9f on fix/remove-bad-narrow-match-rows-pr558:
the 5 KNOWN_BAD_NARROWMATCH rows (KH2PO4, MnCl2_*, D-Maltose) where
the auto-classifier produced unrelated chemistry targets are removed.

Diff: -7 / +1 (net -5 narrowMatch rows + 1 comment-line update on
the surviving cas: identity row for D-Maltose_Monohydrate, which
documents the bogus CHEBI:233428 reference removal).

This vendored sync matches the SSSOM state that MIM PR1
(fix/remove-bad-narrow-match-rows-pr558, also includes commit
16a6527 — Group A validator + CI gate) will publish once merged
to MIM main. The next consolidator run will reproduce the same
state idempotently from whichever MIM:main commit is current.

Once that PR merges and another MIM-driven consolidator pass runs,
kg-microbe's KNOWN_BAD_NARROWMATCH filter at consolidate_chemical_mappings.py:1211-1217
becomes redundant — that workaround can be removed in a follow-up
PR (mirrors the planned removal of purge_asymmetric_pollution()
once MIM PR2's structural invariants land).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 5 hardcoded bad-pair entries in consolidate_chemical_mappings.py were
filtering MIM rows that have since been corrected upstream. The filter is
now redundant at multiple layers (MIM upstream + asymmetric-pollution
purge + xref sweep), so this drop removes the local guard and keeps MIM
as the single source of truth.

Re-ran the consolidator against the freshly-updated MIM SSSOM:
- 2017 MIM rows loaded (0 skipped as known-bad — filter no longer applied)
- 1881 stale MIM xrefs swept from baseline
- 19 stray child-label synonyms + 159 stray MIM xrefs purged from 148
  parent records (asymmetric-pollution guard)
- 594,970 unified mappings emitted, SSSOM round-trip validation passes
- All 67 chemical-mapping tests pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The remote METPO classes ROBOT template fetched by load_metpo_mappings
pinned berkeleybop/metpo at the 2026-03-24 tag, which meant a curator
edit to fix a label→METPO-ID mapping required either bumping the tag or
waiting on a new METPO release. This adds a final overlay step that
reads `kg_microbe/transform_utils/metatraits/mappings/metpo_alias_mappings.tsv`
(67 high-confidence ManualMappingCuration rows) and updates the in-memory
mapping dict so curator edits take effect on the next transform run.

Trust policy mirrors the BacDive isolation-source loader:
- mapping_justification == 'semapv:ManualMappingCuration', AND
- confidence in {'high', 'medium'}

Rows pointing at unminted METPO IDs (proposed-but-not-yet-released) are
skipped with INFO logging — those keep flowing through the kgmicrobe.*
placeholder path which is the correct destination until upstream lands
the proposal. Both raw and normalized label keys are emitted so case-
mismatched callers find the override.
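
A sketch of the overlay under the trust policy above. The column names (`subject_label`, `object_id`, `confidence`), the `normalize_label` rule, and the function signature are assumptions; the real helper may read the TSV and the METPO tree differently.

```python
import logging

logger = logging.getLogger(__name__)

TRUSTED_JUSTIFICATION = "semapv:ManualMappingCuration"
TRUSTED_CONFIDENCE = {"high", "medium"}


def normalize_label(label: str) -> str:
    """Assumed normalization: trim whitespace and lowercase."""
    return label.strip().lower()


def apply_alias_overrides(mapping: dict, rows: list, minted_ids: set) -> int:
    """Overlay trusted curator rows onto the in-memory label -> METPO ID dict."""
    applied = 0
    for row in rows:
        if row["mapping_justification"] != TRUSTED_JUSTIFICATION:
            continue
        if row["confidence"] not in TRUSTED_CONFIDENCE:
            continue
        if row["object_id"] not in minted_ids:
            # Proposed-but-unreleased IDs stay on the placeholder path.
            logger.info("Skipping unminted METPO ID %s", row["object_id"])
            continue
        label = row["subject_label"]
        # Emit both the raw and the normalized key.
        for key in (label, normalize_label(label)):
            mapping[key] = row["object_id"]
        applied += 1
    return applied
```

Because the overlay runs last, a trusted row replaces whatever the pinned template supplied for the same label.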

Tests: 4 new unit tests in tests/test_metpo_alias_overrides.py exercise
the helper in isolation (no network) by stubbing the METPO tree with a
minimal node set. All 67 rows round-trip cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two semantic fixes for transforms that emitted nodes with the wrong
biolink role:

1. madin_etal substrate/quality partition (madin_etal.py)
   Madin et al's environments.csv ENVO_ids column conflates ENVO
   substrates with PATO qualities for compositional habitats like
   "rock_deep" → ["ENVO:00001995 rock", "PATO:0001596 increased depth"].
   The transform was emitting one organism→location_of edge per CURIE,
   so PATO qualities ended up as locations of organisms (~569 such
   edges before fix). New `_partition_substrate_quality_curies()` helper
   splits substrates from qualities; substrates anchor organism→
   location_of edges, qualities attach to those substrates via a new
   biolink:has_attribute / RO:0000086 has_quality predicate. PATO
   nodes are emitted with biolink:PhenotypicQuality category. Adds
   HAS_QUALITY_RELATION / HAS_QUALITY_PREDICATE constants.

2. mediadive medium categorization (mediadive.py + constants.py)
   Individual mediadive.medium:* nodes were single-cat biolink:GrowthMedium,
   which flattened the upstream-biolink defined/complex distinction.
   Now multi-cat per the medium's complex_medium_type flag:
   - defined: biolink:GrowthMedium|biolink:ChemicalMixture
   - complex: biolink:GrowthMedium|biolink:ComplexMolecularMixture
   The medium-type parent nodes get the matching biolink-only category:
   - mediadive.medium-type:defined → biolink:ChemicalMixture
   - mediadive.medium-type:complex → biolink:ComplexMolecularMixture

   Also fixes a P1-P10 orphan bug surfaced by the new kg-path-review
   `orphan-edges` archetype: when a medium has no SOLUTIONS_KEY in its
   detail JSON, the loop continues past the medium-node-emission point
   while still having emitted the subclass_of edge. P1-P10
   pharmacopoeial media survived to the merged KG with biolink:NamedThing
   fallback and empty names. Fix moves the medium node row write to
   right after the medium-type edge so it is never skipped.
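
The mediadive ordering fix in item 2, sketched under assumptions: the tuple layouts, the `solutions` key, the example predicate for solution edges, and the helper names are hypothetical; only the category strings and the medium-type CURIEs come from this commit.

```python
GROWTH_MEDIUM = "biolink:GrowthMedium"


def medium_categories(is_complex: bool) -> str:
    """Pipe-joined multi-category string for a single medium node."""
    mixture = (
        "biolink:ComplexMolecularMixture" if is_complex else "biolink:ChemicalMixture"
    )
    return f"{GROWTH_MEDIUM}|{mixture}"


def emit_medium(medium_id, name, detail, nodes, edges):
    """Emit the type edge and the node row before any early exit."""
    is_complex = bool(detail.get("complex_medium_type"))
    type_id = "mediadive.medium-type:" + ("complex" if is_complex else "defined")
    edges.append((medium_id, "biolink:subclass_of", type_id))
    # Node row is written here so a medium without solutions is never orphaned.
    nodes.append((medium_id, medium_categories(is_complex), name))
    solutions = detail.get("solutions")  # hypothetical key name
    if not solutions:
        return
    for solution_id in solutions:
        edges.append((medium_id, "biolink:has_part", solution_id))
```

With the node row written before the early return, a medium whose detail JSON lacks solutions still reaches the merge with its name and categories.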

Tests:
- tests/test_madin_pato_partition.py (5 tests): canonical rock_deep
  split, pure-substrate row, multi-substrate-with-quality cross-product,
  PATO-only edge case, unknown-prefix-treated-as-substrate.
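
The partition rules those five tests describe can be sketched as follows. The (subject, predicate, object) tuple layout, the substrate-as-subject edge direction, and the decision to emit nothing for a PATO-only row are assumptions; the real helper is `_partition_substrate_quality_curies()` in madin_etal.py.

```python
QUALITY_PREFIXES = {"PATO"}


def partition_substrate_quality_curies(curies):
    """Split CURIEs into (substrates, qualities); unknown prefixes are substrates."""
    substrates, qualities = [], []
    for curie in curies:
        prefix = curie.split(":", 1)[0]
        (qualities if prefix in QUALITY_PREFIXES else substrates).append(curie)
    return substrates, qualities


def build_edges(organism_id, curies):
    """Anchor the organism on substrates; attach qualities to those substrates."""
    substrates, qualities = partition_substrate_quality_curies(curies)
    edges = [(s, "biolink:location_of", organism_id) for s in substrates]
    # Cross-product: every quality qualifies every substrate in the row.
    edges += [(s, "biolink:has_attribute", q) for s in substrates for q in qualities]
    return edges
```

For the rock_deep example, this yields one location edge from ENVO:00001995 and one quality edge to PATO:0001596, rather than two location edges.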

Affected transforms: madin_etal, mediadive (rerun before re-merging).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kg-path-review (kg_path_review.py + SKILL.md):
- New `family-mismatch` archetype: flags edges whose subject prefix is
  in {PATO, UO, METPO} when the predicate is biolink:location_of /
  biolink:has_part. Mirrors DISALLOWED_OBJECT_SOURCES in the BacDive
  trust filter. Catches the bug class fixed in this PR session
  (PATO-as-organism-location from BacDive and madin_etal).
- New `orphan-edges` archetype: per-transform endpoint integrity check.
  Cross-transform-supplied prefixes (CHEBI/ENVO/UBERON/etc. that the
  ontologies transform fills in at merge) are filtered by default to
  keep signal-to-noise high; `--include-cross-transform` opts in.
  Surfaced the mediadive P1-P10 orphan bug fixed in the previous commit.
- New `_list_transform_dirs()` helper filters merge-snapshot dirs
  (`merged_*`, `merged-*`) from aggregate archetypes — fixes the
  triple-counting that caused fake CRITICAL cardinality findings
  earlier in the session.
- `warn_if_stale_merge()` runs before every archetype and prints a
  stderr warning when merged-kg.tar.gz is older than any transform
  output. Catches the staleness pitfall hit twice this session.
- `false-majority` proxy: refined regex to skip canonical polarity
  trait labels (gram negative, catalase positive, oxidase variable,
  etc.) — without this, 36k legitimate gram-negative organism edges
  flooded the report. Documented the proxy's label-shaped-only
  limitation.
- New CLI flags: `--include-cross-transform`, `--max-rows`.
- SKILL.md: new "Operational gotchas" section pinning the four
  recurring pitfalls (stale builds, snapshot dirs, gram-negative as
  positive trait, PATO-as-location). Walk example updated to
  kgmicrobe.strain (BacDive's actual strain CURIE prefix; NCBITaxon
  references in BacDive go DOWN to strains via location_of, not the
  other way around).
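
A rough sketch of the snapshot-dir filter and the staleness warning from the list above. The function signatures, the directory layout, and the message text are assumptions, not the script's actual code.

```python
import sys
from pathlib import Path


def list_transform_dirs(root: Path) -> list[Path]:
    """Return per-transform dirs, skipping merge snapshots (merged_*, merged-*)."""
    return sorted(
        p
        for p in root.iterdir()
        if p.is_dir() and not p.name.startswith(("merged_", "merged-"))
    )


def warn_if_stale_merge(merged_archive: Path, transformed_root: Path) -> bool:
    """Warn on stderr when any transform output is newer than the merged archive."""
    if not merged_archive.exists():
        return False
    merged_mtime = merged_archive.stat().st_mtime
    newer = [
        f
        for d in list_transform_dirs(transformed_root)
        for f in d.rglob("*")
        if f.is_file() and f.stat().st_mtime > merged_mtime
    ]
    if newer:
        print(
            f"WARNING: {merged_archive.name} is older than {len(newer)} "
            "transform output file(s); re-merge before trusting findings.",
            file=sys.stderr,
        )
    return bool(newer)
```

Comparing modification times is a cheap proxy: it does not detect content changes, only that a transform was rerun after the last merge.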

kg-model-review (SKILL.md):
- Documented multi-category nodes (e.g.
  METPO:1001000|biolink:Procedure on kgmicrobe.assay nodes; the
  reviewer accepts any pipe-split component being valid).
- Added biolink:Procedure, biolink:PhenotypicQuality to recognized
  categories with usage notes.
- Added biolink:has_attribute to recognized predicates (used by the
  new madin_etal substrate-quality fix).

chemical-mapping (SKILL.md):
- Renamed every reference to the unified file from
  `unified_ingredient_mappings.sssom.tsv.gz` to
  `kgmicrobe_unified_entity_mappings.sssom.tsv.gz` (was stale since
  commit b132be6). 6 occurrences updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the working configuration from CultureBotAI/MicroGrowLink:
runs `anthropics/claude-code-action@v1` on every PR
open/sync/reopen, loading the `code-review@claude-code-plugins`
plugin from the anthropics/claude-code marketplace and dispatching
`/code-review:code-review <owner>/<repo>/pull/<num>` as the prompt.

Requires the repo-level secret CLAUDE_CODE_OAUTH_TOKEN to be configured.
Without it the workflow fails at that step but does not block the
existing kg-microbe QC checks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reshape multi-line docstrings to comply with the project's pydocstyle
convention: opening triple-quote on its own line, then summary + blank
line + body, or a single-line summary that fits in <=120 chars. Also
covers two helper docstrings in kg_microbe/utils/mapping_file_utils.py
that the ruff CI flagged on the PR #558 build (3.10/3.11/3.12).

Tests: 82 pass after reshape; ruff check kg_microbe/ tests/ clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@realmarcin realmarcin merged commit 827ebb5 into master May 3, 2026
4 of 5 checks passed
@realmarcin realmarcin deleted the team-review-sssom branch May 3, 2026 18:57