Docs: MAPPING_SEMANTICS.md (predicate semantics + registry pattern) #3

Merged: realmarcin merged 2 commits into main from docs/mapping-semantics on May 4, 2026

@realmarcin (Collaborator)

Summary

Adds MAPPING_SEMANTICS.md — a self-contained reference for curators working on mappings/ingredient_mappings.sssom.tsv. Lands the documentation contract that the validator at scripts/validate_sssom_invariants.py (Rule A, merged in PR #2) and the deferred Group B rules (B1–B4) enforce.

This is PR3 of the 3-PR Codex-#558 hardening pass:

  • ✅ PR #2, "Remove 5 narrowMatch rows where auto-classifier produced unrelated targets" (merged) — Group A: 5 bad rows removed + Rule A validator + CI gate
  • 🟡 PR (this one) — Group C: docs covering both A (live) and B1–B4 (deferred)
  • ⏳ PR (future) — Group B: extend the validator with structural invariants (B1 mandatory registry row, B2 no double-typed pairs, B3 narrow≠exact-on-same-target, B4 canonical object_label drift)
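
To make the planned structural invariants concrete, here is a minimal sketch of how B1 and B2 might be checked. It is not the deferred validator: the function name `check_group_b`, the tuple shape it returns, and the reading of B2 as "one object under more than one predicate for the same subject" are all assumptions for illustration. The column names follow the standard SSSOM slots.

```python
from collections import defaultdict

# Prefix named in this PR; using it as the B1 lookup key is an assumption.
REGISTRY_PREFIX = "kgmicrobe.ingredient:"


def check_group_b(rows):
    """Return (rule_id, subject_id, detail) tuples for B1/B2 violations.

    Hypothetical helper: `rows` is a list of dicts keyed by SSSOM column
    names (subject_id, predicate_id, object_id).
    """
    by_subject = defaultdict(list)
    for row in rows:
        if row["subject_id"].startswith("MIM:"):
            by_subject[row["subject_id"]].append(row)

    problems = []
    for subject, subject_rows in by_subject.items():
        # B1 (assumed reading): each MIM:<slug> needs an exactMatch registry row.
        has_registry = any(
            r["predicate_id"] == "skos:exactMatch"
            and r["object_id"].startswith(REGISTRY_PREFIX)
            for r in subject_rows
        )
        if not has_registry:
            problems.append(("B1", subject, "no registry row"))

        # B2 (assumed reading): one (subject, object) pair, one predicate.
        predicates_by_object = defaultdict(set)
        for r in subject_rows:
            predicates_by_object[r["object_id"]].add(r["predicate_id"])
        for obj, preds in predicates_by_object.items():
            if len(preds) > 1:
                problems.append(("B2", subject, f"{obj} typed as {sorted(preds)}"))
    return problems
```

Under these assumptions, the two-row Vermont_Soil pattern passes both checks, while a subject with only a narrowMatch row is reported under B1.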

What's in MAPPING_SEMANTICS.md (464 lines)

  1. Predicate semantics — explicit, unambiguous definitions for the four SKOS mapping predicates we use (exactMatch, closeMatch, narrowMatch, broadMatch). The narrowMatch definition states the Codex-#558 round-3 finding outright: "Downstream consumers MUST emit this as biolink:subclass_of (or rdfs:subClassOf), NEVER as identity."
  2. Registry/identity row pattern — Vermont_Soil as the worked example, with the actual two-row TSV from the live SSSOM showing how MIM:Vermont_Soil → kgmicrobe.ingredient:vermont_soil exactMatch pairs with MIM:Vermont_Soil → ENVO:00001998 narrowMatch. Explains why the registry row is the single channel for downstream consumers to resolve MIM:<slug> to its kg-microbe primary id without conflating with the OBO parent.
  3. Common mistakes — four code-fenced TSV examples each named by the rule id that catches them:
    • Rule A (live): KH2PO4 → CaSO4·2H2O — auto-classifier zero token overlap
    • Rule B1 (PR coming): missing kgmicrobe.ingredient registry row
    • Rule B2 (PR coming): same (MIM:X, ENVO:00001998) under both exactMatch and narrowMatch
    • Rule B4 (PR coming): 'soils' written instead of canonical 'soil'
  4. Curator workflow — three options when CI rejects a row: (a) fix the YAML and let claw regenerate, (b) park the row in mappings/needs_curator_review.tsv, (c) reject the proposal entirely. Plus a sample REJECT block showing the validator stderr format.
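
The predicate semantics and registry-row pattern described above can be sketched from a downstream consumer's point of view. This is an illustration under stated assumptions, not code from the repository: `resolve_primary_id`, `to_edge`, and `EDGE_FOR_PREDICATE` are hypothetical names, and only the narrowMatch to `biolink:subclass_of` mapping comes from this PR; the other three edge choices are assumed.

```python
EDGE_FOR_PREDICATE = {
    "skos:exactMatch": "biolink:same_as",         # assumption
    "skos:closeMatch": "biolink:close_match",     # assumption
    "skos:narrowMatch": "biolink:subclass_of",    # stated in this PR
    "skos:broadMatch": "biolink:superclass_of",   # assumption (inverse of narrow)
}

# The two-row Vermont_Soil pattern described in the summary.
VERMONT_SOIL = [
    {
        "subject_id": "MIM:Vermont_Soil",
        "predicate_id": "skos:exactMatch",
        "object_id": "kgmicrobe.ingredient:vermont_soil",
    },
    {
        "subject_id": "MIM:Vermont_Soil",
        "predicate_id": "skos:narrowMatch",
        "object_id": "ENVO:00001998",
    },
]


def resolve_primary_id(rows, subject_id):
    """Resolve MIM:<slug> through the registry row only.

    The narrowMatch row is never consulted, so the OBO parent cannot be
    mistaken for the ingredient's own identity.
    """
    for row in rows:
        if (
            row["subject_id"] == subject_id
            and row["predicate_id"] == "skos:exactMatch"
            and row["object_id"].startswith("kgmicrobe.ingredient:")
        ):
            return row["object_id"]
    return None


def to_edge(row):
    """Turn one mapping row into a (subject, edge predicate, object) triple."""
    return (row["subject_id"], EDGE_FOR_PREDICATE[row["predicate_id"]], row["object_id"])
```

With these rows, `resolve_primary_id(VERMONT_SOIL, "MIM:Vermont_Soil")` yields the kg-microbe id, and the ENVO row becomes a subclass edge rather than an identity claim.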

What's in README.md

  • One-paragraph "Mapping Semantics" pointer section after the existing repo layout block
  • One bullet in the Documentation list pointing at MAPPING_SEMANTICS.md

Forward-references

The doc references the Group B rules (B1–B4) by id even though the PR that implements them (PR2 in the original plan's numbering, not GitHub PR #2) hasn't landed yet. This is intentional, per the original plan. If that PR changes a rule id or scope, MAPPING_SEMANTICS.md needs a follow-up tweak. Tracked.

Test plan

  • Manual review of the four predicate definitions for accuracy
  • Vermont_Soil example matches the live SSSOM's two-row pattern
  • Common-mistakes section names every rule by id (A, B1, B2, B4)
  • README points at the doc both in prose and in the canonical doc index
  • No data-file changes (docs-only)

🤖 Generated with Claude Code

…attern)

Documents the SSSOM mapping rules that PR1 (Rule A — auto-classifier
token-overlap gate) and PR2 (Rules B1-B4 — structural invariants on
MIM:<slug> rows) enforce. Self-contained reference for curators new
to the kg-microbe codebase, written so the validator's CI rejects
land with an actionable explanation.
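
As an illustration of what a token-overlap gate like Rule A could look like, here is a minimal sketch. The actual logic in scripts/validate_sssom_invariants.py is not shown in this PR, so the tokenizer (lowercased alphanumeric runs) and the names `label_tokens` and `has_token_overlap` are assumptions.

```python
import re


def label_tokens(label):
    """Split a label into lowercase alphanumeric tokens (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", label.lower()))


def has_token_overlap(subject_label, object_label):
    """True when the two labels share at least one token."""
    return bool(label_tokens(subject_label) & label_tokens(object_label))
```

Under this tokenizer, `has_token_overlap("KH2PO4", "CaSO4·2H2O")` is false, which is the zero-overlap case the common-mistakes section uses for Rule A, while `has_token_overlap("Vermont Soil", "soil")` is true.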

MAPPING_SEMANTICS.md (new, 464 lines):
  1. Predicate semantics — exact wording for the four SKOS predicates:
       skos:exactMatch  = MIM:X and Y denote the SAME entity
       skos:closeMatch  = similar but not identical (don't substitute)
       skos:narrowMatch = MIM:X is a kind-of Y; consumers must emit
                          biolink:subclass_of, NEVER as identity
       skos:broadMatch  = inverse of narrowMatch
  2. Registry/identity row pattern — Vermont_Soil as the worked
     example, with the actual two-row TSV from the live SSSOM
     showing narrowMatch ENVO:00001998 + exactMatch
     kgmicrobe.ingredient:vermont_soil. Explains why the registry
     row is the single channel by which downstream consumers
     resolve MIM:<slug> without conflating with the OBO parent
     (the bug Codex round-3 caught: find_chebi_by_name("Vermont
     Soil") returning ENVO:00001998).
  3. Common mistakes — four code-fenced TSV examples each named by
     the Rule id that catches them:
       Rule A:  KH2PO4 → CaSO4·2H2O (zero token overlap)
       Rule B2: Vermont_Soil double-typed exact + narrow on same Y
       Rule B1: narrowMatch with no kgmicrobe.ingredient registry row
       Rule B4: 'soils' written instead of canonical 'soil'
  4. Curator workflow — three options when CI rejects a row:
       (a) fix the YAML and let claw regenerate
       (b) park the row in mappings/needs_curator_review.tsv
       (c) reject the proposal entirely
     Includes a sample REJECT block showing the validator stderr
     format the curator will see.

README.md (modified):
  + 1-paragraph "Mapping Semantics" section pointing at the new doc
  + 1-line entry in the Documentation list (canonical doc index)

Forward-references to Rules B1-B4 are intentional — those land in
PR2 and the docs are the contract they satisfy. If PR2 changes a
rule id or scope, MAPPING_SEMANTICS.md needs a follow-up tweak.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 4, 2026 03:34

Copilot AI left a comment


Pull request overview

Adds curator-facing documentation for how to interpret and maintain SSSOM mappings in mappings/ingredient_mappings.sssom.tsv, and links it from the project README.

Changes:

  • Added MAPPING_SEMANTICS.md describing SKOS predicate semantics, the registry/identity-row pattern, common curation mistakes, and an intended curator workflow.
  • Updated README.md to link to the new mapping-semantics documentation.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| README.md | Adds a “Mapping Semantics” section and includes the doc in the documentation index. |
| MAPPING_SEMANTICS.md | New long-form reference for mapping semantics, examples, and curator workflow guidance. |

Copilot review surfaced 9 points where MAPPING_SEMANTICS.md and the
README pointer described planned/future validator behaviour as if it
were already enforced. Today only Rule A is implemented in
scripts/validate_sssom_invariants.py; Rules B1–B4 land in the
Group B follow-up PR. Aligning the docs with the current state.

Changes:

(1) README.md:97 — predicate-list consistency. All four SKOS
    predicates now carry the `skos:` prefix (was: `skos:exactMatch`
    + bare `closeMatch`/`narrowMatch`/`broadMatch`).

(2) MAPPING_SEMANTICS.md intro — added a "Status of validator rules"
    callout explicitly noting that only Rule A is implemented and
    enforced; Rules B1–B4 are planned/deferred and described here as
    the contract the next validator PR will satisfy.

(3) MAPPING_SEMANTICS.md §2 (registry pattern) — re-worded "enforced
    by Rule B1" to "will be enforced by Rule B1 (planned)" and asks
    curators to follow the convention by hand until B1 lands.

(4) MAPPING_SEMANTICS.md §3 (common mistakes) — clarifies that today
    only Rule A rejects produce entries in needs_curator_review.tsv;
    the B-series rule ids are described so the worked TSV examples
    and antidotes describe correct curation regardless of validator
    coverage.

(5) MAPPING_SEMANTICS.md §3 Rule B4 paragraph — re-tensed all "flags
    this when..." / "warning and skips" claims to "*will* flag" /
    "will emit a warning" once Group B lands. Notes that the
    current validator implements Rule A only.

(6) MAPPING_SEMANTICS.md §4 Option A step 1 — fixed the YAML path
    from `data/curated/<status>/<ingredient>.yaml` (does not exist)
    to `data/ingredients/mapped/<Slug>.yaml` (the actual layout).
    Notes that filenames preserve subject-id case
    (`Vermont_Soil`, not `vermont_soil`).

(7) MAPPING_SEMANTICS.md §4 Option A step 3 — replaced
    `culturebotai-claw && just build-sssom` (recipe doesn't exist
    in MIM) with the correct cross-repo invocation noting the
    builder lives in the sibling claw repo.

(8) MAPPING_SEMANTICS.md §4 step 4 — added the equivalent direct
    `python3 scripts/validate_sssom_invariants.py` invocation
    alongside `just qc-sssom`, since CI uses the direct form.

(9) MAPPING_SEMANTICS.md §"Reading the validator output" —
    replaced the synthetic `REJECT  Rule A ...` block with the
    actual validator stderr format:
      FAIL: N row(s) in <file> fail Rule A (auto-classifier ...)
        row N: <subject_id> '<subject>' -> <object_id> '<object>' — <reason>
    Removed the exit-code 1 "warnings only" claim (no such mode in
    the current validator); kept it as planned-when-Group-B-lands.

(10) MAPPING_SEMANTICS.md §"Where the rules live" — corrected the
     CI workflow description: it invokes
     `python3 scripts/validate_sssom_invariants.py` directly, not
     `just qc-sssom`. Both reach the same code; the doc now matches
     the workflow YAML verbatim.
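
For curators who want to post-process rejects, the per-row stderr format quoted in item (9) can be parsed with a short script. This is a sketch under assumptions: `parse_reject_rows` and `ROW_LINE` are hypothetical names, the sample ids and labels are made up, and the elided part of the `FAIL:` header is left out rather than guessed.

```python
import re

# Matches:  row N: <subject_id> '<subject>' -> <object_id> '<object>' — <reason>
ROW_LINE = re.compile(
    r"^\s*row (?P<row>\d+): (?P<subject_id>\S+) '(?P<subject>.*?)'"
    r" -> (?P<object_id>\S+) '(?P<object>.*?)' — (?P<reason>.*)$"
)


def parse_reject_rows(stderr_text):
    """Collect one dict per 'row N: ...' line; other lines are ignored."""
    rejects = []
    for line in stderr_text.splitlines():
        match = ROW_LINE.match(line)
        if match:
            fields = match.groupdict()
            fields["row"] = int(fields["row"])
            rejects.append(fields)
    return rejects


# Made-up sample in the documented shape; not real validator output.
SAMPLE = (
    "FAIL: 1 row(s) in example.sssom.tsv fail Rule A\n"
    "  row 7: MIM:Example 'example a' -> EX:0000001 'unrelated b' — placeholder reason\n"
)
```

Calling `parse_reject_rows(SAMPLE)` returns a single record with the row number, both ids, both labels, and the reason, which could then be appended to a review file by hand.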

No data-file changes; docs-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@realmarcin merged commit c718245 into main on May 4, 2026
1 check passed
@realmarcin deleted the docs/mapping-semantics branch on May 4, 2026 04:32