Docs: MAPPING_SEMANTICS.md (predicate semantics + registry pattern)#3
Merged
realmarcin merged 2 commits into main from May 4, 2026
…attern)
Documents the SSSOM mapping rules that PR1 (Rule A — auto-classifier
token-overlap gate) and PR2 (Rules B1-B4 — structural invariants on
MIM:<slug> rows) enforce. Self-contained reference for curators new
to the kg-microbe codebase, written so the validator's CI rejects
land with an actionable explanation.
MAPPING_SEMANTICS.md (new, 464 lines):
1. Predicate semantics — exact wording for the four SKOS predicates:
skos:exactMatch = MIM:X and Y denote the SAME entity
skos:closeMatch = similar but not identical (don't substitute)
skos:narrowMatch = MIM:X is a kind-of Y; consumers must emit
biolink:subclass_of, NEVER as identity
skos:broadMatch = inverse of narrowMatch
2. Registry/identity row pattern — Vermont_Soil as the worked
example, with the actual two-row TSV from the live SSSOM
showing narrowMatch ENVO:00001998 + exactMatch
kgmicrobe.ingredient:vermont_soil. Explains why the registry
row is the single channel by which downstream consumers
resolve MIM:<slug> without conflating with the OBO parent
(the bug Codex round-3 caught: find_chebi_by_name("Vermont
Soil") returning ENVO:00001998).
3. Common mistakes — four code-fenced TSV examples each named by
the Rule id that catches them:
Rule A: KH2PO4 → CaSO4·2H2O (zero token overlap)
Rule B2: Vermont_Soil double-typed exact + narrow on same Y
Rule B1: narrowMatch with no kgmicrobe.ingredient registry row
Rule B4: 'soils' written instead of canonical 'soil'
4. Curator workflow — three options when CI rejects a row:
(a) fix the YAML and let claw regenerate
(b) park the row in mappings/needs_curator_review.tsv
(c) reject the proposal entirely
Includes a sample REJECT block showing the validator stderr
format the curator will see.
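The predicate rules in item 1 can be sketched as a small consumer-side lookup. This is a minimal illustration, not the kg-microbe implementation: the only mapping the doc states is skos:narrowMatch to biolink:subclass_of, and the direction follows this PR's wording (MIM:X is a kind-of Y). The `Edge` type, the `edges_for_mapping` helper, and the Biolink predicates chosen for exactMatch and closeMatch are assumptions made for the example.

```python
from typing import List, NamedTuple


class Edge(NamedTuple):
    subject: str
    predicate: str
    object: str


def edges_for_mapping(subject_id: str, predicate_id: str, object_id: str) -> List[Edge]:
    """Translate one SSSOM row into graph edges (hypothetical helper)."""
    if predicate_id == "skos:exactMatch":
        # Identity: the two ids denote the same entity (target predicate assumed).
        return [Edge(subject_id, "biolink:same_as", object_id)]
    if predicate_id == "skos:narrowMatch":
        # Stated by the doc: subject is a kind-of object, never identity.
        return [Edge(subject_id, "biolink:subclass_of", object_id)]
    if predicate_id == "skos:broadMatch":
        # Inverse of narrowMatch, so the subclass edge points the other way.
        return [Edge(object_id, "biolink:subclass_of", subject_id)]
    if predicate_id == "skos:closeMatch":
        # Similar but not substitutable (target predicate assumed).
        return [Edge(subject_id, "biolink:close_match", object_id)]
    raise ValueError(f"unsupported predicate: {predicate_id}")


narrow = edges_for_mapping("MIM:Vermont_Soil", "skos:narrowMatch", "ENVO:00001998")
exact = edges_for_mapping(
    "MIM:Vermont_Soil", "skos:exactMatch", "kgmicrobe.ingredient:vermont_soil"
)
```

With the two Vermont_Soil rows from item 2, only the registry row yields an identity edge; the ENVO row yields a subclass edge, which is what keeps a name lookup from returning the OBO parent as if it were the ingredient.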
README.md (modified):
+ 1-paragraph "Mapping Semantics" section pointing at the new doc
+ 1-line entry in the Documentation list (canonical doc index)
Forward-references to Rules B1-B4 are intentional — those land in
PR2 and the docs are the contract they satisfy. If PR2 changes a
rule id or scope, MAPPING_SEMANTICS.md needs a follow-up tweak.
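Since Rules B1 and B2 are described here before they are implemented, a rough sketch may help readers picture the contract. It assumes the standard SSSOM column names (subject_id, predicate_id, object_id) and reads B1 and B2 the way the "Common mistakes" list words them; the function names and the exact scope of each rule are guesses, and PR2 may define them differently.

```python
import csv
from collections import defaultdict

REGISTRY_PREFIX = "kgmicrobe.ingredient:"


def load_rows(tsv_text):
    """Parse SSSOM TSV text, skipping '#' metadata lines."""
    lines = [ln for ln in tsv_text.splitlines() if ln and not ln.startswith("#")]
    return list(csv.DictReader(lines, delimiter="\t"))


def check_structural_invariants(rows):
    """Return (rule_id, subject_id, message) tuples for B1/B2-style problems."""
    predicates_by_pair = defaultdict(set)
    has_registry, has_narrow = set(), set()
    for row in rows:
        s, p, o = row["subject_id"], row["predicate_id"], row["object_id"]
        predicates_by_pair[(s, o)].add(p)
        if p == "skos:exactMatch" and o.startswith(REGISTRY_PREFIX):
            has_registry.add(s)
        if p == "skos:narrowMatch":
            has_narrow.add(s)

    problems = []
    # B1 (as read here): every narrowMatch subject needs a registry row.
    for s in sorted(has_narrow - has_registry):
        problems.append(("B1", s, "narrowMatch with no kgmicrobe.ingredient registry row"))
    # B2 (as read here): one (subject, object) pair must not be both exact and narrow.
    for (s, o), preds in sorted(predicates_by_pair.items()):
        if {"skos:exactMatch", "skos:narrowMatch"} <= preds:
            problems.append(("B2", s, f"exactMatch and narrowMatch both target {o}"))
    return problems


HEADER = "subject_id\tpredicate_id\tobject_id\n"
GOOD = HEADER + (
    "MIM:Vermont_Soil\tskos:narrowMatch\tENVO:00001998\n"
    "MIM:Vermont_Soil\tskos:exactMatch\tkgmicrobe.ingredient:vermont_soil\n"
)
BAD = HEADER + (
    "MIM:Vermont_Soil\tskos:narrowMatch\tENVO:00001998\n"
    "MIM:Vermont_Soil\tskos:exactMatch\tENVO:00001998\n"
)

good_problems = check_structural_invariants(load_rows(GOOD))
bad_problems = check_structural_invariants(load_rows(BAD))
```

The GOOD sample is the two-row Vermont_Soil pattern from item 2; the BAD sample double-types the ENVO target and drops the registry row, so it trips both checks.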
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Adds curator-facing documentation for how to interpret and maintain SSSOM mappings in mappings/ingredient_mappings.sssom.tsv, and links it from the project README.
Changes:
- Added MAPPING_SEMANTICS.md describing SKOS predicate semantics, the registry/identity-row pattern, common curation mistakes, and an intended curator workflow.
- Updated README.md to link to the new mapping-semantics documentation.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| README.md | Adds a “Mapping Semantics” section and includes the doc in the documentation index. |
| MAPPING_SEMANTICS.md | New long-form reference for mapping semantics, examples, and curator workflow guidance. |
Copilot review surfaced 9 points where MAPPING_SEMANTICS.md and the
README pointer described planned/future validator behaviour as if it
were already enforced. Today only Rule A is implemented in
scripts/validate_sssom_invariants.py; Rules B1–B4 land in the
Group B follow-up PR. Aligning the docs with the current state.
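For readers who have not seen Rule A, a token-overlap gate can be approximated as follows. This is only a sketch of the general idea: the tokenizer (lowercase, split on anything that is not a letter or digit) and the function names are assumptions, and the real logic in scripts/validate_sssom_invariants.py may tokenize or threshold differently.

```python
import re


def label_tokens(label):
    """Lowercase a label and split it on non-alphanumeric characters."""
    return {tok for tok in re.split(r"[^a-z0-9]+", label.lower()) if tok}


def has_token_overlap(subject_label, object_label):
    """True when the two labels share at least one token."""
    return bool(label_tokens(subject_label) & label_tokens(object_label))


# The Rule A example from the doc: no shared token, so the gate rejects it.
rejected = not has_token_overlap("KH2PO4", "CaSO4·2H2O")
# A plausible pair: 'Vermont_Soil' and 'soil' share the token 'soil'.
accepted = has_token_overlap("Vermont_Soil", "soil")
```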
Changes:
(1) README.md:97 — predicate-list consistency. All four SKOS
predicates now carry the `skos:` prefix (was: `skos:exactMatch`
+ bare `closeMatch`/`narrowMatch`/`broadMatch`).
(2) MAPPING_SEMANTICS.md intro — added a "Status of validator rules"
callout explicitly noting that only Rule A is implemented and
enforced; Rules B1–B4 are planned/deferred and described here as
the contract the next validator PR will satisfy.
(3) MAPPING_SEMANTICS.md §2 (registry pattern) — re-worded "enforced
by Rule B1" to "will be enforced by Rule B1 (planned)" and asks
curators to follow the convention by hand until B1 lands.
(4) MAPPING_SEMANTICS.md §3 (common mistakes) — clarifies that today
only Rule A rejects produce entries in needs_curator_review.tsv;
the B-series rule ids are described so the worked TSV examples
and antidotes describe correct curation regardless of validator
coverage.
(5) MAPPING_SEMANTICS.md §3 Rule B4 paragraph — re-tensed all "flags
this when..." / "warning and skips" claims to "*will* flag" /
"will emit a warning" once Group B lands. Notes that the
current validator implements Rule A only.
(6) MAPPING_SEMANTICS.md §4 Option A step 1 — fixed the YAML path
from `data/curated/<status>/<ingredient>.yaml` (does not exist)
to `data/ingredients/mapped/<Slug>.yaml` (the actual layout).
Notes that filenames preserve subject-id case
(`Vermont_Soil`, not `vermont_soil`).
(7) MAPPING_SEMANTICS.md §4 Option A step 3 — replaced
`culturebotai-claw && just build-sssom` (recipe doesn't exist
in MIM) with the correct cross-repo invocation noting the
builder lives in the sibling claw repo.
(8) MAPPING_SEMANTICS.md §4 step 4 — added the equivalent direct
`python3 scripts/validate_sssom_invariants.py` invocation
alongside `just qc-sssom`, since CI uses the direct form.
(9) MAPPING_SEMANTICS.md §"Reading the validator output" —
replaced the synthetic `REJECT Rule A ...` block with the
actual validator stderr format:
FAIL: N row(s) in <file> fail Rule A (auto-classifier ...)
row N: <subject_id> '<subject>' -> <object_id> '<object>' — <reason>
Removed the exit-code 1 "warnings only" claim (no such mode in
the current validator); kept it as planned-when-Group-B-lands.
(10) MAPPING_SEMANTICS.md §"Where the rules live" — corrected the
CI workflow description: it invokes
`python3 scripts/validate_sssom_invariants.py` directly, not
`just qc-sssom`. Both reach the same code; the doc now matches
the workflow YAML verbatim.
No data-file changes; docs-only.
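The stderr format quoted in item (9) is regular enough to parse when a curator wants the rejected rows as data. The sketch below assumes the per-row line looks exactly like the template shown above; `RejectedRow`, `parse_rejected_rows`, and the sample values (row number, object id, reason text) are invented for illustration and do not come from a real validator run.

```python
import re
from typing import List, NamedTuple


class RejectedRow(NamedTuple):
    row: int
    subject_id: str
    subject_label: str
    object_id: str
    object_label: str
    reason: str


# \u2014 is the dash that separates the object label from the reason.
ROW_RE = re.compile(
    r"^\s*row (?P<row>\d+): (?P<subject_id>\S+) '(?P<subject_label>.*?)'"
    r" -> (?P<object_id>\S+) '(?P<object_label>.*?)' \u2014 (?P<reason>.*)$"
)


def parse_rejected_rows(stderr_text: str) -> List[RejectedRow]:
    """Collect every 'row N: ...' line; the FAIL summary line is ignored."""
    parsed = []
    for line in stderr_text.splitlines():
        m = ROW_RE.match(line)
        if m:
            parsed.append(
                RejectedRow(
                    int(m["row"]),
                    m["subject_id"],
                    m["subject_label"],
                    m["object_id"],
                    m["object_label"],
                    m["reason"],
                )
            )
    return parsed


# Made-up sample following the documented template.
SAMPLE = (
    "FAIL: 1 row(s) in mappings/ingredient_mappings.sssom.tsv fail Rule A (auto-classifier ...)\n"
    "  row 12: MIM:KH2PO4 'KH2PO4' -> EX:0000001 'CaSO4·2H2O' \u2014 zero token overlap\n"
)
rows = parse_rejected_rows(SAMPLE)
```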
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds MAPPING_SEMANTICS.md, a self-contained reference for curators working on mappings/ingredient_mappings.sssom.tsv. Lands the documentation contract that the validator at scripts/validate_sssom_invariants.py (Rule A, merged in PR #2) and the deferred Group B rules (B1–B4) enforce.
This is PR3 of the 3-PR Codex-#558 hardening pass:
What's in MAPPING_SEMANTICS.md (464 lines)
- Predicate semantics: … (exactMatch, closeMatch, narrowMatch, broadMatch). The narrowMatch definition is verbatim about the Codex-#558 round-3 finding: "Downstream consumers MUST emit this as biolink:subclass_of (or rdfs:subClassOf), NEVER as identity."
- Registry/identity row pattern: … MIM:Vermont_Soil → kgmicrobe.ingredient:vermont_soil exactMatch pairs with MIM:Vermont_Soil → ENVO:00001998 narrowMatch. Explains why the registry row is the single channel for downstream consumers to resolve MIM:<slug> to its kg-microbe primary id without conflating with the OBO parent.
- Common mistakes: … (MIM:X, ENVO:00001998) under both exactMatch and narrowMatch … 'soils' written instead of canonical 'soil'
- Curator workflow: … mappings/needs_curator_review.tsv, (c) reject the proposal entirely. Plus a sample REJECT block showing the validator stderr format.
What's in README.md
… MAPPING_SEMANTICS.md
Forward-references
The doc references PR2's Rule B1–B4 by id even though that PR hasn't landed yet — intentional, per the original plan. If PR2 changes a rule id or scope, MAPPING_SEMANTICS.md needs a follow-up tweak. Tracked.
Test plan
🤖 Generated with Claude Code