Group B: Rules B1–B4 — registry-row mandate, no double-typed pairs, asymmetric-only, canonical object_label #4
Open
realmarcin wants to merge 1 commit into main
…symmetric-only, canonical object_label

Extends scripts/validate_sssom_invariants.py with the four B-series structural invariants documented in MAPPING_SEMANTICS.md (PR #3) and specified by the Codex-#558 round-3 hardening:

- Rule B1: every MIM:<slug> subject with a skos:narrowMatch / skos:broadMatch row must have a sibling skos:exactMatch row whose object_id matches kgmicrobe.(ingredient|compound):<slug_lc>. Ships warn-only by default; pass --strict-b1 to convert to a hard reject. Staged because the current SSSOM has 162 narrowMatch subjects whose exactMatch row points at cas:/mesh:/NCIT instead of kgmicrobe.*; warn-only lets this validator land alongside the rest of Group B while the kg-microbe registry is being minted out.
- Rule B2: at most one row per (subject_id, object_id) pair. Catches double-typed rows like (MIM:X exactMatch Y) + (MIM:X narrowMatch Y).
- Rule B3: cross-row variant of B2. For any narrowMatch/broadMatch pair, no skos:exactMatch may exist for the same (subject, object).
- Rule B4: canonical object_label drift. Loads ../kg-microbe/data/transformed/ontologies/<prefix>_nodes.tsv for CHEBI/FOODON/UBERON/ENVO/BTO/MICRO/PATO and rejects rows whose object_label matches neither the canonical name (column 3) nor any pipe-delimited exact synonym (column 7). Warn-and-skip when the transform file is absent (the typical CI configuration); B4 does not contribute to exit-2 in that case. Skips terms with empty canonical name and empty synonym set (CHEBI/EBI placeholder nodes).

Rule A logic and the SSSOM I/O layer are untouched. Per-row stderr reporting is consolidated across rules; rejects continue to flow into mappings/needs_curator_review.tsv with reject_reason naming the rule.
Validates against current SSSOM:

- default mode: exit 0 (Rule A: 0, B1: 162 warnings, B2: 0, B3: 0, B4: 0 with one BTO transform-missing warning)
- --strict-b1: exit 2 (162 B1 rejects)
- just qc-sssom: exit 0
- synthetic test exercising every rule: each rule fires with the correct reject_reason

Once strict B1 is enabled in CI, kg-microbe's purge_asymmetric_pollution() and KNOWN_BAD_NARROWMATCH filter become redundant and can be removed in a downstream kg-microbe PR.

Refs: MAPPING_SEMANTICS.md §1–§3 (PR #3), Rule A scaffolding (PR #2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Extends the stdlib-only SSSOM validator (scripts/validate_sssom_invariants.py) to enforce the Group B structural invariants (B1–B4) defined in MAPPING_SEMANTICS.md, including a staged warn-only default for B1 with an opt-in strict mode.
Changes:
- Added validators for Rules B1–B4 (registry-row mandate, no duplicate subject/object pairs, asymmetric-only constraints, canonical object_label checks via kg-microbe transforms).
- Added --strict-b1 flag and updated reject aggregation/reporting to include multiple rule reasons.
- Added warn-and-skip behavior for B4 when kg-microbe transform files are not present.
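The "no duplicate subject/object pairs" check (Rule B2) listed above amounts to grouping rows by their (subject_id, object_id) pair. The following is a minimal sketch of that idea, not the code from this PR: the function name `find_duplicate_pairs` and the sample rows are hypothetical, and `predicate_id` is assumed to be the SSSOM predicate column.

```python
from collections import defaultdict


def find_duplicate_pairs(rows):
    """Return indices of rows whose (subject_id, object_id) pair occurs more than once."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        key = (row.get("subject_id", "").strip(), row.get("object_id", "").strip())
        groups[key].append(index)
    # Every member of an oversized group is reported, not only the later rows.
    return sorted(i for members in groups.values() if len(members) > 1 for i in members)


# Hypothetical sample: MIM:x is double-typed against the same object.
rows = [
    {"subject_id": "MIM:x", "predicate_id": "skos:exactMatch", "object_id": "CHEBI:0000001"},
    {"subject_id": "MIM:x", "predicate_id": "skos:narrowMatch", "object_id": "CHEBI:0000001"},
    {"subject_id": "MIM:y", "predicate_id": "skos:exactMatch", "object_id": "CHEBI:0000002"},
]
print(find_duplicate_pairs(rows))
```

Reporting both rows of a duplicated pair, rather than only the second, leaves the choice of which row to keep to the curator.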
f"Rule B1: missing registry row for {subject_id}. A "
f"{predicate} row requires a sibling 'skos:exactMatch' row "
f"with object_id matching kgmicrobe.(ingredient|compound):"
f"{slug}.{other_hint} Mint the registry CURIE and re-emit "
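The message above is emitted when a subject fails Rule B1. A minimal sketch of how such subjects might be detected is shown below. It is an illustration under stated assumptions, not the PR's implementation: `subjects_missing_registry_row` is a hypothetical name, `<slug_lc>` is assumed to be the plain lowercased slug, `predicate_id` is assumed to be the predicate column, and the object identifiers in the sample are made up.

```python
import re

ASYMMETRIC = {"skos:narrowMatch", "skos:broadMatch"}


def subjects_missing_registry_row(rows):
    """Return MIM subjects that have a narrow/broad row but no kgmicrobe.* exactMatch row."""
    needs_registry = set()
    exact_objects = {}
    for row in rows:
        subject = row["subject_id"]
        if not subject.startswith("MIM:"):
            continue
        if row["predicate_id"] in ASYMMETRIC:
            needs_registry.add(subject)
        elif row["predicate_id"] == "skos:exactMatch":
            exact_objects.setdefault(subject, set()).add(row["object_id"])

    missing = []
    for subject in sorted(needs_registry):
        # Assumption: <slug_lc> is the subject slug lowercased, nothing more.
        slug_lc = subject.split(":", 1)[1].lower()
        pattern = re.compile(r"^kgmicrobe\.(ingredient|compound):" + re.escape(slug_lc) + "$")
        if not any(pattern.match(obj) for obj in exact_objects.get(subject, ())):
            missing.append(subject)
    return missing


# Hypothetical rows; the ENVO/CHEBI/cas identifiers are placeholders.
rows = [
    {"subject_id": "MIM:Vermont_Soil", "predicate_id": "skos:narrowMatch", "object_id": "ENVO:0000000"},
    {"subject_id": "MIM:Vermont_Soil", "predicate_id": "skos:exactMatch", "object_id": "kgmicrobe.ingredient:vermont_soil"},
    {"subject_id": "MIM:Agar", "predicate_id": "skos:narrowMatch", "object_id": "CHEBI:0000000"},
    {"subject_id": "MIM:Agar", "predicate_id": "skos:exactMatch", "object_id": "cas:0000-00-0"},
]
print(subjects_missing_registry_row(rows))
```

In this sample only MIM:Agar is reported, because its sole exactMatch row points at a cas: identifier rather than a kgmicrobe.* registry CURIE.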
Comment on lines +425 to +433
def evaluate_rule_b3(
    rows: list[dict[str, str]],
) -> Iterator[tuple[int, dict[str, str], str]]:
    """Yield every narrow/broad row whose ``(subject_id, object_id)``
    pair also appears under ``skos:exactMatch`` in some other row.
    B2 catches this when both rows live in the file with identical
    tuples; B3 is the cross-row variant where the rows might differ in
    columns other than subject_id/object_id."""
    exact_pairs: set[tuple[str, str]] = set()
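The excerpt stops right after `exact_pairs` is initialised. One plausible way the rest of the check could proceed is sketched below: collect every exactMatch pair first, then flag narrow/broad rows that reuse one. This is a guess at the shape, not the PR's code; the name `evaluate_rule_b3_sketch`, the `predicate_id` column, the reason wording, and the sample rows are assumptions.

```python
from typing import Iterator

ASYMMETRIC = {"skos:narrowMatch", "skos:broadMatch"}


def _pair(row: dict[str, str]) -> tuple[str, str]:
    """Normalise the (subject_id, object_id) key so whitespace differences do not matter."""
    return (row.get("subject_id", "").strip(), row.get("object_id", "").strip())


def evaluate_rule_b3_sketch(
    rows: list[dict[str, str]],
) -> Iterator[tuple[int, dict[str, str], str]]:
    # First pass: remember every pair asserted as an exact match.
    exact_pairs: set[tuple[str, str]] = set()
    for row in rows:
        if row.get("predicate_id") == "skos:exactMatch":
            exact_pairs.add(_pair(row))
    # Second pass: a narrow/broad row must not reuse one of those pairs.
    for index, row in enumerate(rows):
        if row.get("predicate_id") in ASYMMETRIC and _pair(row) in exact_pairs:
            subject, obj = _pair(row)
            yield index, row, (
                f"Rule B3: {subject} has both skos:exactMatch and "
                f"{row['predicate_id']} to {obj}"
            )


# Hypothetical rows; note the trailing space that a naive comparison would miss.
rows = [
    {"subject_id": "MIM:x", "predicate_id": "skos:exactMatch", "object_id": "CHEBI:0000001"},
    {"subject_id": "MIM:x ", "predicate_id": "skos:narrowMatch", "object_id": "CHEBI:0000001"},
    {"subject_id": "MIM:y", "predicate_id": "skos:broadMatch", "object_id": "CHEBI:0000002"},
]
violations = list(evaluate_rule_b3_sketch(rows))
print([index for index, _, _ in violations])
```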
Comment on lines +70 to +71
matches neither the canonical name (column 3) nor any pipe-delimited
exact-synonym (column 7), the row is rejected.
Comment on lines +474 to +483
reader = csv.DictReader(f, delimiter="\t")
if reader.fieldnames is None:
    return out
for row in reader:
    term_id = (row.get("id") or "").strip()
    if not term_id:
        continue
    name = (row.get("name") or "").strip()
    syn_raw = (row.get("synonym") or "").strip()
    synonyms = {
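The excerpt is cut off in the middle of the synonym set. A self-contained sketch of a loader in the same spirit follows. How the set is completed and the shape of the returned mapping are assumptions, as are the name `load_labels` and the sample data; the sample TSV also carries fewer columns than a real kg-microbe nodes file, which is harmless here because `csv.DictReader` looks columns up by header name.

```python
import csv
import io


def load_labels(handle):
    """Map term id -> (canonical name, set of exact synonyms) from a nodes TSV."""
    out = {}
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None:  # empty file: no header row at all
        return out
    for row in reader:
        term_id = (row.get("id") or "").strip()
        if not term_id:
            continue
        name = (row.get("name") or "").strip()
        syn_raw = (row.get("synonym") or "").strip()
        # Assumed completion: split on the pipe delimiter and drop empty pieces.
        synonyms = {part.strip() for part in syn_raw.split("|") if part.strip()}
        out[term_id] = (name, synonyms)
    return out


# Made-up sample; the second term mimics a placeholder node with no name or synonyms.
sample = io.StringIO(
    "id\tcategory\tname\tsynonym\n"
    "CHEBI:0000001\tbiolink:ChemicalEntity\texample acid\texample-ic acid|EA\n"
    "CHEBI:0000002\tbiolink:ChemicalEntity\t\t\n"
)
labels = load_labels(sample)
print(labels["CHEBI:0000001"][0])
```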
Comment on lines +492 to +505
    *,
    transforms_present: dict[str, bool] | None = None,
) -> Iterator[tuple[int, dict[str, str], str]]:
    """Yield rows whose ``object_label`` doesn't match the canonical
    name or any exact synonym in the local kg-microbe ontology
    transform. Skips rows whose object_id prefix isn't in
    ``B4_PREFIXES`` and skips entirely (yields nothing) for any
    prefix whose transform file is absent — caller has already emitted
    the warn-and-skip stderr line.

    ``transforms_present`` is an optional caller-supplied dict
    populated by load attempts so the caller can decide which prefixes
    to warn about. When None, this function loads transforms lazily
    and silently skips absent ones."""
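The docstring above describes three skip conditions and one reject condition. The sketch below shows how they might fit together once the label tables are already in memory. It is not the PR's function: `label_drift_sketch` and the `labels_by_prefix` layout are hypothetical, the comparison is assumed to be an exact, case-sensitive string match, and terms missing from a loaded table are assumed to be skipped.

```python
from typing import Iterator

B4_PREFIXES = {"CHEBI", "FOODON", "UBERON", "ENVO", "BTO", "MICRO", "PATO"}


def label_drift_sketch(
    rows: list[dict[str, str]],
    labels_by_prefix: dict[str, dict[str, tuple[str, set[str]]]],
) -> Iterator[tuple[int, dict[str, str], str]]:
    for index, row in enumerate(rows):
        object_id = row.get("object_id", "").strip()
        prefix = object_id.split(":", 1)[0]
        if prefix not in B4_PREFIXES:
            continue  # prefix outside the rule's scope
        table = labels_by_prefix.get(prefix)
        if table is None:
            continue  # transform absent: the caller is assumed to have warned already
        entry = table.get(object_id)
        if entry is None:
            continue  # assumption: terms unknown to the transform are not rejected
        name, synonyms = entry
        accepted = ({name} | synonyms) - {""}
        if not accepted:
            continue  # placeholder node with no name and no synonyms
        label = row.get("object_label", "").strip()
        if label not in accepted:
            yield index, row, (
                f"Rule B4: object_label {label!r} for {object_id} matches neither "
                f"the canonical name {name!r} nor an exact synonym"
            )


# Made-up label table and rows; only the CHEBI table is "present".
labels_by_prefix = {
    "CHEBI": {
        "CHEBI:0000001": ("example acid", {"EA"}),
        "CHEBI:0000002": ("", set()),
    },
}
rows = [
    {"object_id": "CHEBI:0000001", "object_label": "example acid"},
    {"object_id": "CHEBI:0000001", "object_label": "EA"},
    {"object_id": "CHEBI:0000001", "object_label": "Example Acid (old)"},
    {"object_id": "CHEBI:0000002", "object_label": "anything"},
    {"object_id": "BTO:0000001", "object_label": "anything"},
    {"object_id": "cas:0000-00-0", "object_label": "anything"},
]
drift = list(label_drift_sketch(rows, labels_by_prefix))
print([index for index, _, _ in drift])
```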
Comment on lines +674 to +677
b1_label = "B1" if args.strict_b1 else "B1(warn-only)"
rule_summary = f"Rules A, {b1_label}, B2, B3"
if "Rule B4" in rule_counts or not missing_prefixes:
    rule_summary += ", B4"
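The summary label above reflects the staged behavior of B1: violations are always counted, but they only affect the exit status under --strict-b1. A minimal sketch of that decision is given below. The helper `exit_status` and the layout of `counts` are hypothetical; the exit codes 0 and 2 and the figure of 162 B1 entries are taken from this PR's description.

```python
import argparse


def exit_status(rule_counts: dict[str, int], strict_b1: bool) -> int:
    """Return 2 if any enforced rule has violations, else 0."""
    for rule, count in rule_counts.items():
        if rule == "Rule B1" and not strict_b1:
            continue  # warn-only: reported on stderr but not treated as a reject
        if count > 0:
            return 2
    return 0


parser = argparse.ArgumentParser()
parser.add_argument("--strict-b1", action="store_true")

counts = {"Rule A": 0, "Rule B1": 162, "Rule B2": 0, "Rule B3": 0, "Rule B4": 0}
default_args = parser.parse_args([])
strict_args = parser.parse_args(["--strict-b1"])
print(exit_status(counts, default_args.strict_b1))  # 0
print(exit_status(counts, strict_args.strict_b1))   # 2
```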
Summary
Extends scripts/validate_sssom_invariants.py with the four B-series structural invariants documented in MAPPING_SEMANTICS.md (merged via #3) and specified by the Codex-#558 round-3 hardening. Closes the Group B lane that complements PR #2 (Group A: Rule A) and PR #3 (Group C: docs).
The validator stays stdlib-only (csv + argparse), preserves Rule A verbatim, and continues to write rejects to mappings/needs_curator_review.tsv with a reject_reason column that names the rule.
Rules added
Rule B1: mandatory registry/identity row (MAPPING_SEMANTICS.md §2). Every MIM:<slug> subject with at least one skos:narrowMatch or skos:broadMatch row must have a sibling skos:exactMatch row whose object_id matches ^kgmicrobe\.(ingredient|compound):<slug_lc>$. Without this row, a downstream consumer that walks the SSSOM by subject_label resolves the MIM child to its OBO parent, which is the identity-collapse bug flagged in Codex review #558 round 3.
Staged rollout: B1 ships warn-only by default. Pass --strict-b1 to convert to a hard reject. Rationale: the current SSSOM has 162 narrowMatch subjects whose exactMatch row, where one exists, points at cas: / mesh: / NCIT: registries instead of kgmicrobe.*. Those rows partially solve the identity-collapse problem (consumer lookups by CAS-RN don't collapse to the OBO parent), but they don't satisfy the strict contract. Warn-only mode lets the validator implementation ship now; flip --strict-b1 (and add it to just qc-sssom / the CI workflow) once every narrow/broad subject carries a kgmicrobe.* registry exactMatch row.
Rule B2: at most one row per (subject_id, object_id) pair (MAPPING_SEMANTICS.md §3 Mistake 2). Group every row by (subject_id, object_id); reject every row participating in a group of size > 1. Catches double-typed rows like (MIM:X exactMatch Y) + (MIM:X narrowMatch Y) directly.
Rule B3: asymmetric child→parent only (MAPPING_SEMANTICS.md §1 narrowMatch / §3 Mistake 2). Cross-row variant of B2: any (subject, object) pair appearing in a skos:narrowMatch (or skos:broadMatch) row must NOT also appear in a skos:exactMatch row, even if the rows differ in trivia (whitespace, source tag, comment).
Rule B4: canonical object_label drift (MAPPING_SEMANTICS.md §3 Mistake 4). For every row whose object_id prefix is in {CHEBI, FOODON, UBERON, ENVO, BTO, MICRO, PATO}, look up the canonical label in the local sibling kg-microbe checkout at ../kg-microbe/data/transformed/ontologies/<prefix_lc>_nodes.tsv and reject rows whose object_label matches neither the canonical name (column 3) nor any pipe-delimited exact synonym (column 7). When a transform file is absent (typical CI configuration), the validator emits one stderr WARNING and skips B4 for that prefix; B4 does NOT contribute to exit-2 in that case. Terms with empty canonical name AND empty synonym set are skipped (a few CHEBI placeholder/upper-level nodes have no name in the transform).
Validator behavior matrix
--strict-b1 | exit | WARNING: summary lines
Single PR, one file changed: scripts/validate_sssom_invariants.py (317 → 715 LOC, +441 / -43).
Test plan
- python3 scripts/validate_sssom_invariants.py → exit 0, with one B4 warning (BTO transform absent locally) and 162 B1 warn-only entries (the kg-microbe-registry-mint backlog). mappings/needs_curator_review.tsv contains only the header.
- python3 scripts/validate_sssom_invariants.py --strict-b1 → exit 2, 162 B1 rejects, 0 B2/B3/B4. The 162 violations break down into ~143 subjects whose only registry exactMatch is a CAS-RN / mesh / NCIT id, plus ~19 subjects with no exactMatch row at all. Each reject_reason names the subject's other exactMatch targets so curators see at a glance whether they need to mint a kgmicrobe.* row alongside an existing CAS row, or mint the registry row from scratch.
- just qc-sssom: exit 0 (uses default warn-only B1).
- Synthetic test exercising every rule: each rule fires with the correct reject_reason. Vermont_Soil with both narrowMatch ENVO row + kgmicrobe.ingredient: registry row → all rules pass.
- Rows whose object_id resolves to a transform entry with no name AND no synonyms (e.g. CHEBI:1, CHEBI:8150 placeholder upper-level nodes) are skipped to avoid false positives.
Next steps after merge
- Strict B1 rollout (the actual Codex-#558 fix): mint kgmicrobe.{ingredient,compound}:<slug_lc> registry rows for the 162 subjects currently in B1 warn-only. The reject_reason output of --strict-b1 mode is the worklist: it lists each subject and its existing non-registry exactMatch target so curators know whether to add or to upgrade an existing row. Once minting is complete, flip CI to use --strict-b1 (one-line change in .github/workflows/qc-sssom.yaml and justfile).
- Remove kg-microbe's compensating filters: once strict B1 is enabled in CI, kg-microbe's purge_asymmetric_pollution() and KNOWN_BAD_NARROWMATCH filter become redundant, because the SSSOM consumer no longer has to defensively scrub for identity-collapsed rows when the producer guarantees they don't exist. Open a downstream kg-microbe PR to delete those guards.
- Optional B4 enrichment in CI: at the moment Rule B4 warn-and-skips on every CI run because kg-microbe's ontology transforms aren't checked into its repo. A future iteration could publish the canonical-label tables (or a small subset) as a CI artifact, letting B4 run as a hard gate everywhere.
References
🤖 Generated with Claude Code