Group B: Rules B1–B4 — registry-row mandate, no double-typed pairs, asymmetric-only, canonical object_label #4

Open
realmarcin wants to merge 1 commit into main from validator/group-b

@realmarcin
Collaborator

Summary

Extends scripts/validate_sssom_invariants.py with the four B-series structural invariants documented in MAPPING_SEMANTICS.md (merged via #3) and specified by the Codex-#558 round-3 hardening. Closes the Group B lane that complements PR #2 (Group A: Rule A) and PR #3 (Group C: docs).

The validator stays stdlib-only (csv + argparse), preserves Rule A verbatim, and continues to write rejects to mappings/needs_curator_review.tsv with a reject_reason column that names the rule.

Rules added

  • Rule B1 — mandatory registry/identity row (MAPPING_SEMANTICS.md §2). Every MIM:<slug> subject with at least one skos:narrowMatch or skos:broadMatch row must have a sibling skos:exactMatch row whose object_id matches ^kgmicrobe\.(ingredient|compound):<slug_lc>$. Without this row a downstream consumer that walks the SSSOM by subject_label resolves the MIM child to its OBO parent — the identity-collapse bug Codex review #558 round 3 flagged.

    Staged rollout: B1 ships warn-only by default. Pass --strict-b1 to convert to a hard reject. Rationale: the current SSSOM has 162 narrowMatch subjects whose exactMatch row points at cas: / mesh: / NCIT: registries instead of kgmicrobe.*. Those rows partially solve the identity-collapse problem (consumer lookups by CAS-RN don't collapse to the OBO parent) but they don't satisfy the strict contract. Warn-only mode lets the validator implementation ship now; flip --strict-b1 (and add it to just qc-sssom / the CI workflow) once every narrow/broad subject carries a kgmicrobe.* registry exactMatch row.

  • Rule B2 — at most one row per (subject_id, object_id) pair (MAPPING_SEMANTICS.md §3 Mistake 2). Group every row by (subject_id, object_id); reject every row participating in a group of size > 1. Catches double-typed rows like (MIM:X exactMatch Y) + (MIM:X narrowMatch Y) directly.

  • Rule B3 — asymmetric child→parent only (MAPPING_SEMANTICS.md §1 narrowMatch / §3 Mistake 2). Cross-row variant of B2: any (subject, object) pair appearing in a skos:narrowMatch (or skos:broadMatch) row must NOT also appear in a skos:exactMatch row, even if the rows differ in trivia (whitespace, source tag, comment).

  • Rule B4 — canonical object_label drift (MAPPING_SEMANTICS.md §3 Mistake 4). For every row whose object_id prefix is in {CHEBI, FOODON, UBERON, ENVO, BTO, MICRO, PATO}, look up the canonical label in the local sibling kg-microbe checkout at ../kg-microbe/data/transformed/ontologies/<prefix_lc>_nodes.tsv and reject rows whose object_label matches neither the canonical name (column 3) nor any pipe-delimited exact synonym (column 7). When the transform file for a prefix is absent (the typical CI configuration), the validator emits one stderr WARNING for that prefix and skips B4 for its rows; B4 does NOT contribute to exit-2 in that case. Terms with an empty canonical name AND an empty synonym set are skipped (a few CHEBI placeholder/upper-level nodes have no name in the transform).
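
As a rough illustration of the row-level rules B1–B3 described above, here is a minimal sketch. It is not the code in scripts/validate_sssom_invariants.py: the function names, the predicate_id column name, the derivation of <slug_lc> by lowercasing the part after MIM:, and the object ids in the sample rows are all assumptions made for the example.

```python
import re
from collections import defaultdict

EXACT = "skos:exactMatch"
ASYMMETRIC = {"skos:narrowMatch", "skos:broadMatch"}


def check_b1(rows):
    """Return MIM subjects with a narrow/broad row but no registry exactMatch row."""
    needs_registry, has_registry = set(), set()
    for row in rows:
        subject = row["subject_id"]
        if not subject.startswith("MIM:"):
            continue
        # Assumption: <slug_lc> is the lowercased text after the "MIM:" prefix.
        slug_lc = subject.split(":", 1)[1].lower()
        if row["predicate_id"] in ASYMMETRIC:
            needs_registry.add(subject)
        elif row["predicate_id"] == EXACT:
            pattern = rf"^kgmicrobe\.(ingredient|compound):{re.escape(slug_lc)}$"
            if re.match(pattern, row["object_id"]):
                has_registry.add(subject)
    return sorted(needs_registry - has_registry)


def check_b2(rows):
    """Return indexes of every row in a (subject_id, object_id) group of size > 1."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[(row["subject_id"], row["object_id"])].append(index)
    return sorted(i for members in groups.values() if len(members) > 1 for i in members)


def check_b3(rows):
    """Return indexes of narrow/broad rows whose pair also appears under exactMatch."""
    exact_pairs = {
        (r["subject_id"], r["object_id"]) for r in rows if r["predicate_id"] == EXACT
    }
    return [
        i for i, r in enumerate(rows)
        if r["predicate_id"] in ASYMMETRIC
        and (r["subject_id"], r["object_id"]) in exact_pairs
    ]


# Hypothetical rows; the ENVO and CHEBI ids are placeholders, not real terms.
rows = [
    {"subject_id": "MIM:Vermont_Soil", "predicate_id": "skos:narrowMatch",
     "object_id": "ENVO:0000001"},
    {"subject_id": "MIM:Vermont_Soil", "predicate_id": "skos:exactMatch",
     "object_id": "kgmicrobe.ingredient:vermont_soil"},
    {"subject_id": "MIM:X", "predicate_id": "skos:exactMatch",
     "object_id": "CHEBI:0000000"},
    {"subject_id": "MIM:X", "predicate_id": "skos:narrowMatch",
     "object_id": "CHEBI:0000000"},
]

print(check_b1(rows))  # subjects lacking a registry row
print(check_b2(rows))  # rows in duplicated (subject, object) groups
print(check_b3(rows))  # narrow/broad rows that collide with an exactMatch row
```

In this sample, MIM:Vermont_Soil passes all three checks because its narrowMatch row is accompanied by a kgmicrobe.ingredient: registry row, while the double-typed MIM:X pair is flagged by B1, B2, and B3.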

Validator behavior matrix

| Rule | Default exit | --strict-b1 exit | Reject TSV | Notes |
|---|---|---|---|---|
| A | exit 2 on violation | (same) | yes | unchanged from PR #2 |
| B1 | warn-only (no exit-2) | exit 2 on violation | yes (only with strict) | per-row stderr WARNING: summary lines |
| B2 | exit 2 on violation | (same) | yes | every row in a duplicated group rejected |
| B3 | exit 2 on violation | (same) | yes | catches cross-row variants |
| B4 | exit 2 on violation when transform present, warn-and-skip when absent | (same) | yes (when transform present) | warns once per missing prefix |
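
The matrix can be read as a small decision function. The sketch below is illustrative only: exit_code, build_parser, and the short rule keys are hypothetical names, and the real validator may aggregate counts differently. It assumes a skipped B4 prefix simply contributes zero violations.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Exit-code sketch for the behavior matrix")
    # --strict-b1 is the flag named in this PR; argparse exposes it as args.strict_b1.
    parser.add_argument("--strict-b1", action="store_true")
    return parser


def exit_code(rule_counts, strict_b1=False):
    """Return 2 if any hard-reject rule has violations, else 0."""
    hard_rules = {"A", "B2", "B3", "B4"}
    if strict_b1:
        hard_rules.add("B1")  # B1 only gates the exit code in strict mode
    return 2 if any(rule_counts.get(rule, 0) for rule in hard_rules) else 0


# Counts mirroring the live-SSSOM run reported in the test plan.
counts = {"A": 0, "B1": 162, "B2": 0, "B3": 0, "B4": 0}
default_args = build_parser().parse_args([])
strict_args = build_parser().parse_args(["--strict-b1"])

print(exit_code(counts, default_args.strict_b1))
print(exit_code(counts, strict_args.strict_b1))
```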

Single PR, one file changed: scripts/validate_sssom_invariants.py (317 → 715 LOC, +441 / -43).

Test plan

  • Live SSSOM, default mode: python3 scripts/validate_sssom_invariants.py → exit 0, with one B4 warning (BTO transform absent locally) and 162 B1 warn-only entries (the kg-microbe-registry-mint backlog). mappings/needs_curator_review.tsv contains only the header.
  • Live SSSOM, strict mode: python3 scripts/validate_sssom_invariants.py --strict-b1 → exit 2, 162 B1 rejects, 0 B2/B3/B4. The 162 violations break down into ~143 subjects whose only registry exactMatch is a CAS-RN / mesh / NCIT id, plus ~19 subjects with no exactMatch row at all. Each reject_reason names the subject's other exactMatch targets so curators see at a glance whether they need to mint a kgmicrobe.* row alongside an existing CAS row, or mint the registry row from scratch.
  • just qc-sssom: exit 0 (uses default warn-only B1).
  • Synthetic test cases: hand-crafted SSSOM exercising each rule (in scratch files outside the repo) — every rule fires with the correct reject_reason. Vermont_Soil with both narrowMatch ENVO row + kgmicrobe.ingredient: registry row → all rules pass.
  • B4 empty-canonical handling: rows whose object_id resolves to a transform entry with no name AND no synonyms (e.g. CHEBI:1, CHEBI:8150 placeholder upper-level nodes) are skipped to avoid false positives.
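
For reviewers who want to see the B4 skip conditions in one place, here is a minimal sketch. It assumes the transform TSV exposes id, name, and synonym headers (the names used in the loader snippet under review), that labels are compared as exact strings after stripping whitespace, and that an object_id missing from the transform is skipped rather than rejected. The function names and the term ids are hypothetical.

```python
import csv
import io

B4_PREFIXES = {"CHEBI", "FOODON", "UBERON", "ENVO", "BTO", "MICRO", "PATO"}


def load_labels(handle):
    """Map term id -> set of accepted labels (canonical name plus exact synonyms)."""
    labels = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        term_id = (row.get("id") or "").strip()
        if not term_id:
            continue
        accepted = {(row.get("name") or "").strip()}
        accepted.update(s.strip() for s in (row.get("synonym") or "").split("|"))
        accepted.discard("")  # drop empty name / empty synonym entries
        labels[term_id] = accepted
    return labels


def b4_violation(row, labels_by_prefix):
    """Return True when object_label matches neither the name nor a synonym."""
    prefix = row["object_id"].split(":", 1)[0]
    if prefix not in B4_PREFIXES:
        return False  # B4 does not apply to this prefix
    labels = labels_by_prefix.get(prefix)
    if labels is None:
        return False  # transform absent: the caller warns once and skips
    accepted = labels.get(row["object_id"])
    if not accepted:
        return False  # assumption: unknown term, or no name and no synonyms
    return row["object_label"].strip() not in accepted


# Hypothetical transform content; real files carry more columns.
envo_tsv = io.StringIO(
    "id\tname\tsynonym\n"
    "ENVO:0000001\tsoil\tdirt|earth\n"
    "ENVO:0000002\t\t\n"
)
labels_by_prefix = {"ENVO": load_labels(envo_tsv)}  # no BTO entry: transform missing

drifted = {"object_id": "ENVO:0000001", "object_label": "Soil sample"}
print(b4_violation(drifted, labels_by_prefix))
```

Keeping the three skip branches separate makes it easy to attach a distinct warning or counter to each case later.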

Next steps after merge

  1. Strict B1 rollout (the actual Codex-#558 fix): mint kgmicrobe.{ingredient,compound}:<slug_lc> registry rows for the 162 subjects currently in B1 warn-only. The reject_reason output of --strict-b1 mode is the worklist — it lists each subject and its existing non-registry exactMatch target so curators know whether to add or to upgrade an existing row. Once minting is complete, flip CI to use --strict-b1 (one-line change in .github/workflows/qc-sssom.yaml and justfile).

  2. Remove kg-microbe's compensating filters: once strict B1 is enabled in CI, kg-microbe's purge_asymmetric_pollution() and KNOWN_BAD_NARROWMATCH filter become redundant — the SSSOM consumer no longer has to defensively scrub for identity-collapsed rows because the producer guarantees they don't exist. Open a downstream kg-microbe PR to delete those guards.

  3. Optional B4 enrichment in CI: at the moment Rule B4 warn-and-skips on every CI run because kg-microbe's ontology transforms aren't checked into its repo. A future iteration could publish the canonical-label tables (or a small subset) as a CI artifact, letting B4 run as a hard gate everywhere.

References

🤖 Generated with Claude Code

…symmetric-only, canonical object_label

Extends scripts/validate_sssom_invariants.py with the four B-series
structural invariants documented in MAPPING_SEMANTICS.md (PR #3) and
specified by the Codex-#558 round-3 hardening:

- Rule B1: every MIM:<slug> subject with a skos:narrowMatch /
  skos:broadMatch row must have a sibling skos:exactMatch row whose
  object_id matches kgmicrobe.(ingredient|compound):<slug_lc>. Ships
  warn-only by default; pass --strict-b1 to convert to a hard reject.
  Staged because the current SSSOM has 162 narrowMatch subjects whose
  exactMatch row points at cas:/mesh:/NCIT instead of kgmicrobe.* —
  warn-only lets this validator land alongside the rest of Group B
  while the kg-microbe registry is being minted out.
- Rule B2: at most one row per (subject_id, object_id) pair. Catches
  double-typed rows like (MIM:X exactMatch Y) + (MIM:X narrowMatch Y).
- Rule B3: cross-row variant of B2 — for any narrowMatch/broadMatch
  pair, no skos:exactMatch may exist for the same (subject, object).
- Rule B4: canonical object_label drift. Loads
  ../kg-microbe/data/transformed/ontologies/<prefix>_nodes.tsv for
  CHEBI/FOODON/UBERON/ENVO/BTO/MICRO/PATO and rejects rows whose
  object_label matches neither the canonical name (column 3) nor any
  pipe-delimited exact synonym (column 7). Warn-and-skip when the
  transform file is absent — typical CI configuration; B4 does not
  contribute to exit-2 in that case. Skips terms with empty canonical
  name and empty synonym set (CHEBI/EBI placeholder nodes).

Rule A logic and the SSSOM I/O layer are untouched. Per-row stderr
reporting is consolidated across rules; rejects continue to flow into
mappings/needs_curator_review.tsv with reject_reason naming the rule.

Validates against current SSSOM:
- default mode: exit 0 (Rule A: 0, B1: 162 warnings, B2: 0, B3: 0,
  B4: 0 with one BTO transform-missing warning)
- --strict-b1: exit 2 (162 B1 rejects)
- just qc-sssom: exit 0
- synthetic test exercising every rule: each rule fires with the
  correct reject_reason

Once strict B1 is enabled in CI, kg-microbe's purge_asymmetric_pollution()
and KNOWN_BAD_NARROWMATCH filter become redundant and can be removed in
a downstream kg-microbe PR.

Refs: MAPPING_SEMANTICS.md §1–§3 (PR #3), Rule A scaffolding (PR #2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 4, 2026 06:34

Copilot AI left a comment

Pull request overview

Extends the stdlib-only SSSOM validator (scripts/validate_sssom_invariants.py) to enforce the Group B structural invariants (B1–B4) defined in MAPPING_SEMANTICS.md, including a staged warn-only default for B1 with an opt-in strict mode.

Changes:

  • Added validators for Rules B1–B4 (registry-row mandate, no duplicate subject/object pairs, asymmetric-only constraints, canonical object_label checks via kg-microbe transforms).
  • Added --strict-b1 flag and updated reject aggregation/reporting to include multiple rule reasons.
  • Added warn-and-skip behavior for B4 when kg-microbe transform files are not present.

f"Rule B1: missing registry row for {subject_id}. A "
f"{predicate} row requires a sibling 'skos:exactMatch' row "
f"with object_id matching kgmicrobe.(ingredient|compound):"
f"{slug}.{other_hint} Mint the registry CURIE and re-emit "
Comment on lines +425 to +433
def evaluate_rule_b3(
    rows: list[dict[str, str]],
) -> Iterator[tuple[int, dict[str, str], str]]:
    """Yield every narrow/broad row whose ``(subject_id, object_id)``
    pair also appears under ``skos:exactMatch`` in some other row.
    B2 catches this when both rows live in the file with identical
    tuples; B3 is the cross-row variant where the rows might differ in
    columns other than subject_id/object_id."""
    exact_pairs: set[tuple[str, str]] = set()
Comment on lines +70 to +71
matches neither the canonical name (column 3) nor any pipe-delimited
exact-synonym (column 7), the row is rejected.
Comment on lines +474 to +483
reader = csv.DictReader(f, delimiter="\t")
if reader.fieldnames is None:
    return out
for row in reader:
    term_id = (row.get("id") or "").strip()
    if not term_id:
        continue
    name = (row.get("name") or "").strip()
    syn_raw = (row.get("synonym") or "").strip()
    synonyms = {
Comment on lines +492 to +505
    *,
    transforms_present: dict[str, bool] | None = None,
) -> Iterator[tuple[int, dict[str, str], str]]:
    """Yield rows whose ``object_label`` doesn't match the canonical
    name or any exact synonym in the local kg-microbe ontology
    transform. Skips rows whose object_id prefix isn't in
    ``B4_PREFIXES`` and skips entirely (yields nothing) for any
    prefix whose transform file is absent — caller has already emitted
    the warn-and-skip stderr line.

    ``transforms_present`` is an optional caller-supplied dict
    populated by load attempts so the caller can decide which prefixes
    to warn about. When None, this function loads transforms lazily
    and silently skips absent ones."""
Comment on lines +674 to +677
b1_label = "B1" if args.strict_b1 else "B1(warn-only)"
rule_summary = f"Rules A, {b1_label}, B2, B3"
if "Rule B4" in rule_counts or not missing_prefixes:
    rule_summary += ", B4"