Replace custom term_validator with linkml-term-validator #2

@cmungall

Problem

The validation pipeline references a communitymech.validators.term_validator module that does not exist: src/communitymech/validators/term_validator.py is missing, yet src/communitymech/validators/__init__.py still tries to import it. As a result, the just validate-terms target fails immediately.

Meanwhile, the plan docs and README already mention linkml-term-validator as the intended tool — we should use it directly rather than writing a custom Python wrapper.

What needs to happen

1. Add linkml-term-validator as a dependency

# pyproject.toml
dependencies = [
    ...
    "linkml-term-validator>=0.1.0",
]

2. Add schema-level bindings or dynamic enums

linkml-term-validator validate-data validates against dynamic enums and bindings — LinkML features that constrain which ontology terms are valid in a given slot. The current schema uses a generic Term class with plain string id/label fields, so the validator has nothing to check against.

Options (not mutually exclusive):

Option A: Bindings on descriptor slots — bind each descriptor's term.id to a prefix-constrained enum:

classes:
  TaxonDescriptor:
    attributes:
      term:
        range: Term
        bindings:
          - binds_value_of: id
            range: NCBITaxonEnum

Option B: Dynamic enums with reachable_from — if we want to constrain to specific branches:

enums:
  NCBITaxonEnum:
    reachable_from:
      source_ontology: obo:ncbitaxon
      source_nodes:
        - NCBITaxon:2  # Bacteria
      relationship_types:
        - rdfs:subClassOf

Option C: At minimum, id_prefixes — add id_prefixes to Term subclasses to at least validate the CURIE prefix is correct (e.g. NCBITaxon for taxa, CHEBI for metabolites). This is simpler but only checks prefix, not term existence/labels.
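
To make the limitation of Option C concrete, here is a minimal Python sketch of what a prefix-only check amounts to. The function name has_expected_prefix and the TAXON_PREFIXES policy are hypothetical illustrations, not part of linkml-term-validator or the communitymech schema. Note that a wrong ID such as the NCBITaxon:69459 case described under Context still passes, because its prefix is correct.

```python
def has_expected_prefix(curie: str, allowed_prefixes: tuple[str, ...]) -> bool:
    """Return True if the CURIE has a non-empty local part and an allowed prefix."""
    prefix, sep, local_id = curie.partition(":")
    return bool(sep) and bool(local_id) and prefix in allowed_prefixes


# Hypothetical prefix policy for a taxon descriptor.
TAXON_PREFIXES = ("NCBITaxon",)

# A CHEBI ID in a taxon slot is rejected...
chebi_in_taxon_slot = has_expected_prefix("CHEBI:17234", TAXON_PREFIXES)
# ...but a well-formed NCBITaxon ID pointing at the wrong organism is accepted.
wrong_taxon_accepted = has_expected_prefix("NCBITaxon:69459", TAXON_PREFIXES)
```

This is why Option C is best treated as a floor rather than a substitute for Options A or B.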

3. Update justfile targets

Replace the broken custom module calls with direct linkml-term-validator CLI:

# Validate ontology terms in data files
validate-terms FILE:
    uv run linkml-term-validator validate-data {{FILE}} -s src/communitymech/schema/communitymech.yaml --labels

validate-terms-all:
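    #!/usr/bin/env bash
    status=0
    for file in kb/communities/*.yaml; do
        printf '\nValidating terms in %s...\n' "$file"
        # Record failures but keep going so every file gets checked
        uv run linkml-term-validator validate-data "$file" -s src/communitymech/schema/communitymech.yaml --labels || status=1
    done
    exit "$status"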

# Validate schema-level enum meanings
validate-schema-terms:
    uv run linkml-term-validator validate-schema src/communitymech/schema/communitymech.yaml

4. Fix the broken __init__.py import

Remove the TermValidator import from src/communitymech/validators/__init__.py since we're using the external tool instead of a custom module.

5. Add to QC pipeline

Update the qc target to include term validation:

qc: validate-all validate-terms-all lint test

Context

We just found 5 completely wrong NCBITaxon IDs and 3 label mismatches in EcoFAB_Ring_Trial_SynCom17.yaml through manual OLS verification. Examples:

  • NCBITaxon:69459 resolves to "Dicraspidia" (a plant), not Lysobacter
  • NCBITaxon:164543 resolves to "Opisthoteuthis massyae" (an octopus), not Marmoricola

Automated term validation would have caught all of these.
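
The kind of check that would have caught these can be sketched in Python as follows. This is an illustration only: the find_label_mismatches helper and the in-memory ontology_labels table are hypothetical stand-ins for the ontology lookup that a term validator performs, and the recorded labels are simplified to the genus names from the two examples above.

```python
def find_label_mismatches(
    terms: list[dict[str, str]], ontology_labels: dict[str, str]
) -> list[str]:
    """Report terms whose ID is unknown or whose label differs from the ontology's."""
    problems = []
    for term in terms:
        actual = ontology_labels.get(term["id"])
        if actual is None:
            problems.append(f"{term['id']}: not found")
        elif actual.casefold() != term["label"].casefold():
            problems.append(
                f"{term['id']}: expected label {actual!r}, got {term['label']!r}"
            )
    return problems


# Stand-in for an ontology lookup, using the two examples above.
ontology_labels = {
    "NCBITaxon:69459": "Dicraspidia",
    "NCBITaxon:164543": "Opisthoteuthis massyae",
}

# Terms as they were (incorrectly) recorded; labels simplified to the genus.
recorded = [
    {"id": "NCBITaxon:69459", "label": "Lysobacter"},
    {"id": "NCBITaxon:164543", "label": "Marmoricola"},
]

problems = find_label_mismatches(recorded, ontology_labels)
```

Both recorded terms end up in problems, since each ID exists but carries a label that does not match the ontology's.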
