feat: add field identity via Dublin Core ontology for cross-schema interoperability #60

@rorybyrne

Problem

Schema fields currently use raw names (title, summary, organism) with no cross-schema identity. This causes two concrete problems:

  1. Index configuration is brittle: Vector/keyword index backends configure embedding fields by raw name (fields: [title, summary, organism]). When a new convention uses different field names for the same concept (e.g. dataset_title vs title), those records don't match the index config and get skipped silently.

  2. No cross-convention querying: Search can't say "find all records with a title containing X" across conventions that name the title field differently. There's no shared vocabulary for what a field is.

The root issue: Schema fields carry a reference to a value ontology (what values are allowed for term-type fields), but no reference to a field identity (what the field itself represents).

Design

Dublin Core as a field identity ontology

Dublin Core is a widely adopted standard vocabulary for metadata field identity. It defines 55 properties (dc:title, dc:description, dc:subject, dc:creator, dc:date, etc.) plus 27 classes and some encoding schemes, 103 terms in total.

Key insight: DC can be modeled as an Ontology in OSA's existing infrastructure. It's just an ontology whose terms describe metadata fields rather than scientific concepts. Same data model, same APIs, same import pipeline.

Schema fields carry dual references

A Schema field currently has an optional ontology reference for term-type fields (value identity). Add an optional field_ref that points to a DC ontology term (field identity):

{
  "name": "dataset_title",
  "type": "text",
  "field_ref": "dc:title",
  "required": true
}

{
  "name": "tissue",
  "type": "term",
  "field_ref": "dc:subject",
  "ontology": "urn:osa:localhost:onto:uberon@1",
  "required": true
}
  • field_ref → DC ontology term (what the field IS — "this is a title")
  • ontology → value ontology for term-type fields (what values are allowed — "values from UBERON")
  • field_ref is optional: domain-specific fields that don't map to DC still work; they just won't get cross-schema identity
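
As a rough illustration of the dual references, the two example fields above could be modeled as follows. This is a minimal sketch, not OSA's actual Schema field model: the `SchemaField` class, the `parse_field` helper, and the `lab_notes` field are hypothetical, and the rule that `ontology` only applies to term-type fields is inferred from the description above.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SchemaField:
    """Hypothetical field model; attribute names mirror the JSON keys above."""
    name: str
    type: str
    required: bool = False
    field_ref: Optional[str] = None  # field identity, e.g. "dc:title"
    ontology: Optional[str] = None   # value ontology, term-type fields only


def parse_field(raw: str) -> SchemaField:
    """Parse one JSON field definition into a SchemaField."""
    field = SchemaField(**json.loads(raw))
    # Assumption: a value ontology is only meaningful for term-type fields.
    if field.ontology is not None and field.type != "term":
        raise ValueError(f"{field.name}: 'ontology' requires type 'term'")
    return field


title = parse_field(
    '{"name": "dataset_title", "type": "text", '
    '"field_ref": "dc:title", "required": true}'
)
tissue = parse_field(
    '{"name": "tissue", "type": "term", "field_ref": "dc:subject", '
    '"ontology": "urn:osa:localhost:onto:uberon@1", "required": true}'
)
# A freeform field: valid, but carries no cross-schema identity.
freeform = parse_field('{"name": "lab_notes", "type": "text"}')
```

Keeping both references optional means existing schemas without field_ref would still parse unchanged.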

Index config uses DC terms

Instead of raw field names that break across conventions:

# Before (brittle)
indexes:
  - name: vector
    config:
      embedding:
        fields: [title, summary, organism]

The index resolves fields by identity:

# After (cross-schema)
indexes:
  - name: vector
    config:
      embedding:
        fields: [dc:title, dc:description, dc:subject]

The system resolves dc:title → the actual field name in each convention's schema (e.g. title in one, dataset_title in another). Records from any convention get indexed correctly.

Schema creator UX

A schema creator UI would present:

  • Field keys: sourced from DC terms (typeahead/dropdown with dc:title, dc:creator, etc.)
  • Field values: sourced from value ontologies for term-type fields (UBERON, NCBI Taxonomy, etc.)
  • Freeform fields: allowed — just don't get a field_ref, so they won't participate in cross-schema identity

Implementation

1. Seed Dublin Core as an ontology

  • Parse the official RDF/Turtle file (103 terms, trivially small) and keep only the properties (see Out of scope)
  • Create an Ontology with SRN urn:osa:localhost:onto:dublin-core@1
  • Each DC property becomes a Term (e.g. dc:title, dc:description, dc:subject)
  • Ship as a built-in seed ontology, always available
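
The seeding step might look roughly like the sketch below. It assumes the RDF/Turtle file has already been parsed into (term_id, label, kind) tuples; the parsing itself is not shown, and the `Term`, `Ontology`, and `seed_dublin_core` names are hypothetical stand-ins for OSA's existing ontology model. The sample list is hand-written for illustration and is not the real DC term set.

```python
from dataclasses import dataclass, field

DC_ONTOLOGY_SRN = "urn:osa:localhost:onto:dublin-core@1"


@dataclass(frozen=True)
class Term:
    term_id: str
    label: str


@dataclass
class Ontology:
    srn: str
    terms: dict = field(default_factory=dict)  # term_id -> Term


def seed_dublin_core(parsed_terms):
    """Build the DC ontology from (term_id, label, kind) tuples.

    Only entries whose kind is "property" become Terms; classes and
    encoding schemes are skipped, matching the initial scope.
    """
    ontology = Ontology(srn=DC_ONTOLOGY_SRN)
    for term_id, label, kind in parsed_terms:
        if kind == "property":
            ontology.terms[term_id] = Term(term_id, label)
    return ontology


# Hand-written sample standing in for the parsed RDF/Turtle file.
sample = [
    ("dc:title", "Title", "property"),
    ("dc:description", "Description", "property"),
    ("dc:subject", "Subject", "property"),
    ("dc:Agent", "Agent", "class"),  # skipped: not a property
]
dc = seed_dublin_core(sample)
```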

2. Add field_ref to Schema fields

  • Add an optional field_ref (a string, absent by default) to the Schema field model; it references a term ID in the DC ontology
  • Validate on schema creation that field_ref points to a valid DC ontology term
  • Update spreadsheet template generation to include field identity metadata
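
A minimal sketch of the validation step, assuming schema fields arrive as plain dicts and the seeded DC term IDs are available as a set. The function names are hypothetical, and `DC_TERM_IDS` here is a small sample rather than the full property list. Several fields sharing one field_ref are allowed in this sketch, since nothing above forbids it.

```python
def invalid_field_refs(fields, dc_term_ids):
    """Return (field name, field_ref) pairs whose ref is not a known DC term."""
    return [
        (f["name"], f["field_ref"])
        for f in fields
        if f.get("field_ref") is not None and f["field_ref"] not in dc_term_ids
    ]


def validate_schema_fields(fields, dc_term_ids):
    """Raise ValueError listing every unknown field_ref; no-op if all are valid."""
    bad = invalid_field_refs(fields, dc_term_ids)
    if bad:
        details = ", ".join(f"{name} -> {ref}" for name, ref in bad)
        raise ValueError(f"unknown field_ref(s): {details}")


# Sample of seeded term IDs (illustrative subset only).
DC_TERM_IDS = {"dc:title", "dc:description", "dc:subject"}

good_schema = [
    {"name": "dataset_title", "type": "text", "field_ref": "dc:title"},
    {"name": "lab_notes", "type": "text"},  # freeform: nothing to validate
]
validate_schema_fields(good_schema, DC_TERM_IDS)  # passes silently

typo_schema = [{"name": "dataset_title", "type": "text", "field_ref": "dc:titel"}]
problems = invalid_field_refs(typo_schema, DC_TERM_IDS)
```

Reporting all invalid references at once, rather than failing on the first, is a design choice of the sketch so that a schema author can fix every typo in one pass.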

3. Update index resolution

  • Index config accepts DC terms in fields list
  • At indexing time, resolve DC terms → actual field names via the record's convention → schema → field mappings
  • Fallback: raw field names still work for backward compatibility
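
The resolution rules above can be sketched as follows. This assumes schema fields are plain dicts and that one DC term may map to more than one field in a schema (an assumption; the issue only shows one-to-one examples). `build_field_map` and `resolve_index_fields` are hypothetical names, and the `summary` and `organism` field definitions are invented to match the "before" index config.

```python
def build_field_map(schema_fields):
    """Map each field_ref to the field names that carry it in this schema."""
    mapping = {}
    for f in schema_fields:
        ref = f.get("field_ref")
        if ref is not None:
            mapping.setdefault(ref, []).append(f["name"])
    return mapping


def resolve_index_fields(configured, schema_fields):
    """Resolve index config entries to actual field names for one schema.

    DC terms resolve through field_ref; entries that match a raw field
    name are kept as-is (backward compatibility); anything else is skipped.
    """
    field_map = build_field_map(schema_fields)
    raw_names = {f["name"] for f in schema_fields}
    resolved = []
    for entry in configured:
        candidates = field_map.get(entry, [entry] if entry in raw_names else [])
        for name in candidates:
            if name not in resolved:  # avoid duplicates, keep config order
                resolved.append(name)
    return resolved


schema_a = [
    {"name": "title", "type": "text", "field_ref": "dc:title"},
    {"name": "summary", "type": "text", "field_ref": "dc:description"},
    {"name": "organism", "type": "term", "field_ref": "dc:subject"},
]
schema_b = [
    {"name": "dataset_title", "type": "text", "field_ref": "dc:title"},
    {"name": "tissue", "type": "term", "field_ref": "dc:subject"},
]

dc_config = ["dc:title", "dc:description", "dc:subject"]
legacy_config = ["title", "summary", "organism"]

fields_a = resolve_index_fields(dc_config, schema_a)
fields_b = resolve_index_fields(dc_config, schema_b)
legacy_a = resolve_index_fields(legacy_config, schema_a)
legacy_b = resolve_index_fields(legacy_config, schema_b)
```

With the DC-based config both schemas resolve to their own field names, while the legacy raw-name config still works for schema_a and matches nothing in schema_b, which is the brittleness described in the Problem section.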

4. Convention-aware field mapping

  • Convention already references a Schema
  • Schema fields with field_ref create a mapping: dc:title → "dataset_title" for that convention
  • Index backends use this mapping when ingesting records
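
To show how an index backend might use the per-convention mapping at ingest time, here is a small sketch. The record shape (`convention` and `data` keys), the convention IDs, the `abstract` field name, and the sample values are all hypothetical; the mapping would in practice be derived from each convention's schema rather than written by hand.

```python
# Per-convention mapping: DC term -> field names in that convention's schema.
# Hand-written here; in practice derived from Schema fields with field_ref.
CONVENTION_FIELD_MAPS = {
    "conv-a": {"dc:title": ["title"], "dc:description": ["summary"]},
    "conv-b": {"dc:title": ["dataset_title"], "dc:description": ["abstract"]},
}


def embedding_text(record, dc_terms):
    """Collect the values to embed for a record, in index-config order."""
    field_map = CONVENTION_FIELD_MAPS[record["convention"]]
    parts = []
    for term in dc_terms:
        for name in field_map.get(term, []):
            value = record["data"].get(name)
            if value:  # skip missing or empty values
                parts.append(str(value))
    return "\n".join(parts)


record_a = {
    "convention": "conv-a",
    "data": {"title": "Mouse liver atlas", "summary": "Single-cell RNA-seq"},
}
record_b = {
    "convention": "conv-b",
    "data": {"dataset_title": "Mouse liver atlas", "abstract": "Single-cell RNA-seq"},
}

text_a = embedding_text(record_a, ["dc:title", "dc:description"])
text_b = embedding_text(record_b, ["dc:title", "dc:description"])
```

Both records produce the same embedding input despite using different field names, which is the behavior the acceptance criteria ask for.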

Out of scope

  • Full ontology import pipeline (UBERON, NCBI Taxonomy via OWL/OBO) — separate issue
  • Schema creator UI — separate frontend issue
  • Migrating existing conventions to use field_ref (can be done incrementally)
  • DC classes/datatypes/encoding schemes (only properties needed initially)

Dependencies

Acceptance criteria

  • Dublin Core ontology seeded with all 55 properties as terms
  • Schema fields accept optional field_ref pointing to a DC term
  • field_ref validated against DC ontology on schema creation
  • Index config can use DC terms (e.g. dc:title) instead of raw field names
  • Index resolution maps DC terms → actual field names via convention's schema
  • Records from different conventions with different field names but same field_ref are indexed identically
  • Raw field names in index config still work (backward compatible)

Labels: design-needed (Needs architectural discussion before implementation), feature (New functionality)