Problem
Schema fields currently use raw names (title, summary, organism) with no cross-schema identity. This causes two concrete problems:
- Index configuration is brittle: Vector/keyword index backends configure embedding fields by raw name (fields: [title, summary, organism]). When a new convention uses different field names for the same concept (e.g. dataset_title vs title), those records don't match the index config and get skipped silently.
- No cross-convention querying: Search can't say "find all records with a title containing X" across conventions that name the title field differently. There's no shared vocabulary for what a field is.
The root issue: Schema fields carry a reference to a value ontology (what values are allowed for term-type fields), but no reference to a field identity (what the field itself represents).
Design
Dublin Core as a field identity ontology
Dublin Core (DC) is a widely adopted standard vocabulary for metadata field identity. It defines 55 properties (dc:title, dc:description, dc:subject, dc:creator, dc:date, etc.) plus 27 classes and some encoding schemes, 103 terms in total.
Key insight: DC can be modeled as an Ontology in OSA's existing infrastructure. It's just an ontology whose terms describe metadata fields rather than scientific concepts. Same data model, same APIs, same import pipeline.
Schema fields carry dual references
A Schema field currently has an optional ontology reference for term-type fields (value identity). Add an optional field_ref that points to a DC ontology term (field identity):
    {
      "name": "dataset_title",
      "type": "text",
      "field_ref": "dc:title",
      "required": true
    }

    {
      "name": "tissue",
      "type": "term",
      "field_ref": "dc:subject",
      "ontology": "urn:osa:localhost:onto:uberon@1",
      "required": true
    }
- field_ref → DC ontology term (what the field IS: "this is a title")
- ontology → value ontology for term-type fields (what values are allowed: "values from UBERON")
- field_ref is optional: domain-specific fields that don't map to DC still work, but they won't get cross-schema identity
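A minimal sketch of how the dual references might look on a field model. The `SchemaField` class is hypothetical (OSA's actual model is not shown here); its attribute names simply mirror the JSON keys above, and the check that `ontology` only appears on term-type fields is an assumption drawn from the wording of this proposal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SchemaField:
    """Hypothetical field model; attribute names mirror the JSON keys above."""

    name: str
    type: str
    required: bool = False
    field_ref: Optional[str] = None  # field identity, e.g. "dc:title"
    ontology: Optional[str] = None   # value ontology for term-type fields

    def __post_init__(self) -> None:
        # Assumption: a value ontology is only meaningful on term-type fields.
        if self.ontology is not None and self.type != "term":
            raise ValueError(f"{self.name}: 'ontology' requires type 'term'")


title = SchemaField(
    name="dataset_title", type="text", field_ref="dc:title", required=True
)
tissue = SchemaField(
    name="tissue",
    type="term",
    field_ref="dc:subject",
    ontology="urn:osa:localhost:onto:uberon@1",
    required=True,
)
# A domain-specific field without field_ref is still valid; it just has
# no cross-schema identity.
lab_notes = SchemaField(name="lab_notes", type="text")
```

Keeping both references optional means existing schemas would load unchanged under this sketch.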
Index config uses DC terms
Instead of raw field names that break across conventions:
    # Before (brittle)
    indexes:
      - name: vector
        config:
          embedding:
            fields: [title, summary, organism]
The index resolves fields by identity:
    # After (cross-schema)
    indexes:
      - name: vector
        config:
          embedding:
            fields: [dc:title, dc:description, dc:subject]
The system resolves dc:title → the actual field name in each convention's schema (e.g. title in one, dataset_title in another). Records from any convention get indexed correctly.
Schema creator UX
A schema creator UI would present:
- Field keys: sourced from DC terms (typeahead/dropdown with dc:title, dc:creator, etc.)
- Field values: sourced from value ontologies for term-type fields (UBERON, NCBI Taxonomy, etc.)
- Freeform fields: allowed; they just don't get a field_ref, so they won't participate in cross-schema identity
Implementation
1. Seed Dublin Core as an ontology
- Parse the official RDF/Turtle file (103 terms, trivially small)
- Create an Ontology with SRN urn:osa:localhost:onto:dublin-core@1
- Each DC property becomes a Term (e.g. dc:title, dc:description, dc:subject)
- Ship as a built-in seed ontology, always available
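The seeding step could be sketched roughly as follows. This assumes an upstream RDF/Turtle parser has already produced `(term_id, kind, label)` tuples; that tuple shape, the `Term` and `Ontology` classes, and `seed_dublin_core` are all hypothetical stand-ins rather than OSA's real API. Only properties are kept, in line with the initial scope of this proposal.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Tuple

DC_SRN = "urn:osa:localhost:onto:dublin-core@1"


@dataclass(frozen=True)
class Term:
    id: str
    label: str


@dataclass
class Ontology:
    srn: str
    terms: Dict[str, Term] = field(default_factory=dict)


def seed_dublin_core(parsed_terms: Iterable[Tuple[str, str, str]]) -> Ontology:
    """Build the DC ontology from (term_id, kind, label) tuples.

    The tuples stand in for the output of an RDF/Turtle parse, which is
    not shown here.
    """
    onto = Ontology(srn=DC_SRN)
    for term_id, kind, label in parsed_terms:
        if kind != "property":
            continue  # classes and encoding schemes are skipped initially
        onto.terms[term_id] = Term(id=term_id, label=label)
    return onto


# Illustrative subset only; the real seed would cover every DC property.
SAMPLE = [
    ("dc:title", "property", "Title"),
    ("dc:description", "property", "Description"),
    ("dc:subject", "property", "Subject"),
    ("dc:Agent", "class", "Agent"),
]
dc = seed_dublin_core(SAMPLE)
```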
2. Add field_ref to Schema fields
- Add optional field_ref: str to the Schema field model (references a term ID in the DC ontology)
- Validate on schema creation that field_ref points to a valid DC ontology term
- Update spreadsheet template generation to include field identity metadata
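The validation in this step might look something like the sketch below. `validate_field_refs` is a hypothetical helper, fields are represented as plain dicts for brevity, and returning a list of error strings (rather than raising) is an assumption about how schema creation reports problems.

```python
from typing import Iterable, List, Mapping, Set


def validate_field_refs(
    fields: Iterable[Mapping[str, object]], dc_term_ids: Set[str]
) -> List[str]:
    """Return one error message per field whose field_ref is not a DC term."""
    errors: List[str] = []
    for f in fields:
        ref = f.get("field_ref")
        if ref is None:
            continue  # field_ref is optional, so its absence is not an error
        if ref not in dc_term_ids:
            errors.append(f"{f['name']}: unknown field_ref {ref!r}")
    return errors


DC_TERMS = {"dc:title", "dc:description", "dc:subject"}
schema_fields = [
    {"name": "dataset_title", "type": "text", "field_ref": "dc:title"},
    {"name": "species", "type": "term", "field_ref": "dc:organism"},
    {"name": "lab_notes", "type": "text"},
]
errors = validate_field_refs(schema_fields, DC_TERMS)
```

Here `dc:organism` is deliberately not a DC term, so only the `species` field is reported.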
3. Update index resolution
- Index config accepts DC terms in fields list
- At indexing time, resolve DC terms → actual field names via the record's convention → schema → field mappings
- Fallback: raw field names still work for backward compatibility
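A rough sketch of the resolution step with the raw-name fallback. `resolve_index_fields` is hypothetical. Two choices here are assumptions not stated in the proposal: each DC term maps to a list of field names (in case several fields in one schema share a term), and unresolved entries are returned so the caller can log them rather than skip them silently.

```python
from typing import Iterable, List, Mapping, Set, Tuple


def resolve_index_fields(
    config_fields: Iterable[str],
    ref_to_names: Mapping[str, List[str]],
    schema_field_names: Set[str],
) -> Tuple[List[str], List[str]]:
    """Resolve index config entries to field names for one convention's schema.

    Returns (resolved, unresolved).
    """
    resolved: List[str] = []
    unresolved: List[str] = []
    for entry in config_fields:
        if entry in ref_to_names:
            # DC term that this schema maps to one or more fields
            resolved.extend(ref_to_names[entry])
        elif entry in schema_field_names:
            # Backward-compatible raw field name
            resolved.append(entry)
        else:
            unresolved.append(entry)
    return resolved, unresolved


ref_to_names = {"dc:title": ["dataset_title"], "dc:subject": ["tissue", "organism"]}
field_names = {"dataset_title", "tissue", "organism", "summary"}
resolved, unresolved = resolve_index_fields(
    ["dc:title", "dc:description", "dc:subject", "summary"],
    ref_to_names,
    field_names,
)
```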
4. Convention-aware field mapping
- Convention already references a Schema
- Schema fields with field_ref create a mapping: dc:title → "dataset_title" for that convention
- Index backends use this mapping when ingesting records
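To illustrate, the per-convention mapping could be derived from the schema roughly as follows. `build_field_mapping` and `values_for` are hypothetical names, the two schemas are made-up examples, and mapping each DC term to a list of field names is an assumption that allows for more than one field sharing a term.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def build_field_mapping(
    schema_fields: Iterable[Mapping[str, str]]
) -> Dict[str, List[str]]:
    """Map each DC term to the schema field names that reference it."""
    mapping: Dict[str, List[str]] = defaultdict(list)
    for f in schema_fields:
        ref = f.get("field_ref")
        if ref is not None:  # freeform fields carry no field_ref
            mapping[ref].append(f["name"])
    return dict(mapping)


def values_for(
    record: Mapping[str, str], mapping: Mapping[str, List[str]], dc_term: str
) -> List[str]:
    """Collect a record's values for one DC term via its convention's mapping."""
    return [record[name] for name in mapping.get(dc_term, []) if name in record]


SCHEMA_A = [
    {"name": "title", "type": "text", "field_ref": "dc:title"},
    {"name": "summary", "type": "text", "field_ref": "dc:description"},
]
SCHEMA_B = [
    {"name": "dataset_title", "type": "text", "field_ref": "dc:title"},
    {"name": "tissue", "type": "term", "field_ref": "dc:subject"},
    {"name": "lab_notes", "type": "text"},
]

mapping_a = build_field_mapping(SCHEMA_A)
mapping_b = build_field_mapping(SCHEMA_B)

# Records from both conventions yield their title through the same DC term.
title_a = values_for({"title": "Liver atlas"}, mapping_a, "dc:title")
title_b = values_for({"dataset_title": "Liver atlas"}, mapping_b, "dc:title")
```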
Out of scope
- Full ontology import pipeline (UBERON, NCBI Taxonomy via OWL/OBO) — separate issue
- Schema creator UI — separate frontend issue
- Migrating existing conventions to use field_ref (can be done incrementally)
- DC classes/datatypes/encoding schemes (only properties needed initially)
Dependencies
Acceptance criteria
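- Schema fields accept an optional field_ref pointing to a DC term
- field_ref validated against DC ontology on schema creation
- Index config accepts DC terms (e.g. dc:title) instead of raw field names
- Records from different conventions with the same field_ref are indexed identically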