261 changes: 261 additions & 0 deletions oeps/OEP-0005.md
@@ -0,0 +1,261 @@
---
oep: 0005
title: Resource Identifiers
type: technical
status: ideation
authors: Rory Byrne <rory@rory.bio>
created: 2025-12-07
labels: protocol, identifiers
---

# Abstract

This OEP explores identifier schemes for resources in the Open Science Archive protocol. Rather than prescribing a specific format, it establishes the requirements that any identifier scheme must satisfy, surveys existing approaches, and proposes one candidate (Structured Resource Names) for community feedback.

# Motivation

Every resource in OSA—Records, Depositions, Vocabularies, Schemas, Validators—needs an identifier. The choice of identifier scheme has far-reaching consequences for the protocol's usability, longevity, and interoperability.

Scientific data archives present unique challenges:

- **Longevity**: Identifiers may be cited in papers for decades
- **Federation**: Multiple independent nodes must avoid collisions
- **Machine use**: Software needs to parse, route, and validate identifiers
- **Human use**: Developers and researchers need to debug and discuss identifiers

Getting this wrong is costly. Changing identifier schemes after deployment breaks existing references.

# Requirements

Any identifier scheme for OSA should satisfy the following properties:

## Must Have

**Globally unique**: Two resources must never share an identifier, even across independent nodes operated by different organizations.

**Resolvable**: Given an identifier, there must be a defined mechanism to retrieve the resource or its metadata.

**Stable**: Once assigned, an identifier must continue to refer to the same resource. Identifiers should not be reassigned or recycled.

## Should Have

**Human-readable**: Developers should be able to understand what an identifier refers to without dereferencing it. At minimum, identifiers should be pronounceable and not excessively long.

**Type-aware**: The identifier should indicate what kind of resource it refers to (Record, Vocabulary, etc.), enabling validation and routing without network calls.

**Version-aware**: For versioned resources, the identifier should support pinning to a specific version.

**Decentralized minting**: Nodes should be able to create identifiers without coordinating with a central authority.

## Nice to Have

**Persistent across migrations**: If an organization changes its domain or infrastructure, existing identifiers should remain valid.

**Content-addressable**: Identifiers could be derived from content hashes, enabling integrity verification and deduplication.

**Compatible with existing standards**: Alignment with URN, DID, DOI, or other established schemes reduces implementation burden and improves interoperability.
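As a rough illustration of the content-addressable property, the sketch below derives an identifier from a SHA-256 digest and re-checks it later. The `sha256-` prefix and the function names `content_id` and `verify` are assumptions made for this example; they are not part of any OSA specification and this is not the CID encoding used by IPFS.

```python
import hashlib


def content_id(payload: bytes) -> str:
    """Return a hypothetical content-derived identifier for the payload."""
    digest = hashlib.sha256(payload).hexdigest()
    return f"sha256-{digest}"


def verify(payload: bytes, identifier: str) -> bool:
    """Check that the payload still hashes to the identifier it was filed under."""
    return content_id(payload) == identifier


record = b'{"title": "RNA-seq run 42"}'
rid = content_id(record)
print(rid)
print(verify(record, rid))         # unchanged content verifies
print(verify(record + b" ", rid))  # any change produces a different identifier
```

Two nodes holding byte-identical content would compute the same identifier, which is what makes deduplication possible, and also why such identifiers cannot be minted before the content exists.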

# Existing Approaches

## URLs

```
https://archive.example.org/records/abc123
```

**Pros**: Universal, familiar, directly resolvable, existing tooling.

**Cons**: Conflates identity with location. When domains change, URLs break. No built-in versioning or typing.

**Used by**: Most web APIs, many data repositories.

## DOIs (Digital Object Identifiers)

```
doi:10.1234/abc.5678
```

**Pros**: Designed for persistence, widely adopted in academia, resolver infrastructure exists (doi.org), citable in papers.

**Cons**: Opaque (no type or origin information), requires registration with a DOI registration agency (cost, bureaucracy), resolution depends on the centrally governed Handle System.

**Used by**: Academic publishing, Zenodo, Figshare, DataCite.

## URNs (Uniform Resource Names)

```
urn:isbn:978-3-16-148410-0
urn:ietf:rfc:3986
```

**Pros**: IETF standard (RFC 8141), separates naming from resolution, extensible namespace system.

**Cons**: No universal resolution mechanism (each namespace defines its own), requires IANA registration for formal namespaces.

**Used by**: ISBN, IETF RFCs, various domain-specific schemes.

## DIDs (Decentralized Identifiers)

```
did:web:example.org
did:plc:abc123xyz
```

**Pros**: W3C standard, designed for decentralization, supports cryptographic verification, multiple "methods" for different tradeoffs.

**Cons**: Used in practice mainly for entities (people, organizations) rather than data resources, verbose, emerging ecosystem.

**Used by**: AT Protocol (Bluesky), identity wallets, Verifiable Credentials.
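To make the "methods" idea concrete, here is a minimal sketch of how a `did:web` identifier maps to the URL of its DID document, following the mapping described in the did:web method specification as we understand it. The function name `did_web_to_url` is our own, and the sketch only computes the URL; it does not fetch or validate the document.

```python
from urllib.parse import unquote


def did_web_to_url(did: str) -> str:
    """Map a did:web identifier to the HTTPS URL of its DID document."""
    prefix = "did:web:"
    if not did.startswith(prefix):
        raise ValueError(f"not a did:web identifier: {did!r}")
    parts = did[len(prefix):].split(":")
    if not all(parts):
        raise ValueError(f"empty segment in {did!r}")
    host = unquote(parts[0])  # a port is written as %3A inside the DID
    path = parts[1:]
    if path:
        # Extra colon-separated segments become URL path segments.
        return f"https://{host}/{'/'.join(path)}/did.json"
    # A bare domain resolves under /.well-known/.
    return f"https://{host}/.well-known/did.json"


print(did_web_to_url("did:web:example.org"))
print(did_web_to_url("did:web:example.org:nodes:a"))
```

Because resolution goes through the domain name, `did:web` inherits the same fragility under domain changes as plain URLs.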

## ARNs (Amazon Resource Names)

```
arn:aws:dynamodb:us-east-1:123456789012:table/Books
```

**Pros**: Proven at scale, encodes region/account/service/resource hierarchy, enables policy-based access control.

**Cons**: AWS-specific, complex syntax, assumes single operator (Amazon).

**Used by**: All AWS services.

## UUIDs

```
550e8400-e29b-41d4-a716-446655440000
```

**Pros**: Trivial to generate, collisions are vanishingly unlikely for random (v4) UUIDs, no coordination required.

**Cons**: Opaque, no context about resource type or origin, not human-friendly, not directly resolvable.

**Used by**: Databases, internal systems, anywhere uniqueness matters more than readability.
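Version 4 UUIDs are random, so their uniqueness is probabilistic rather than absolute. The sketch below generates one and estimates the chance of at least one collision among `n` identifiers using the standard birthday-bound approximation over the 122 random bits of a v4 UUID. The helper name `collision_probability` is ours.

```python
import math
import uuid

local_id = uuid.uuid4()
print(local_id, local_id.version)


def collision_probability(n: int, random_bits: int = 122) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -n * (n - 1) / 2 ** (random_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


# Even a billion identifiers leave the collision chance negligible.
print(collision_probability(10**9))
```

This is why nodes can mint UUIDs independently, though it does nothing for readability, typing, or resolution.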

## Content Identifiers (CIDs)

```
bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oca...
```

**Pros**: Derived from content hash, self-verifying, enables deduplication, immutable by design.

**Cons**: Long, not human-readable, the content must exist before the identifier can be generated, and any content change produces a new identifier.

**Used by**: IPFS, Filecoin, content-addressed storage systems.

# Analysis

| Scheme | Unique | Resolvable | Stable | Readable | Typed | Versioned | Decentralized |
|--------|--------|------------|--------|----------|-------|-----------|---------------|
| URLs | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ |
| DOIs | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
| URNs | ✓ | ◐ | ✓ | ◐ | ◐ | ◐ | ◐ |
| DIDs | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ |
| ARNs | ✓ | ✓ | ✓ | ◐ | ✓ | ✗ | ✗ |
| UUIDs | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ |
| CIDs | ✓ | ◐ | ✓ | ✗ | ✗ | ✗ | ✓ |

✓ = yes, ✗ = no, ◐ = partial/depends

No existing scheme fully satisfies our requirements. This suggests either:

1. Extending an existing scheme (likely URN or DID)
2. Defining a new scheme purpose-built for OSA

# Candidate: Structured Resource Names (SRNs)

One option is to define a URN-based scheme that embeds the properties we need. We propose this as a starting point for discussion, not as a final specification.

## Format

```
urn:osa:{node-id}:{type}:{local-id}[@{version}][#{fragment}]
```

## Components

**`urn:osa:`** — Fixed prefix indicating an OSA identifier.

**`{node-id}`** — The originating node. Options:
- DNS hostname (e.g., `data.imperial.ac.uk`) — simple, enables direct resolution, but breaks if domain changes
- DID (e.g., `did:web:data.imperial.ac.uk`) — more persistent, adds complexity
- Opaque ID with registry lookup — most persistent, requires central infrastructure

**`{type}`** — Resource type: `rec`, `dep`, `vocab`, `schema`, `val`, `tool`.

**`{local-id}`** — Node-assigned identifier, opaque to clients.

**`@{version}`** — Optional version suffix for immutable snapshots.

**`#{fragment}`** — Optional fragment for sub-resources (e.g., vocabulary attributes).

## Examples

```
urn:osa:data.imperial.ac.uk:rec:xyz789@v1
urn:osa:archive.embl.org:vocab:rnaseq@v2.1#mapped-reads-percent
urn:osa:did:web:data.imperial.ac.uk:dep:abc123
```
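The format above can be sketched as a small parser. This is only an illustration under stated assumptions: the node-id is a DNS hostname, and the character sets allowed for `local-id`, `version`, and `fragment` are guesses, since this OEP does not define them. The names `SRN`, `SRN_PATTERN`, and `parse_srn` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Type codes listed under Components.
KNOWN_TYPES = {"rec", "dep", "vocab", "schema", "val", "tool"}

# Assumed grammar: hostname node-id, no colons in any later field.
SRN_PATTERN = re.compile(
    r"urn:osa:"
    r"(?P<node>[A-Za-z0-9.-]+):"
    r"(?P<type>[a-z]+):"
    r"(?P<local>[A-Za-z0-9._-]+)"
    r"(?:@(?P<version>[A-Za-z0-9._-]+))?"
    r"(?:#(?P<fragment>[A-Za-z0-9._-]+))?"
)


@dataclass(frozen=True)
class SRN:
    node_id: str
    type: str
    local_id: str
    version: Optional[str] = None
    fragment: Optional[str] = None


def parse_srn(text: str) -> SRN:
    """Parse an SRN string, checking syntax and type without any network call."""
    match = SRN_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid SRN: {text!r}")
    if match["type"] not in KNOWN_TYPES:
        raise ValueError(f"unknown resource type: {match['type']!r}")
    return SRN(match["node"], match["type"], match["local"],
               match["version"], match["fragment"])


print(parse_srn("urn:osa:data.imperial.ac.uk:rec:xyz789@v1"))
print(parse_srn("urn:osa:archive.embl.org:vocab:rnaseq@v2.1#mapped-reads-percent"))
```

Under this hostname-only grammar the third example, which embeds a DID as the node-id, is rejected because the DID itself contains colons.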

## Open Questions

**Node identity**: Should node-id be a DNS hostname, a DID, or something else? DNS is simple but fragile. DIDs add persistence but complexity.

**DID integration**: If nodes have DIDs (via `did:web` or similar), should the SRN embed the full DID or just the hostname with an implied DID?

**Registration**: Should `urn:osa` be registered with IANA? Registration adds legitimacy but also bureaucracy.

**Versioning syntax**: Is `@v1` the right format? Alternatives: `/v1`, `?version=1`, separate field.

**Migration**: How should identifiers survive domain changes? Options include redirect protocols, DID-based persistence, or accepting breakage as rare.
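On the DID integration question, one practical wrinkle is that a DID contains colons, so an SRN embedding a full DID no longer has a fixed number of colon-separated fields. The sketch below shows one possible workaround, reading the last two fields from the right. It assumes that `type`, `local-id`, `version`, and `fragment` never contain a colon, which this OEP does not currently guarantee; `split_node_id` is a hypothetical helper.

```python
def split_node_id(srn: str) -> tuple[str, str, str]:
    """Split an SRN into (node_id, type, rest), reading fields from the right."""
    prefix = "urn:osa:"
    if not srn.startswith(prefix):
        raise ValueError(f"not an SRN: {srn!r}")
    parts = srn[len(prefix):].rsplit(":", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"malformed SRN: {srn!r}")
    node_id, type_, rest = parts
    return node_id, type_, rest


hostname_form = "urn:osa:data.imperial.ac.uk:rec:xyz789@v1"
did_form = "urn:osa:did:web:data.imperial.ac.uk:dep:abc123"

# A naive left-to-right split yields different field counts for the two forms.
print(len(hostname_form.split(":")), len(did_form.split(":")))

print(split_node_id(hostname_form))
print(split_node_id(did_form))
```

Embedding only the hostname and implying the DID would avoid the ambiguity entirely, at the cost of tying the identifier to one DID method.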

# Alternative: DID-Native Approach

Rather than inventing SRNs, we could use DIDs directly:

```
did:osa:data.imperial.ac.uk:rec:xyz789
```

This would require defining a `did:osa` method specifying:
- Identifier format
- Resolution process
- CRUD operations on DID Documents

**Pros**: Aligns with W3C standard, potential interop with Verifiable Credentials, existing DID tooling.

**Cons**: DIDs are used mainly for entities rather than data resources, so this would be unconventional usage with more complex resolution.

# Alternative: Minimal Approach

Use simple URLs with conventions:

```
https://data.imperial.ac.uk/osa/records/xyz789/v1
```

Rely on HTTP redirects for persistence. Accept that URLs may break.

**Pros**: Simplest to implement, no new concepts, universal tooling.

**Cons**: Fragile, no type information, conflates identity with location.
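The URL convention above could be encoded as a pair of helpers like the following. Only the `/osa/records/{id}/{version}` shape comes from the example URL; the other collection names in `COLLECTIONS` and the function names `to_url` and `from_url` are assumptions made for illustration.

```python
from typing import Optional
from urllib.parse import urlsplit

# "records" comes from the example URL; the other segments are hypothetical.
COLLECTIONS = {"rec": "records", "dep": "depositions", "vocab": "vocabularies"}
TYPES = {segment: code for code, segment in COLLECTIONS.items()}


def to_url(host: str, type_: str, local_id: str, version: Optional[str] = None) -> str:
    """Build a conventional URL for a resource on the given host."""
    url = f"https://{host}/osa/{COLLECTIONS[type_]}/{local_id}"
    return f"{url}/{version}" if version else url


def from_url(url: str) -> tuple[str, str, str, Optional[str]]:
    """Recover (host, type, local_id, version) from a conventional URL."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    if len(segments) not in (3, 4) or segments[0] != "osa" or segments[1] not in TYPES:
        raise ValueError(f"URL does not follow the convention: {url!r}")
    version = segments[3] if len(segments) == 4 else None
    return parts.hostname or "", TYPES[segments[1]], segments[2], version


url = to_url("data.imperial.ac.uk", "rec", "xyz789", "v1")
print(url)
print(from_url(url))
```

Any type information recovered this way is only as reliable as each node's adherence to the path convention, and the host remains baked into the identifier.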

# Next Steps

This OEP seeks feedback on:

1. **Requirements**: Are the requirements complete and correctly prioritized?
2. **Existing schemes**: Are there schemes we should consider that aren't listed?
3. **SRN proposal**: Is this a reasonable starting point, or should we pursue a different direction?
4. **Node identity**: What should node-id be? DNS hostname, DID, or hybrid?
5. **Migration**: How important is surviving domain changes? What tradeoffs are acceptable?

Based on community input, a follow-up OEP will specify the chosen scheme in detail.

# References

- [RFC 8141: URN Syntax](https://www.rfc-editor.org/rfc/rfc8141)
- [W3C DID Core](https://www.w3.org/TR/did-core/)
- [DOI Handbook](https://www.doi.org/doi_handbook/)
- [AT Protocol Identity](https://atproto.com/guides/identity)
- [Amazon ARN Reference](https://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html)