opensciencearchive · rorybyrne · Dec 10, 2025
diff --git a/oeps/OEP-0005.md b/oeps/OEP-0005.md
@@ -0,0 +1,261 @@
+---
+oep: 0005
+title: Resource Identifiers
+type: technical
+status: ideation
+authors: Rory Byrne <rory@rory.bio>
+created: 2025-12-07
+labels: protocol, identifiers
+---
+
+# Abstract
+
+This OEP explores identifier schemes for resources in the Open Science Archive protocol. Rather than prescribing a specific format, it establishes the requirements that any identifier scheme must satisfy, surveys existing approaches, and proposes one candidate (Structured Resource Names) for community feedback.
+
+# Motivation
+
+Every resource in OSA—Records, Depositions, Vocabularies, Schemas, Validators—needs an identifier. The choice of identifier scheme has far-reaching consequences for the protocol's usability, longevity, and interoperability.
+
+Scientific data archives present unique challenges:
+
+- **Longevity**: Identifiers may be cited in papers for decades
+- **Federation**: Multiple independent nodes must avoid collisions
+- **Machine use**: Software needs to parse, route, and validate identifiers
+- **Human use**: Developers and researchers need to debug and discuss identifiers
+
+Getting this wrong is costly. Changing identifier schemes after deployment breaks existing references.
+
+# Requirements
+
+Any identifier scheme for OSA should satisfy the following properties:
+
+## Must Have
+
+**Globally unique**: Two resources must never share an identifier, even across independent nodes operated by different organizations.
+
+**Resolvable**: Given an identifier, there must be a defined mechanism to retrieve the resource or its metadata.
+
+**Stable**: Once assigned, an identifier must continue to refer to the same resource. Identifiers should not be reassigned or recycled.
+
+## Should Have
+
+**Human-readable**: Developers should be able to understand what an identifier refers to without dereferencing it. At minimum, identifiers should be pronounceable and not excessively long.
+
+**Type-aware**: The identifier should indicate what kind of resource it refers to (Record, Vocabulary, etc.), enabling validation and routing without network calls.
+
+**Version-aware**: For versioned resources, the identifier should support pinning to a specific version.
+
+**Decentralized minting**: Nodes should be able to create identifiers without coordinating with a central authority.
+
+## Nice to Have
+
+**Persistent across migrations**: If an organization changes its domain or infrastructure, existing identifiers should remain valid.
+
+**Content-addressable**: Identifiers could be derived from content hashes, enabling integrity verification and deduplication.
+
+**Compatible with existing standards**: Alignment with URN, DID, DOI, or other established schemes reduces implementation burden and improves interoperability.
+
+# Existing Approaches
+
+## URLs
+
+```
+https://archive.example.org/records/abc123
+```
+
+**Pros**: Universal, familiar, directly resolvable, existing tooling.
+
+**Cons**: Conflates identity with location. When domains change, URLs break. No built-in versioning or typing.
+
+**Used by**: Most web APIs, many data repositories.
+
+## DOIs (Digital Object Identifiers)
+
+```
+doi:10.1234/abc.5678
+```
+
+**Pros**: Designed for persistence, widely adopted in academia, resolver infrastructure exists (doi.org), citable in papers.
+
+**Cons**: Opaque (no type or origin information), requires registration with a DOI agency (cost, bureaucracy), resolution depends on the Handle System (centralized).
+
+**Used by**: Academic publishing, Zenodo, Figshare, DataCite.
+
+## URNs (Uniform Resource Names)
+
+```
+urn:isbn:978-3-16-148410-0
+urn:ietf:rfc:3986
+```
+
+**Pros**: IETF standard (RFC 8141), separates naming from resolution, extensible namespace system.
+
+**Cons**: No universal resolution mechanism (each namespace defines its own), requires IANA registration for formal namespaces.
+
+**Used by**: ISBN, IETF RFCs, various domain-specific schemes.
+
+## DIDs (Decentralized Identifiers)
+
+```
+did:web:example.org
+did:plc:abc123xyz
+```
+
+**Pros**: W3C standard, designed for decentralization, supports cryptographic verification, multiple "methods" for different tradeoffs.
+
+**Cons**: Mostly used for entities (people, organizations) rather than data resources, verbose, emerging ecosystem.
+
+**Used by**: AT Protocol (Bluesky), identity wallets, Verifiable Credentials.
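+
+To make the "methods" idea concrete, here is a minimal sketch of how a `did:web` identifier maps to the HTTPS location of its DID document, as we understand the `did:web` method specification. The function name `did_web_to_url` is our own and not part of any library.
+
+```python
+from urllib.parse import unquote
+
+
+def did_web_to_url(did: str) -> str:
+    """Map a did:web identifier to the URL of its DID document (sketch)."""
+    prefix = "did:web:"
+    if not did.startswith(prefix):
+        raise ValueError(f"not a did:web identifier: {did!r}")
+    parts = did[len(prefix):].split(":")
+    # A port is written percent-encoded in the DID (":" becomes "%3A").
+    host = unquote(parts[0])
+    path = parts[1:]
+    if not path:
+        # A bare domain resolves under the .well-known directory.
+        return f"https://{host}/.well-known/did.json"
+    # Extra colon-separated segments become URL path segments.
+    return f"https://{host}/" + "/".join(path) + "/did.json"
+```
+
+Because resolution is derived from the hostname, `did:web` shares the domain-change fragility of plain URLs. Other methods such as `did:plc` make a different tradeoff.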
+
+## ARNs (Amazon Resource Names)
+
+```
+arn:aws:ec2:us-east-1:123456789012:instance/i-1234567890abcdef0
+```
+
+**Pros**: Proven at scale, encodes partition/service/region/account/resource hierarchy, enables policy-based access control.
+
+**Cons**: AWS-specific, complex syntax, assumes single operator (Amazon).
+
+**Used by**: All AWS services.
+
+## UUIDs
+
+```
+550e8400-e29b-41d4-a716-446655440000
+```
+
+**Pros**: Trivial to generate, collisions are vanishingly unlikely for random (v4) UUIDs, no coordination required.
+
+**Cons**: Opaque, no context about resource type or origin, not human-friendly, not directly resolvable.
+
+**Used by**: Databases, internal systems, anywhere uniqueness matters more than readability.
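+
+As a small illustration of coordination-free minting, the sketch below generates a random (version 4) UUID with Python's standard `uuid` module. The helper name `mint_local_id` is hypothetical.
+
+```python
+import uuid
+
+
+def mint_local_id() -> str:
+    """Mint an identifier locally, with no registry or network call."""
+    # uuid4() draws 122 random bits, so uniqueness is probabilistic
+    # rather than guaranteed.
+    return str(uuid.uuid4())
+
+
+if __name__ == "__main__":
+    print(mint_local_id())
+```
+
+The result carries no type or origin information, which is why a UUID on its own would fit better as a `{local-id}`-style component than as a complete identifier.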
+
+## Content Identifiers (CIDs)
+
+```
+bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oca...
+```
+
+**Pros**: Derived from content hash, self-verifying, enables deduplication, immutable by design.
+
+**Cons**: Long, not human-readable, requires content to generate identifier, any content change yields a new identifier.
+
+**Used by**: IPFS, Filecoin, content-addressed storage systems.
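+
+The self-verifying property can be sketched with a plain SHA-256 digest. This is not the actual CID encoding (real CIDs wrap the digest in multihash and multibase layers); the `sha256-` prefix and both function names are assumptions made for illustration.
+
+```python
+import hashlib
+
+
+def content_id(data: bytes) -> str:
+    """Derive an identifier from the content itself (simplified, not a real CID)."""
+    return "sha256-" + hashlib.sha256(data).hexdigest()
+
+
+def verify(data: bytes, identifier: str) -> bool:
+    """Check that the data still matches the identifier it was retrieved under."""
+    return content_id(data) == identifier
+```
+
+Identical bytes always produce the same identifier, which gives deduplication for free, while a single changed byte produces a different one.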
+
+# Analysis
+
+| Scheme | Unique | Resolvable | Stable | Readable | Typed | Versioned | Decentralized |
+|--------|--------|------------|--------|----------|-------|-----------|---------------|
+| URLs | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ |
+| DOIs | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
+| URNs | ✓ | ◐ | ✓ | ◐ | ◐ | ◐ | ◐ |
+| DIDs | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ |
+| ARNs | ✓ | ✓ | ✓ | ◐ | ✓ | ✗ | ✗ |
+| UUIDs | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ |
+| CIDs | ✓ | ◐ | ✓ | ✗ | ✗ | ✗ | ✓ |
+
+✓ = yes, ✗ = no, ◐ = partial/depends
+
+No existing scheme fully satisfies our requirements. This suggests either:
+
+1. Extending an existing scheme (likely URN or DID)
+2. Defining a new scheme purpose-built for OSA
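+
+The conclusion can be checked mechanically. The sketch below transcribes the table (1 = yes, 0.5 = partial, 0 = no) and treats the first three columns as the Must Have requirements. The numeric encoding and function names are our own.
+
+```python
+COLUMNS = ["unique", "resolvable", "stable", "readable",
+           "typed", "versioned", "decentralized"]
+
+# Transcribed from the analysis table: 1 = yes, 0.5 = partial, 0 = no.
+SCORES = {
+    "URLs":  [1, 1,   0, 1,   0,   0,   1],
+    "DOIs":  [1, 1,   1, 0,   0,   0,   0],
+    "URNs":  [1, 0.5, 1, 0.5, 0.5, 0.5, 0.5],
+    "DIDs":  [1, 1,   1, 0,   0,   0,   1],
+    "ARNs":  [1, 1,   1, 0.5, 1,   0,   0],
+    "UUIDs": [1, 0,   1, 0,   0,   0,   1],
+    "CIDs":  [1, 0.5, 1, 0,   0,   0,   1],
+}
+
+MUST_HAVE = 3  # unique, resolvable, stable
+
+
+def fully_satisfying() -> list[str]:
+    """Schemes with a full 'yes' in every column."""
+    return [name for name, row in SCORES.items() if all(v == 1 for v in row)]
+
+
+def meeting_must_haves() -> list[str]:
+    """Schemes with a full 'yes' in the three Must Have columns."""
+    return [name for name, row in SCORES.items()
+            if all(v == 1 for v in row[:MUST_HAVE])]
+```
+
+Under this encoding `fully_satisfying()` is empty, while `meeting_must_haves()` returns DOIs, DIDs, and ARNs, none of which is typed, versioned, and decentralized at the same time.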
+
+# Candidate: Structured Resource Names (SRNs)
+
+One option is to define a URN-based scheme that embeds the properties we need. We propose this as a starting point for discussion, not as a final specification.
+
+## Format
+
+```
+urn:osa:{node-id}:{type}:{local-id}[@{version}][#{fragment}]
+```
+
+## Components
+
+**`urn:osa:`** — Fixed prefix indicating an OSA identifier.
+
+**`{node-id}`** — The originating node. Options:
+- DNS hostname (e.g., `data.imperial.ac.uk`) — simple, enables direct resolution, but breaks if domain changes
+- DID (e.g., `did:web:data.imperial.ac.uk`) — more persistent, adds complexity
+- Opaque ID with registry lookup — most persistent, requires central infrastructure
+
+**`{type}`** — Resource type: `rec`, `dep`, `vocab`, `schema`, `val`, `tool`.
+
+**`{local-id}`** — Node-assigned identifier, opaque to clients.
+
+**`@{version}`** — Optional version suffix for immutable snapshots.
+
+**`#{fragment}`** — Optional fragment for sub-resources (e.g., vocabulary attributes).
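+
+A minimal parsing sketch for the format above, assuming that `{local-id}` contains no `:`, `@`, or `#` characters (the proposal does not yet say). The `SRN` class and `parse_srn` function are hypothetical names, not part of any OSA library.
+
+```python
+from dataclasses import dataclass
+from typing import Optional
+
+KNOWN_TYPES = {"rec", "dep", "vocab", "schema", "val", "tool"}
+
+
+@dataclass(frozen=True)
+class SRN:
+    node_id: str
+    resource_type: str
+    local_id: str
+    version: Optional[str] = None
+    fragment: Optional[str] = None
+
+
+def parse_srn(text: str) -> SRN:
+    """Parse urn:osa:{node-id}:{type}:{local-id}[@{version}][#{fragment}]."""
+    prefix = "urn:osa:"
+    if not text.startswith(prefix):
+        raise ValueError(f"not an OSA identifier: {text!r}")
+    body = text[len(prefix):]
+    # Peel the optional parts off first: fragment, then version.
+    body, _, fragment = body.partition("#")
+    body, _, version = body.partition("@")
+    # Split from the right so a node-id that itself contains colons
+    # (for example a DID) stays in one piece.
+    parts = body.rsplit(":", 2)
+    if len(parts) != 3 or not all(parts):
+        raise ValueError(f"malformed identifier: {text!r}")
+    node_id, resource_type, local_id = parts
+    if resource_type not in KNOWN_TYPES:
+        raise ValueError(f"unknown resource type: {resource_type!r}")
+    return SRN(node_id, resource_type, local_id, version or None, fragment or None)
+```
+
+Because the type sits at a fixed position, a client can validate and route an identifier this way without any network call.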
+
+## Examples
+
+```
+urn:osa:data.imperial.ac.uk:rec:xyz789@v1
+urn:osa:archive.embl.org:vocab:rnaseq@v2.1#mapped-reads-percent
+urn:osa:did:web:data.imperial.ac.uk:dep:abc123
+```
+
+## Open Questions
+
+**Node identity**: Should node-id be a DNS hostname, a DID, or something else? DNS is simple but fragile. DIDs add persistence at the cost of complexity.
+
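+**DID integration**: If nodes have DIDs (via `did:web` or similar), should the SRN embed the full DID or just the hostname with an implied DID? Embedding the full DID also puts extra colons inside `{node-id}`, so parsers can no longer split the SRN on `:` alone.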
+
+**Registration**: Should `urn:osa` be registered with IANA as a formal URN namespace? Registration adds legitimacy but also bureaucracy.
+
+**Versioning syntax**: Is `@v1` the right format? Alternatives: `/v1`, `?version=1`, separate field.
+
+**Migration**: How should identifiers survive domain changes? Options include redirect protocols, DID-based persistence, or accepting breakage as rare.
+
+# Alternative: DID-Native Approach
+
+Rather than inventing SRNs, we could use DIDs directly:
+
+```
+did:osa:data.imperial.ac.uk:rec:xyz789
+```
+
+This would require defining a `did:osa` method specifying:
+- Identifier format
+- Resolution process
+- CRUD operations on DID Documents
+
+**Pros**: Aligns with W3C standard, potential interop with Verifiable Credentials, existing DID tooling.
+
+**Cons**: DIDs are mostly used for entities rather than data resources, so this would be unconventional usage, and resolution is more complex.
+
+# Alternative: Minimal Approach
+
+Use simple URLs with conventions:
+
+```
+https://data.imperial.ac.uk/osa/records/xyz789/v1
+```
+
+Rely on HTTP redirects for persistence. Accept that URLs may break.
+
+**Pros**: Simplest to implement, no new concepts, universal tooling.
+
+**Cons**: Fragile, no type information, conflates identity with location.
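+
+A sketch of how such URLs could be built from a host, a resource type, a local ID, and an optional version. Only the `records` path segment appears in the example above. The other segments, the short type codes as dictionary keys, and the function name `resource_url` are assumptions.
+
+```python
+from typing import Optional
+from urllib.parse import quote
+
+# Only "records" comes from the example URL; the rest are guesses.
+PATH_SEGMENTS = {
+    "rec": "records",
+    "dep": "depositions",
+    "vocab": "vocabularies",
+}
+
+
+def resource_url(host: str, resource_type: str, local_id: str,
+                 version: Optional[str] = None) -> str:
+    """Build https://{host}/osa/{segment}/{local-id}[/{version}]."""
+    segment = PATH_SEGMENTS[resource_type]
+    # Percent-encode the variable parts so they cannot add path segments.
+    url = f"https://{host}/osa/{segment}/{quote(local_id, safe='')}"
+    if version is not None:
+        url += f"/{quote(version, safe='')}"
+    return url
+```
+
+Here the type is visible only through the path convention, so a client has to know the convention to recover it.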
+
+# Next Steps
+
+This OEP seeks feedback on:
+
+1. **Requirements**: Are the requirements complete and correctly prioritized?
+2. **Existing schemes**: Are there schemes we should consider that aren't listed?
+3. **SRN proposal**: Is this a reasonable starting point, or should we pursue a different direction?
+4. **Node identity**: What should node-id be? DNS hostname, DID, or hybrid?
+5. **Migration**: How important is surviving domain changes? What tradeoffs are acceptable?
+
+Based on community input, a follow-up OEP will specify the chosen scheme in detail.
+
+# References
+
+- [RFC 8141: URN Syntax](https://www.rfc-editor.org/rfc/rfc8141)
+- [W3C DID Core](https://www.w3.org/TR/did-core/)
+- [DOI Handbook](https://www.doi.org/doi_handbook/)
+- [AT Protocol Identity](https://atproto.com/guides/identity)
+- [Amazon ARN Reference](https://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html)