From eaad56278312371aeb08601279d4d8065901f7de Mon Sep 17 00:00:00 2001
From: Rory Byrne
Date: Fri, 12 Dec 2025 22:06:40 +0000
Subject: [PATCH] oep: goals

---
 oeps/OEP-0003.md | 100 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)
 create mode 100644 oeps/OEP-0003.md

diff --git a/oeps/OEP-0003.md b/oeps/OEP-0003.md
new file mode 100644
index 0000000..908a467
--- /dev/null
+++ b/oeps/OEP-0003.md
@@ -0,0 +1,100 @@
+---
+oep: 0003
+title: Goals and Scope
+type: informational
+status: ideation
+authors: Rory Byrne
+created: 2025-12-12
+labels: protocol
+---
+
+# Abstract
+
+This document defines the goals, scope, and non-goals of the Open Science Archive (OSA) protocol. It answers what problems OSA solves and what it explicitly does not attempt to solve. Read this before the architecture overview (OEP-0004).
+
+# Motivation
+
+Before diving into architecture and technical details, readers need to understand what OSA is for. This document provides that context.
+
+# Specification
+
+## Problems
+
+### Building serious data infrastructure is prohibitively hard
+
+The Protein Data Bank took many years and significant funding to develop its validation pipelines, curation workflows, and quality standards. A new field wanting similar rigor faces years of development, substantial engineering investment, and the risk of building something that doesn't get adopted. Most fields can't justify this cost, so they settle for minimal archives or none at all.
+
+### Archives don't interoperate
+
+Despite this, many archives exist. However, each one is an island with its own API, conventions, and tooling ecosystems. This fragments the landscape: a tool built for one archive must be rewritten for another. Researchers write bespoke integrations for each data source. Smaller fields get left behind entirely because they lack the user base to justify custom tooling.
+
+### Archives duplicate generic infrastructure
+
+Every new archive re-invents submission portals, validation runners, metadata schemas, search APIs, and access control. The domain-specific parts get less attention because generic plumbing consumes the budget.
+
+## Goals
+
+### Deploy an archive in days, not years
+
+A research group can spin up a production-grade archive with validation, submission workflows, and APIs without a dedicated engineering team.
+
+### Discover data by quality
+
+Researchers can search for data using quality criteria, not just metadata keywords.
+
+### Plug in domain-specific logic
+
+Communities define their own validators, curation tools, and data conversions. The protocol provides the machinery; domains provide the semantics.
+
+### Attribute every quality claim
+
+Every assertion carries provenance: who computed it, when, with what software. Users verify rather than trust.
+
+### Publish immutable, citable records
+
+Published data gets a stable identifier. Updates create new versions, not edits. Citations remain valid.
+
+## Non-Goals
+
+### OSA is not a storage provider
+
+The protocol defines how archives behave, not where bytes live. Storage is the operator's choice (local disk, S3, institutional storage).
+
+### OSA is not a compute platform
+
+Validators run during submission, but OSA is not a general-purpose compute system. It does not manage jobs, queues, or cluster resources.
+
+### OSA is not a single database
+
+There is no central OSA database. The protocol enables federation between independent nodes.
+
+### OSA does not define domain semantics
+
+The protocol does not say what "quality" means for any particular field. Communities define their own vocabularies and validators.
+
+### OSA does not enforce quality thresholds
+
+The protocol computes and exposes quality attributes. It does not decide what is "good enough". That judgment belongs to users and communities.
+
+### OSA does not replace existing archives
+
+OSA is designed to complement existing infrastructure. Index Nodes can compute attributes about data in external archives (GEO, SRA, PDB) without requiring those archives to change.
+
+# Rationale
+
+Separating goals from architecture makes it easier to evaluate whether the architecture actually serves the goals. It also helps readers who want to understand the purpose without reading technical details.
+
+The explicit non-goals prevent scope creep and set expectations. OSA is infrastructure for a specific set of problems, not a universal solution.
+
+# Backwards Compatibility
+
+N/A. This is an informational document.
+
+# Security & Privacy
+
+This document defines goals and scope. Security and privacy considerations are addressed in the relevant technical OEPs.
+
+# Open Issues
+
+- Should OSA define a minimal "core" vocabulary that all nodes understand, or is everything domain-specific?
+- How do we balance ease of deployment with operational security requirements?