feat: content-addressable storage (CAS) for scientific data files #99

@rorybyrne

Description

Summary

Replace the current path-based file storage with content-addressable storage (CAS) where files are stored by SHA256 hash and referenced by depositions, records, and manifests via hash rather than filesystem path.

Motivation

The current architecture stores files at conventional filesystem paths (depositions/{id}/files/) and moves them between lifecycle stages using rename() / shutil.copy(). This breaks on non-POSIX storage backends: an S3 mountpoint returns EPERM on metadata-preserving copies. The reference-based fix that eliminates copies is step 1 of the evolution path below.

CAS eliminates file copies entirely (store once, reference many times), enables deduplication (critical for scientific data, where reference genomes and structures are shared across thousands of depositions), and provides integrity verification: because the hash is the address, corrupted content can never go undetected.
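As a minimal sketch of that contract (in-memory store and function names are hypothetical, not from the codebase), a second write of identical bytes is a no-op and every read can be verified against its address:

```python
import hashlib

# Hypothetical in-memory blob store illustrating the CAS contract:
# identical content always maps to the same address.
_blobs: dict[str, bytes] = {}

def put(content: bytes) -> str:
    """Store content under its SHA256 address; duplicate puts are free."""
    address = "sha256:" + hashlib.sha256(content).hexdigest()
    _blobs.setdefault(address, content)
    return address

def get(address: str) -> bytes:
    """Fetch content and verify it still matches its address."""
    content = _blobs[address]
    assert "sha256:" + hashlib.sha256(content).hexdigest() == address
    return content

genome = b"ACGT" * 1024
addr_a = put(genome)   # first deposition stores the blob
addr_b = put(genome)   # second deposition references the same blob
assert addr_a == addr_b          # deduplication: one blob, many references
assert get(addr_a) == genome     # integrity: read verifies the hash
```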

Architecture

S3 Bucket / Local Filesystem:
  /blobs/
    sha256:abc123...    # Raw files stored by content hash
    sha256:def456...
  /lake/                # Future: datalake projection
    records/            # Parquet/Iceberg metadata tables
    features/           # Validator-extracted features

Record manifest with CAS references:

{
  "srn": "urn:osa:node:rec:123@1",
  "files": [
    {"name": "structure.cif", "hash": "sha256:abc123", "size": 4096},
    {"name": "metadata.json", "hash": "sha256:def456", "size": 512}
  ]
}
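A manifest in this shape can be verified end-to-end by rehashing each referenced blob. A hedged sketch (the fetch_blob callable is a hypothetical stand-in for the real blob store):

```python
import hashlib

def verify_manifest(manifest: dict, fetch_blob) -> list[str]:
    """Return the names of files whose blob no longer matches its hash/size."""
    corrupted = []
    for entry in manifest["files"]:
        content = fetch_blob(entry["hash"])
        actual = "sha256:" + hashlib.sha256(content).hexdigest()
        if actual != entry["hash"] or len(content) != entry["size"]:
            corrupted.append(entry["name"])
    return corrupted

# Example with an in-memory lookup standing in for the real store.
blob = b"x" * 4096
digest = "sha256:" + hashlib.sha256(blob).hexdigest()
manifest = {"srn": "urn:osa:node:rec:123@1",
            "files": [{"name": "structure.cif", "hash": digest, "size": 4096}]}
assert verify_manifest(manifest, lambda h: blob) == []
```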

Why CAS Matters for OSA

  • Deduplication: Same reference genome across many instances/depositions stored once
  • Federation: Content-addressed = location-independent. Mirror records between nodes without re-copying blobs
  • Datalake-ready: CAS hashes are stable identifiers for datalake metadata tables (Iceberg/Parquet). Foundation for analytics over published scientific data
  • Integrity: SHA256 hash verifies scientific data hasn't been corrupted — important for reproducibility

Evolution Path

  1. Reference-based storage (prerequisite — eliminates copies, introduces FileReference type)
  2. Content-addressable storage (this issue — FileReference gains content hash, blobs stored by hash)
  3. Datalake projection (Export domain projects manifests → Iceberg tables)
  4. Incremental lake (changefeed → streaming Iceberg updates)

Scope

  • FileReference type gains content_hash: str field (SHA256 computed on ingest)
  • New BlobStore port: put(content) → hash, get(hash) → stream, exists(hash) → bool
  • FilesystemBlobAdapter (local): stores at {base}/blobs/{hash}
  • S3BlobAdapter: stores at s3://{bucket}/blobs/{hash}
  • Deposition file upload computes hash, stores blob, records reference
  • Source ingest computes hash after container writes files
  • Garbage collection: periodic sweep of unreferenced blobs
  • Migration: backfill hashes for existing files (non-breaking, additive)
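One way the BlobStore port and the local adapter from the scope above could look; a sketch only, with method signatures assumed from the bullet points rather than taken from the codebase:

```python
import hashlib
from pathlib import Path
from typing import BinaryIO, Protocol

class BlobStore(Protocol):
    """Port from the scope: put(content) -> hash, get(hash) -> stream, exists(hash) -> bool."""
    def put(self, content: bytes) -> str: ...
    def get(self, hash_: str) -> BinaryIO: ...
    def exists(self, hash_: str) -> bool: ...

class FilesystemBlobAdapter:
    """Stores blobs at {base}/blobs/{hash}; write-once by construction."""

    def __init__(self, base: Path) -> None:
        self._blobs = base / "blobs"
        self._blobs.mkdir(parents=True, exist_ok=True)

    def put(self, content: bytes) -> str:
        hash_ = "sha256:" + hashlib.sha256(content).hexdigest()
        path = self._blobs / hash_
        if not path.exists():        # duplicate content: skip the write
            path.write_bytes(content)
        return hash_

    def get(self, hash_: str) -> BinaryIO:
        return (self._blobs / hash_).open("rb")

    def exists(self, hash_: str) -> bool:
        return (self._blobs / hash_).exists()
```

The hypothetical S3BlobAdapter would implement the same Protocol against s3://{bucket}/blobs/{hash}, which is what makes the port swappable between local and S3 backends.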

Depends On

  • Reference-based file storage (FileReference type, no-copy source→deposition flow)

Labels

  • design-needed: needs architectural discussion before implementation
  • feature: new functionality