Summary
Replace the current path-based file storage with content-addressable storage (CAS) where files are stored by SHA256 hash and referenced by depositions, records, and manifests via hash rather than filesystem path.
Motivation
The current architecture stores files at conventional filesystem paths (depositions/{id}/files/) and moves them between lifecycle stages using rename() / shutil.copy(). This breaks on non-POSIX storage backends (S3 mountpoint gives EPERM on metadata-preserving copies — see the reference-based fix that eliminates copies as step 1).
CAS eliminates file copies entirely (store once, reference many times), enables deduplication (critical for scientific data, where reference genomes/structures are shared across thousands of depositions), and provides integrity verification (the hash IS the address — corrupted bytes cannot go undetected).
Architecture
S3 Bucket / Local Filesystem:
/blobs/
    sha256:abc123...    # Raw files stored by content hash
    sha256:def456...
/lake/                  # Future: datalake projection
    records/            # Parquet/Iceberg metadata tables
    features/           # Validator-extracted features
Record manifest with CAS references:
{
  "srn": "urn:osa:node:rec:123@1",
  "files": [
    {"name": "structure.cif", "hash": "sha256:abc123", "size": 4096},
    {"name": "metadata.json", "hash": "sha256:def456", "size": 512}
  ]
}
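A consumer resolves manifest entries to blob locations by hash alone; the file name is only a human-facing label. A minimal sketch in Python (the `blob_key` helper is hypothetical; the `/blobs/` prefix follows the layout above):

```python
import json

# A record manifest as shown above: files reference blobs by content hash.
manifest = json.loads("""{
  "srn": "urn:osa:node:rec:123@1",
  "files": [
    {"name": "structure.cif", "hash": "sha256:abc123", "size": 4096},
    {"name": "metadata.json", "hash": "sha256:def456", "size": 512}
  ]
}""")


def blob_key(entry: dict) -> str:
    # The hash alone locates the bytes, independent of which node stores them.
    return f"/blobs/{entry['hash']}"


paths = [blob_key(f) for f in manifest["files"]]
# → ['/blobs/sha256:abc123', '/blobs/sha256:def456']
```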
Why CAS Matters for OSA
- Deduplication: Same reference genome across many instances/depositions stored once
- Federation: Content-addressed = location-independent. Mirror records between nodes without re-copying blobs
- Datalake-ready: CAS hashes are stable identifiers for datalake metadata tables (Iceberg/Parquet). Foundation for analytics over published scientific data
- Integrity: SHA256 hash verifies scientific data hasn't been corrupted — important for reproducibility
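The integrity property amounts to recomputing the hash on read and comparing it to the blob's address. A minimal sketch (`verify_blob` is a hypothetical helper, not part of the proposed API):

```python
import hashlib


def verify_blob(expected_hash: str, content: bytes) -> bool:
    """Recompute the SHA256 of a blob and compare it to its CAS address.

    A mismatch means the bytes were corrupted or tampered with in storage
    or transit — the address itself is the checksum.
    """
    actual = "sha256:" + hashlib.sha256(content).hexdigest()
    return actual == expected_hash
```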
Evolution Path
1. Reference-based storage (prerequisite — eliminates copies, introduces FileReference type)
2. Content-addressable storage (this issue — FileReference gains a content hash, blobs stored by hash)
3. Datalake projection (Export domain projects manifests → Iceberg tables)
4. Incremental lake (changefeed → streaming Iceberg updates)
Scope
- FileReference type gains content_hash: str field (SHA256 computed on ingest)
- New BlobStore port: put(content) → hash, get(hash) → stream, exists(hash) → bool
  - FilesystemBlobAdapter (local): stores at {base}/blobs/{hash}
  - S3BlobAdapter: stores at s3://{bucket}/blobs/{hash}
- Deposition file upload computes hash, stores blob, records reference
- Source ingest computes hash after container writes files
- Garbage collection: periodic sweep of unreferenced blobs
- Migration: backfill hashes for existing files (non-breaking, additive)
Depends On
- Reference-based file storage (FileReference type, no-copy source→deposition flow)