feat: content-addressable storage (CAS) for scientific data files #99

@rorybyrne

Description

Summary

Replace the current path-based file storage with content-addressable storage (CAS) where files are stored by SHA256 hash and referenced by depositions, records, and manifests via hash rather than filesystem path.

Motivation

The current architecture stores files at conventional filesystem paths (depositions/{id}/files/) and moves them between lifecycle stages using rename() / shutil.copy(). This breaks on non-POSIX storage backends: an S3 mountpoint returns EPERM on metadata-preserving copies. The reference-based fix that eliminates copies is step 1 of the evolution path below.

CAS eliminates file copies entirely (store once, reference many times), enables deduplication (critical for scientific data, where reference genomes and structures are shared across thousands of depositions), and provides integrity verification: because the hash is the address, corrupted content can never go undetected.
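As a minimal sketch of that contract (in-memory store and function names are hypothetical, not from the codebase), a second write of identical bytes is a no-op and every read can be verified against its address:

```python
import hashlib

# Hypothetical in-memory blob store illustrating the CAS contract:
# identical content always maps to the same address.
_blobs: dict[str, bytes] = {}

def put(content: bytes) -> str:
    """Store content under its SHA256 address; duplicate puts are free."""
    address = "sha256:" + hashlib.sha256(content).hexdigest()
    _blobs.setdefault(address, content)
    return address

def get(address: str) -> bytes:
    """Fetch content and verify it still matches its address."""
    content = _blobs[address]
    assert "sha256:" + hashlib.sha256(content).hexdigest() == address
    return content

genome = b"ACGT" * 1024
addr_a = put(genome)   # first deposition stores the blob
addr_b = put(genome)   # second deposition references the same blob
assert addr_a == addr_b          # deduplication: one blob, many references
assert get(addr_a) == genome     # integrity: read verifies the hash
```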

Architecture

S3 Bucket / Local Filesystem:
  /blobs/
    sha256:abc123...    # Raw files stored by content hash
    sha256:def456...
  /lake/                # Future: datalake projection
    records/            # Parquet/Iceberg metadata tables
    features/           # Validator-extracted features

Record manifest with CAS references:

{
  "srn": "urn:osa:node:rec:123@1",
  "files": [
    {"name": "structure.cif", "hash": "sha256:abc123", "size": 4096},
    {"name": "metadata.json", "hash": "sha256:def456", "size": 512}
  ]
}
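A manifest in this shape can be verified end-to-end by rehashing each referenced blob. A hedged sketch (the fetch_blob callable is a hypothetical stand-in for the real blob store):

```python
import hashlib

def verify_manifest(manifest: dict, fetch_blob) -> list[str]:
    """Return the names of files whose blob no longer matches its hash/size."""
    corrupted = []
    for entry in manifest["files"]:
        content = fetch_blob(entry["hash"])
        actual = "sha256:" + hashlib.sha256(content).hexdigest()
        if actual != entry["hash"] or len(content) != entry["size"]:
            corrupted.append(entry["name"])
    return corrupted

# Example with an in-memory lookup standing in for the real store.
blob = b"x" * 4096
digest = "sha256:" + hashlib.sha256(blob).hexdigest()
manifest = {"srn": "urn:osa:node:rec:123@1",
            "files": [{"name": "structure.cif", "hash": digest, "size": 4096}]}
assert verify_manifest(manifest, lambda h: blob) == []
```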

Why CAS Matters for OSA

  • Deduplication: Same reference genome across many instances/depositions stored once
  • Federation: Content-addressed = location-independent. Mirror records between nodes without re-copying blobs
  • Datalake-ready: CAS hashes are stable identifiers for datalake metadata tables (Iceberg/Parquet). Foundation for analytics over published scientific data
  • Integrity: SHA256 hash verifies scientific data hasn't been corrupted — important for reproducibility

Evolution Path

  1. Reference-based storage (prerequisite — eliminates copies, introduces FileReference type)
  2. Content-addressable storage (this issue — FileReference gains content hash, blobs stored by hash)
  3. Datalake projection (Export domain projects manifests → Iceberg tables)
  4. Incremental lake (changefeed → streaming Iceberg updates)

Scope

  • FileReference type gains content_hash: str field (SHA256 computed on ingest)
  • New BlobStore port: put(content) → hash, get(hash) → stream, exists(hash) → bool
  • FilesystemBlobAdapter (local): stores at {base}/blobs/{hash}
  • S3BlobAdapter: stores at s3://{bucket}/blobs/{hash}
  • Deposition file upload computes hash, stores blob, records reference
  • Source ingest computes hash after container writes files
  • Garbage collection: periodic sweep of unreferenced blobs
  • Migration: backfill hashes for existing files (non-breaking, additive)
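One way the BlobStore port and the local adapter from the scope above could look; a sketch only, with method signatures assumed from the bullet points rather than taken from the codebase:

```python
import hashlib
from pathlib import Path
from typing import BinaryIO, Protocol

class BlobStore(Protocol):
    """Port from the scope: put(content) -> hash, get(hash) -> stream, exists(hash) -> bool."""
    def put(self, content: bytes) -> str: ...
    def get(self, hash_: str) -> BinaryIO: ...
    def exists(self, hash_: str) -> bool: ...

class FilesystemBlobAdapter:
    """Stores blobs at {base}/blobs/{hash}; write-once by construction."""

    def __init__(self, base: Path) -> None:
        self._blobs = base / "blobs"
        self._blobs.mkdir(parents=True, exist_ok=True)

    def put(self, content: bytes) -> str:
        hash_ = "sha256:" + hashlib.sha256(content).hexdigest()
        path = self._blobs / hash_
        if not path.exists():        # duplicate content: skip the write
            path.write_bytes(content)
        return hash_

    def get(self, hash_: str) -> BinaryIO:
        return (self._blobs / hash_).open("rb")

    def exists(self, hash_: str) -> bool:
        return (self._blobs / hash_).exists()
```

The hypothetical S3BlobAdapter would implement the same Protocol against s3://{bucket}/blobs/{hash}, which is what makes the port swappable between local and S3 backends.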

Depends On

  • Reference-based file storage (FileReference type, no-copy source→deposition flow)

Labels

  • design-needed: needs architectural discussion before implementation
  • feature: new functionality