Summary
Make S3 (or CAS blob storage) the canonical, durable store for hook-produced features (e.g., protein pocket predictions, alignment metrics, QC checks). PostgreSQL materialization becomes an optional read-optimized projection that can be rebuilt from the canonical source at any time.
Motivation
Currently, hook-produced features flow through a one-way pipeline:
Hook k8s Job → writes features.json to disk → InsertRecordFeatures reads it
→ creates dynamic PG table → inserts rows → features.json is ephemeral
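For concreteness, the rows inside features.json might look like the struct below. This is a hypothetical sketch: the field names and types are assumptions for illustration, not the actual hook schema.

```go
// PocketFeature is a hypothetical shape for one row of a hook's
// features.json output. Field names and types are illustrative
// assumptions, not the real schema.
type PocketFeature struct {
	RecordID string  `json:"record_id"` // record the feature belongs to
	PocketID int     `json:"pocket_id"` // index of the predicted pocket
	Score    float64 `json:"score"`     // prediction confidence
	Method   string  `json:"method"`    // hook/tool that produced the value
}
```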
PostgreSQL is the sole source of truth. If the database is lost or a feature table needs to be rebuilt (schema change, index update, bug fix), the original data is gone. The features.json written by the hook is a transport artifact that is not durably stored.
This is fragile for a system that stores scientific data. Hook outputs are computed artifacts that may be expensive to regenerate (hours of compute, external API calls). They should be stored durably alongside the record's data files.
Proposed Architecture
Hook k8s Job → writes features.json
→ stored durably in S3/CAS as a canonical artifact (source of truth)
→ optionally materialized into PG for fast SQL queries (read projection)
This aligns with the existing CQRS pattern: S3 is the write model, PG feature tables are the read model. Materialization is an event-driven projection that can be replayed.
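A minimal sketch of the proposed write path, assuming the AWS SDK for Go v2; the bucket name, the records/&lt;id&gt;/features/&lt;hook&gt;/&lt;sha256&gt;.json key scheme, and the "features-stored" event are all invented for illustration, not the system's actual conventions:

```go
package main

import (
	"bytes"
	"context"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// StoreFeatures uploads a hook's features.json to S3 under a
// content-addressed key, making the blob the canonical, durable copy.
// The bucket layout and key scheme here are invented for this sketch.
func StoreFeatures(ctx context.Context, client *s3.Client, bucket, recordID, hookName string, data []byte) (string, error) {
	sum := sha256.Sum256(data) // the content hash doubles as the CAS address
	key := fmt.Sprintf("records/%s/features/%s/%s.json",
		recordID, hookName, hex.EncodeToString(sum[:]))

	_, err := client.PutObject(ctx, &s3.PutObjectInput{
		Bucket:      aws.String(bucket),
		Key:         aws.String(key),
		Body:        bytes.NewReader(data),
		ContentType: aws.String("application/json"),
	})
	if err != nil {
		return "", fmt.Errorf("store features: %w", err)
	}
	// The caller may now emit a "features-stored" event; a materializer
	// consuming that event is what projects the blob into PG.
	return key, nil
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		panic(err)
	}
	data, err := os.ReadFile("features.json")
	if err != nil {
		panic(err)
	}
	key, err := StoreFeatures(ctx, s3.NewFromConfig(cfg), "example-bucket",
		"rec-123", "pocket-predictor", data)
	if err != nil {
		panic(err)
	}
	fmt.Println("canonical copy at", key)
}
```

Because the key is derived from the payload's hash, re-uploading identical output is idempotent and distinct outputs never overwrite each other, which is part of what makes replaying the projection safe.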
Key Properties
- S3 is canonical: feature data is durable, versioned by content hash (when CAS lands), and always recoverable
- PG is optional: operators can choose which features to materialize for SQL query performance
- Rebuildable: PG feature tables can be dropped and rebuilt from S3 at any time (e.g., after schema evolution); see the rebuild sketch after this list
- Decoupled: feature storage policy is independent of how records are created (deposition, harvest, fork, import)
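Rebuilding is then a pure replay of the canonical blobs. A minimal sketch, assuming a features_pocket table, a Postgres driver behind database/sql, and the hypothetical PocketFeature shape from the Motivation sketch (the table name, key prefix, and column names are all illustrative):

```go
import (
	"context"
	"database/sql"
	"encoding/json"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// RebuildFeatureTable empties a PG feature table and repopulates it
// from the canonical S3 blobs. The table name, key prefix, and the
// PocketFeature shape (from the earlier sketch) are illustrative
// assumptions.
func RebuildFeatureTable(ctx context.Context, db *sql.DB, client *s3.Client, bucket, prefix string) error {
	if _, err := db.ExecContext(ctx, `TRUNCATE features_pocket`); err != nil {
		return err
	}
	pages := s3.NewListObjectsV2Paginator(client, &s3.ListObjectsV2Input{
		Bucket: aws.String(bucket),
		Prefix: aws.String(prefix),
	})
	for pages.HasMorePages() {
		page, err := pages.NextPage(ctx)
		if err != nil {
			return err
		}
		for _, obj := range page.Contents {
			out, err := client.GetObject(ctx, &s3.GetObjectInput{
				Bucket: aws.String(bucket),
				Key:    obj.Key,
			})
			if err != nil {
				return err
			}
			var rows []PocketFeature
			err = json.NewDecoder(out.Body).Decode(&rows)
			out.Body.Close()
			if err != nil {
				return err
			}
			// A production rebuild would batch these inside a transaction.
			for _, r := range rows {
				if _, err := db.ExecContext(ctx,
					`INSERT INTO features_pocket (record_id, pocket_id, score, method)
					 VALUES ($1, $2, $3, $4)`,
					r.RecordID, r.PocketID, r.Score, r.Method); err != nil {
					return err
				}
			}
		}
	}
	return nil
}
```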
Relationship to Other Issues
Open Questions
- Should materialization be opt-in per hook definition (e.g., materialize: true in HookDefinition, as sketched after this list), or should all features be materialized by default with an opt-out?
- Should the canonical S3 representation be the raw features.json from the hook, or a normalized format?
- When PG tables are rebuilt from S3, how is schema evolution handled (e.g., hook adds a new column)?
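On the first question, the opt-in variant might be a single field on the hook spec. A hypothetical sketch (the fields besides materialize are invented for illustration):

```go
// Hypothetical extension of HookDefinition with an opt-in flag.
// Only Materialize is the point of this sketch; the other fields
// are invented for illustration.
type HookDefinition struct {
	Name        string `yaml:"name"`
	Image       string `yaml:"image"`       // k8s Job image that computes the features
	Materialize bool   `yaml:"materialize"` // if true, project features.json into a PG table
}
```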