feat: make S3 the canonical store for hook-produced features #105

@rorybyrne

Summary

Make S3 (or CAS blob storage) the canonical, durable store for hook-produced features (e.g., protein pocket predictions, alignment metrics, QC checks). PostgreSQL materialization becomes an optional read-optimized projection that can be rebuilt from the canonical source at any time.

Motivation

Currently, hook-produced features flow through a one-way pipeline:

Hook k8s Job → writes features.json to disk → InsertRecordFeatures reads it
→ creates dynamic PG table → inserts rows → features.json is ephemeral

PostgreSQL is the sole source of truth. If the database is lost or a feature table needs to be rebuilt (schema change, index update, bug fix), the original data is gone. The features.json written by the hook is a transport artifact that is not durably stored.

This is fragile for a system that stores scientific data. Hook outputs are computed artifacts that may be expensive to regenerate (hours of compute, external API calls). They should be stored durably alongside the record's data files.

Proposed Architecture

Hook k8s Job → writes features.json
  → stored durably in S3/CAS as a canonical artifact (source of truth)
  → optionally materialized into PG for fast SQL queries (read projection)

This aligns with the existing CQRS pattern: S3 is the write model and the PG feature tables are the read model. Materialization is an event-driven projection that can be replayed from the canonical artifacts.
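A minimal sketch of the two steps, not the project's implementation. It assumes features.json is a list of flat JSON objects, uses a plain dict as a stand-in for the S3/CAS bucket and sqlite3 as a stand-in for PostgreSQL, and the key layout and function names (`store_canonical`, `materialize`) are hypothetical:

```python
import json
import sqlite3

# Hypothetical stand-ins: a dict plays the role of the S3/CAS bucket and
# sqlite3 plays the role of PostgreSQL. Key layout and names are assumptions.
BlobStore = dict[str, bytes]


def store_canonical(store: BlobStore, record_id: str, hook: str, payload: bytes) -> str:
    """Write path: persist the hook's features.json bytes unchanged."""
    key = f"records/{record_id}/features/{hook}/features.json"
    store[key] = payload
    return key


def materialize(conn: sqlite3.Connection, store: BlobStore, key: str, table: str) -> int:
    """Read projection: load rows from the canonical blob into a SQL table."""
    rows = json.loads(store[key])  # assumed shape: a list of flat JSON objects
    if not rows:
        return 0
    columns = sorted(rows[0])
    col_sql = ", ".join(f'"{c}"' for c in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_sql})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'INSERT INTO "{table}" ({col_sql}) VALUES ({placeholders})',
        [tuple(row.get(c) for c in columns) for row in rows],
    )
    return len(rows)
```

The canonical write always happens; `materialize` is a separate call, so skipping it leaves the data intact in the store.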

Key Properties

  • S3 is canonical: feature data is durable, versioned by content hash (when CAS lands), and always recoverable
  • PG is optional: operators can choose which features to materialize for SQL query performance
  • Rebuildable: PG feature tables can be dropped and rebuilt from S3 at any time (e.g., after schema evolution)
  • Decoupled: feature storage policy is independent of how records are created (deposition, harvest, fork, import)
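The "versioned by content hash" and "rebuildable" properties could look roughly like the sketch below. It is illustrative only: the `cas/sha256/` key prefix, the two-column table layout, and the use of sqlite3 in place of PostgreSQL are assumptions, and the real CAS design has not landed yet.

```python
import hashlib
import json
import sqlite3


def content_key(payload: bytes) -> str:
    """Content-addressed key: identical bytes always map to the same key."""
    return "cas/sha256/" + hashlib.sha256(payload).hexdigest()


def rebuild_table(conn: sqlite3.Connection, blobs: dict[str, bytes], table: str) -> int:
    """Drop the projection and replay every canonical blob into a fresh table."""
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(f'CREATE TABLE "{table}" (blob_key TEXT, row_json TEXT)')
    count = 0
    for key in sorted(blobs):  # deterministic replay order
        for row in json.loads(blobs[key]):  # assumed: each blob is a JSON list of rows
            conn.execute(
                f'INSERT INTO "{table}" VALUES (?, ?)',
                (key, json.dumps(row, sort_keys=True)),
            )
            count += 1
    return count
```

Because the table is derived entirely from the stored blobs, running `rebuild_table` twice yields the same contents.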

Relationship to Other Issues

Open Questions

  • Should materialization be opt-in per hook definition (e.g., materialize: true in HookDefinition), or should all features be materialized by default with an opt-out?
  • Should the canonical S3 representation be the raw features.json from the hook, or a normalized format?
  • When PG tables are rebuilt from S3, how is schema evolution handled (e.g., hook adds a new column)?
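To make the first and third questions concrete, here is one possible shape, offered as a discussion aid and not a proposal. The `HookDefinition` fields other than `materialize` are hypothetical, and the column-union approach to schema evolution is just one option (rows written before a column existed get NULL for it):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HookDefinition:
    # Hypothetical shape: only `materialize` is named in this issue.
    name: str
    materialize: bool = False  # opt-in variant; default True would be opt-out


def hooks_to_materialize(hooks):
    """Names of the hooks whose features should be projected into PG."""
    return [h.name for h in hooks if h.materialize]


def union_columns(payloads):
    """Columns for a rebuilt table: every key seen in any stored row, first-seen order."""
    columns = []
    for rows in payloads:
        for row in rows:
            for key in row:
                if key not in columns:
                    columns.append(key)
    return columns


def normalize(row, columns):
    """Align a row to the unioned columns; missing values become None (NULL)."""
    return tuple(row.get(c) for c in columns)
```

With this approach, a hook that adds a column later does not invalidate older canonical artifacts; the rebuild simply widens the table.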


Labels: design-needed (Needs architectural discussion before implementation), feature (New functionality)
