
feat: typed metadata tables + expressive REST query DSL #76

@rorybyrne


Summary

Generate typed PostgreSQL tables from Schema FieldDefinition[] (mirroring the existing feature-table DDL pattern) and extend the existing /discovery/* REST endpoints to support compound filters (nested AND/OR), pgvector similarity, and full-text search. This gives records, metadata, and features a unified typed query surface using the DDD/CQRS machinery that already exists.

Context

  • Records have metadata (currently JSONB in records.metadata) and features (typed PG tables in features.*)
  • Feature tables are already generated dynamically via build_feature_table() in server/osa/infrastructure/persistence/feature_table.py
  • Current discovery surface: POST /discovery/records + POST /discovery/features/{hook} — flat AND filters with EQ / CONTAINS / GTE / LTE, keyset pagination
  • DiscoveryReadStore protocol in server/osa/domain/discovery/port/read_store.py already abstracts the read path — the service (service/discovery.py) does validation + operator whitelisting
  • Gap: metadata is raw JSONB (no typed filtering), no OR/compound logic, no vector or full-text operators, no cross-table joins

The expressive query surface (GraphQL vs MCP vs typed REST DSL) for LLM and agent consumers is tracked separately in #125 — that decision builds on top of this one.

The metadata gap

records.metadata is JSONB. You can filter via PG JSONB containment but you cannot express typed range queries, cannot index effectively, and cannot expose typed filters through any auto-generated API layer.

Fix: generate typed metadata tables from Schema FieldDefinition[], same pattern as the existing feature tables. A convention with a schema defining species: term, resolution: number, title: text gets a metadata.<convention_name> table with real typed columns.
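A minimal sketch of the DDL this could generate, assuming the example schema above. `PG_TYPES` and `build_metadata_ddl` are hypothetical names; the real implementation would reuse the `build_feature_table()` machinery rather than emit raw strings:

```python
# Hypothetical sketch only: derive CREATE TABLE DDL for a metadata table
# from a convention's field definitions. Type names are illustrative.
PG_TYPES = {"term": "TEXT", "number": "DOUBLE PRECISION", "text": "TEXT"}

def build_metadata_ddl(convention_name: str, fields: dict) -> str:
    # One typed column per FieldDefinition, plus an FK back to records.
    cols = ",\n  ".join(
        f'"{name}" {PG_TYPES[ftype]}' for name, ftype in fields.items()
    )
    return (
        f'CREATE TABLE metadata."{convention_name}" (\n'
        f"  record_id UUID PRIMARY KEY REFERENCES records (id),\n"
        f"  {cols}\n"
        f")"
    )

ddl = build_metadata_ddl(
    "bird_survey",  # illustrative convention name
    {"species": "term", "resolution": "number", "title": "text"},
)
```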

Dynamic tables today

Feature tables use build_feature_table() with column type mapping in column_mapper.py:

| JSON type | Format    | PostgreSQL type  |
|-----------|-----------|------------------|
| string    |           | Text             |
| string    | date-time | DateTime(tz)     |
| string    | date      | Date             |
| string    | uuid      | Uuid             |
| number    |           | Float(53)        |
| integer   |           | BigInteger       |
| boolean   |           | Boolean          |
| array     |           | JSONB            |
| object    |           | JSONB            |
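The dispatch above can be sketched as a `(JSON type, format)` lookup. This is an illustrative reconstruction, not the actual `column_mapper.py` code, and SQLAlchemy types are shown as strings to keep it self-contained:

```python
# Illustrative (json_type, format) -> PostgreSQL type lookup mirroring
# the table above. The real mapper returns SQLAlchemy type objects.
COLUMN_TYPE_MAP = {
    ("string", None): "Text",
    ("string", "date-time"): "DateTime(timezone=True)",
    ("string", "date"): "Date",
    ("string", "uuid"): "Uuid",
    ("number", None): "Float(53)",
    ("integer", None): "BigInteger",
    ("boolean", None): "Boolean",
    ("array", None): "JSONB",
    ("object", None): "JSONB",
}

def pg_type(json_type: str, fmt=None) -> str:
    # Unrecognized formats fall back to the bare JSON type's mapping.
    return COLUMN_TYPE_MAP.get((json_type, fmt)) or COLUMN_TYPE_MAP[(json_type, None)]
```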

Feature tables live in the features PG schema, tracked in a feature_tables catalog table. Metadata tables will follow the same pattern in a metadata PG schema, tracked in a metadata_tables catalog.

Query DSL extensions

Extend Filter and DiscoveryReadStore to support:

  1. Compound groups — nested {and: [...]} / {or: [...]} / {not: ...} trees instead of a flat AND list
  2. Cross-table join — filter records by feature column values (e.g. records where features.cell_classifier.confidence > 0.9)
  3. Vector similarity — pgvector operator as a first-class filter + orderBy (requires a vector column on the relevant table)
  4. Full-text — tsvector @@ operator with ts_rank ordering
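Taken together, a request body might look like the following. Field names, operator spellings, and the orderBy shape are all assumptions for illustration, not a committed API:

```python
# Hypothetical POST /discovery/records body once Phases 2-4 land.
query = {
    "filter": {
        "and": [
            {"field": "metadata.species", "op": "EQ", "value": "Larus argentatus"},
            {
                "or": [
                    {"field": "metadata.resolution", "op": "GTE", "value": 10},
                    # Cross-table reference into a feature table (Phase 3).
                    {"field": "features.cell_classifier.confidence", "op": "GTE", "value": 0.9},
                ]
            },
            {"not": {"field": "metadata.title", "op": "CONTAINS", "value": "draft"}},
        ]
    },
    # Vector-similarity ordering (Phase 4); shape is illustrative.
    "orderBy": {"similarity": {"field": "embedding", "vector": [0.12, -0.04, 0.91]}},
    "limit": 50,
}
```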

Validation lives in DiscoveryService (already the pattern). SQL compilation lives in the PostgresDiscoveryReadStore adapter (already the pattern). No new service, no new process.

Implementation steps

Phase 1: Typed metadata tables

  • Generate metadata PG tables from Schema FieldDefinition[] (replicate feature-table DDL pattern in a metadata PG schema)
  • Add metadata_tables catalog (mirrors feature_tables)
  • FK constraint from feature tables to records (refactor: add foreign key constraint from feature tables to records #75)
  • FK constraint from metadata tables to records
  • Populate metadata tables during record publication (ingest + deposition paths)
  • Migrate existing records' JSONB metadata → typed tables (one-time)
  • Deprecate direct reads from records.metadata JSONB
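One possible shape for the one-time JSONB backfill, continuing the hypothetical convention from earlier. Table and column names are assumptions, and in practice the statement would be generated per convention and scoped to that convention's records:

```python
# Hypothetical one-time migration: copy typed values out of the
# records.metadata JSONB column into the new typed table. In practice
# this would be generated per convention and filtered to its records.
BACKFILL_SQL = """
INSERT INTO metadata."bird_survey" (record_id, species, resolution, title)
SELECT id,
       metadata->>'species',
       (metadata->>'resolution')::double precision,
       metadata->>'title'
FROM records
ON CONFLICT (record_id) DO NOTHING
"""
```

The `ON CONFLICT ... DO NOTHING` guard makes the backfill idempotent, so it can be rerun safely if the migration is interrupted.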

Phase 2: Compound filter DSL

  • Design the filter AST (nested AND/OR/NOT groups over existing operators)
  • Extend Filter value object in domain/discovery/model/value.py
  • Extend validation in DiscoveryService to walk compound trees
  • Extend PostgresDiscoveryReadStore to compile compound filters to SQLAlchemy
  • Update POST /discovery/records and POST /discovery/features/{hook} request schemas
  • Maintain backward compat with the current flat-AND shape for one release
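The AST and the validation walk might look like this. Class names are assumptions, not the actual value objects in domain/discovery/model/value.py, and the operator whitelist is the current flat-filter set:

```python
from dataclasses import dataclass

# Sketch of a possible Phase 2 filter AST; names are illustrative.
@dataclass(frozen=True)
class Condition:
    field: str
    op: str
    value: object

@dataclass(frozen=True)
class Group:
    kind: str        # "and" | "or" | "not"
    children: tuple  # Condition or Group nodes

ALLOWED_OPS = {"EQ", "CONTAINS", "GTE", "LTE"}

def validate(node) -> None:
    # Walk the compound tree, whitelisting operators the same way
    # DiscoveryService validates flat filters today.
    if isinstance(node, Condition):
        if node.op not in ALLOWED_OPS:
            raise ValueError(f"operator not allowed: {node.op}")
    else:
        if node.kind == "not" and len(node.children) != 1:
            raise ValueError("'not' takes exactly one child")
        for child in node.children:
            validate(child)
```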

Phase 3: Cross-table joins

  • Allow record filters to reference feature-table columns (e.g. features.<hook>.<column>)
  • Compile to SQL JOINs with proper indexing
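Illustrative only: how a `features.<hook>.<column>` reference might be split and compiled. Real compilation would go through SQLAlchemy inside PostgresDiscoveryReadStore, not raw strings, and would need join aliasing once multiple hooks appear in one query:

```python
# Hypothetical sketch: turn a features.<hook>.<column> path into
# JOIN + WHERE fragments (single-hook case, no aliasing).
def compile_feature_filter(path: str, op: str, param: str) -> tuple:
    schema, hook, column = path.split(".", 2)
    if schema != "features":
        raise ValueError(f"not a feature reference: {path}")
    join = f'JOIN features."{hook}" f ON f.record_id = records.id'
    where = f'f."{column}" {op} {param}'
    return join, where
```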

Phase 4: Vector + full-text operators

  • Add pgvector similarity filter + orderBy
  • Add tsvector full-text filter with ts_rank ordering
  • Document indexing requirements for each
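A sketch of the SQL fragments each Phase 4 operator would compile to, with the index each one needs noted alongside. Placeholder names are illustrative:

```python
# Hypothetical SQL fragments for the Phase 4 operators.

def vector_order_by(column: str) -> str:
    # pgvector cosine-distance ordering; wants an HNSW or IVFFlat
    # index on the vector column to avoid a sequential scan.
    return f"ORDER BY {column} <=> %(query_vec)s::vector"

def fulltext_where(column: str) -> str:
    # tsvector match; wants a GIN index on to_tsvector('english', column),
    # or a stored/generated tsvector column, plus ts_rank for ordering.
    return (
        f"WHERE to_tsvector('english', {column}) "
        f"@@ plainto_tsquery('english', %(q)s)"
    )
```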

Phase 5: Parquet export

  • GET /export/conventions/{srn}/records.parquet — query metadata + feature tables, stream Parquet via PyArrow
  • Optional filter query parameters (reuse the compound DSL)

Depends on

Related

Metadata

Assignees

No one assigned

Labels

  • design-needed: Needs architectural discussion before implementation
  • feature: New functionality
  • infrastructure: CI, Docker, deployment, migrations
