
feat: typed metadata tables + compound discovery DSL (#76) #128

Merged

rorybyrne merged 9 commits into main from 076-feat-typed-metadata on Apr 23, 2026
Conversation

@rorybyrne
Contributor

Closes #76.

Summary

  • Typed PostgreSQL tables per schema (metadata.<slug>_v<major>), generated from FieldDefinition[] using the same dynamic-DDL pattern as feature tables.
  • Compound FilterExpr DSL for /discovery/* (nested AND/OR/NOT + predicates), replacing the flat filter list.
  • Schema identity is now a user-supplied slug (<id>@<semver>, e.g. pdb-structure@1.0.0), no server-generated UUIDs.
  • Metadata projection is synchronous dual-write inside RecordService — both records (JSONB, canonical for cross-schema reads) and metadata.<slug>_v<major> (typed, canonical for schema-pinned discovery) land in one transaction.
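
The dual-write invariant can be sketched in miniature. The snippet below uses sqlite3 as a stand-in for PostgreSQL; the table and column names (`records`, `metadata_pdb_structure_v1`, `pdb_id`) are illustrative, not the service's actual DDL.

```python
# Minimal sketch of synchronous dual-write: canonical JSONB row and typed
# projection committed in one transaction, so the two can never disagree.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (srn TEXT PRIMARY KEY, metadata TEXT)")
conn.execute(
    "CREATE TABLE metadata_pdb_structure_v1 "
    "(record_srn TEXT PRIMARY KEY, pdb_id TEXT NOT NULL)"
)

def publish_record(srn: str, metadata: dict) -> None:
    """Write the canonical row and the typed projection atomically."""
    with conn:  # one transaction: both inserts commit, or neither does
        conn.execute(
            "INSERT INTO records (srn, metadata) VALUES (?, ?)",
            (srn, json.dumps(metadata)),
        )
        conn.execute(
            "INSERT INTO metadata_pdb_structure_v1 (record_srn, pdb_id) "
            "VALUES (?, ?)",
            (srn, metadata["pdb_id"]),  # typed column enforces NOT NULL
        )

publish_record("srn:rec:1", {"pdb_id": "1ABC"})
try:
    publish_record("srn:rec:2", {"pdb_id": None})  # violates typed constraint
except sqlite3.IntegrityError:
    pass  # the canonical insert rolled back too
```

After the failed publish, neither table contains `srn:rec:2`, which is the property the PR relies on for schema-pinned discovery.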

What's in this branch

Three commits:

  1. 8167cea feat: add compound filter expressions and typed metadata tables — the original feature. FilterExpr discriminated union, build_metadata_table(), metadata_tables catalog, typed /discovery/records + /discovery/features/{hook} compilation, schema-pinned and unscoped paths.

  2. a356770 refactor: replace schema SRN with short-form schema ID throughout system — internal types use SchemaId ("<id>@<semver>"); SchemaSRN reserved for federation edges. Schemas now require a user-supplied slug at create time (matched to ^[a-z][a-z0-9-]{2,63}$). SDK's MetadataSchema subclasses enforce __schema_id__ at class-definition time. api_naming.py seam added for future API↔storage divergence.

  3. 7041682 refactor: replace async metadata projection with synchronous dual-write — RecordService.bulk_publish / publish_record write to both records and the typed table in one transaction. ConventionService.create_convention creates the typed table inline (no race window). Deletes the EnsureMetadataTable, InsertRecordMetadata, and InsertBatchMetadata event handlers. PG type constraints become the metadata validator.
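
As a rough illustration of the schema-identity rules in commit 2, a parser for the `<id>@<semver>` form might look like the following. Only the slug regex is taken from the commit description; `parse_schema_id` is a hypothetical helper name, not the SDK's actual API.

```python
import re

# Slug rule quoted in the commit description above.
SLUG_RE = re.compile(r"^[a-z][a-z0-9-]{2,63}$")

def parse_schema_id(schema_id: str) -> tuple[str, str]:
    """Split '<id>@<semver>' into (slug, version), validating the slug."""
    slug, sep, version = schema_id.partition("@")
    if not sep or not version:
        raise ValueError(f"expected '<id>@<semver>', got {schema_id!r}")
    if not SLUG_RE.fullmatch(slug):
        raise ValueError(f"invalid schema slug: {slug!r}")
    return slug, version

assert parse_schema_id("pdb-structure@1.0.0") == ("pdb-structure", "1.0.0")
```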

Architectural notes

  • JSONB stays canonical for cross-schema reads. GET /records/{srn}, unscoped discovery listings, and federation exports all read from records.metadata. Typed tables serve schema-pinned queries. Both are written atomically so they can't disagree.
  • Schema pin is required whenever a query touches metadata.<field> predicates, a non-default sort, or free-text q. Plain listings (no filter, default sort, no q) may omit it.
  • Features stay event-driven. Features come from hook execution (async by nature — OCI containers). Only metadata projection moves to synchronous dual-write.

Test plan

  • Unit tests pass (1052 tests)
  • Integration tests pass (88 tests — new coverage for insert_many, dual-write, synchronous table creation, rollback-on-malformed-data)
  • ruff check and ty check clean
  • Smoke test on osa_pdb: fresh ingest populates both records and metadata.pdb_structure_v1
  • getRecordByPdbId returns the expected record for newly ingested PDB IDs
  • Frontend (Pockets) can pin schema=pdb-structure@1.0.0 and filter by metadata.pdb_id

Migration

Migrations in this branch (076_*):

  • 076_add_metadata_schema_and_catalog — metadata PG schema + metadata_tables catalog + pgvector enable + FK on features.*
  • 076_schemas_to_id — splits schemas.srn into (id, version) PK, splits conventions.schema_srn into (schema_id, schema_version)
  • 076_add_records_schema_srn — adds records.schema_id + schema_version with backfill

Implement feature 076 with compound FilterExpr trees (And/Or/Not/Predicate)
replacing flat Filter lists, typed field references (metadata.field,
features.hook.column), and per-schema metadata tables for optimized queries.

Add MetadataStore port with PostgreSQL implementation for schema-keyed DDL
lifecycle, EnsureMetadataTable/InsertRecordMetadata event handlers, and
discovery adapter support for cross-domain joins with configurable bounds.

test: add comprehensive integration and unit tests for metadata system

Add extensive test coverage for metadata table management, discovery
filtering, and schema evolution including:

- Cross-domain JOIN tests between records and features
- Typed metadata table creation and evolution validation
- Filter expression parsing and validation with AND/OR/NOT support
- Metadata service integration with PostgreSQL backend
- Schema evolution bounds checking and non-additive rejection
- Event handler tests for metadata table lifecycle
- Field reference resolution and cursor encoding tests

Replace URN-based schema references with short-form `(id, version)` pairs to simplify internal operations while reserving full URNs for federation edges.

Key changes:
- Replace `SchemaSRN` with `SchemaId` in domain models and services
- Update database schema: `schemas.srn` → `schemas.id` + `schemas.version`
- Update `conventions.schema_srn` → `conventions.schema_id` + `conventions.schema_version`
- Update `records.schema_srn` → `records.schema_id` + `records.schema_version`
- Add `SchemaIdentifier` type for human-readable schema slugs
- Update metadata table catalog to use `schema_id` instead of `schema_identity`
- Require pinned schema for metadata filtering (remove JSONB fallback)
- Add schema ID validation and coercion in API layer
- Update discovery queries to work with short-form schema references

refactor: replace SchemaSRN with SchemaId across domain models

Replace SchemaSRN with simplified SchemaId that only contains id and
version without full URN structure. Update all references in models,
services, tests and handlers to use the new SchemaId format.

Move metadata table creation and row insertion from event handlers to
inline operations within ConventionService and RecordService to
eliminate race conditions and ensure atomicity between canonical
records and typed metadata tables.
@github-actions

github-actions Bot commented Apr 22, 2026

Code Coverage

Package Line Rate Complexity
. 81% 0
application 100% 0
application.api 100% 0
application.api.rest 85% 0
application.api.v1 82% 0
application.api.v1.routes 64% 0
application.event 100% 0
domain 100% 0
domain.auth 100% 0
domain.auth.command 90% 0
domain.auth.event 100% 0
domain.auth.model 93% 0
domain.auth.port 99% 0
domain.auth.query 93% 0
domain.auth.service 91% 0
domain.auth.util 100% 0
domain.auth.util.di 79% 0
domain.curation 100% 0
domain.curation.adapter 100% 0
domain.curation.command 100% 0
domain.curation.event 100% 0
domain.curation.handler 92% 0
domain.curation.model 100% 0
domain.curation.port 100% 0
domain.curation.query 100% 0
domain.curation.service 100% 0
domain.deposition 100% 0
domain.deposition.adapter 100% 0
domain.deposition.command 87% 0
domain.deposition.event 100% 0
domain.deposition.handler 100% 0
domain.deposition.model 98% 0
domain.deposition.port 100% 0
domain.deposition.query 86% 0
domain.deposition.service 97% 0
domain.deposition.util.di 94% 0
domain.discovery 100% 0
domain.discovery.model 96% 0
domain.discovery.port 100% 0
domain.discovery.query 100% 0
domain.discovery.service 84% 0
domain.discovery.util 100% 0
domain.discovery.util.di 95% 0
domain.export 100% 0
domain.export.adapter 100% 0
domain.export.command 100% 0
domain.export.event 100% 0
domain.export.model 100% 0
domain.export.port 100% 0
domain.export.query 100% 0
domain.export.service 100% 0
domain.feature 100% 0
domain.feature.event 0% 0
domain.feature.handler 68% 0
domain.feature.model 0% 0
domain.feature.port 100% 0
domain.feature.service 96% 0
domain.feature.util 100% 0
domain.feature.util.di 100% 0
domain.index 100% 0
domain.index.event 100% 0
domain.index.handler 75% 0
domain.index.model 84% 0
domain.index.service 100% 0
domain.ingest 100% 0
domain.ingest.command 79% 0
domain.ingest.event 100% 0
domain.ingest.handler 52% 0
domain.ingest.model 100% 0
domain.ingest.port 100% 0
domain.ingest.service 76% 0
domain.metadata 100% 0
domain.metadata.event 100% 0
domain.metadata.handler 100% 0
domain.metadata.model 0% 0
domain.metadata.port 100% 0
domain.metadata.service 93% 0
domain.metadata.util 100% 0
domain.metadata.util.di 100% 0
domain.record 100% 0
domain.record.adapter 100% 0
domain.record.command 100% 0
domain.record.event 100% 0
domain.record.handler 100% 0
domain.record.model 100% 0
domain.record.port 100% 0
domain.record.query 100% 0
domain.record.service 66% 0
domain.search 100% 0
domain.search.adapter 100% 0
domain.search.command 100% 0
domain.search.event 100% 0
domain.search.model 100% 0
domain.search.port 100% 0
domain.search.query 100% 0
domain.search.service 100% 0
domain.semantics 100% 0
domain.semantics.command 94% 0
domain.semantics.event 100% 0
domain.semantics.handler 100% 0
domain.semantics.model 100% 0
domain.semantics.port 100% 0
domain.semantics.query 90% 0
domain.semantics.service 100% 0
domain.semantics.util 100% 0
domain.semantics.util.di 93% 0
domain.shared 94% 0
domain.shared.authorization 87% 0
domain.shared.model 93% 0
domain.shared.port 100% 0
domain.validation 100% 0
domain.validation.adapter 100% 0
domain.validation.command 0% 0
domain.validation.event 100% 0
domain.validation.handler 90% 0
domain.validation.model 91% 0
domain.validation.port 100% 0
domain.validation.query 100% 0
domain.validation.service 90% 0
domain.validation.util.di 93% 0
infrastructure 80% 0
infrastructure.auth 56% 0
infrastructure.event 75% 0
infrastructure.http 92% 0
infrastructure.index 82% 0
infrastructure.ingest 83% 0
infrastructure.k8s 76% 0
infrastructure.messaging 100% 0
infrastructure.oci 56% 0
infrastructure.persistence 68% 0
infrastructure.persistence.adapter 52% 0
infrastructure.persistence.mappers 62% 0
infrastructure.persistence.repository 31% 0
infrastructure.s3 26% 0
infrastructure.storage 100% 0
sdk 100% 0
sdk.index 100% 0
util 100% 0
util.di 77% 0
Summary 75% (6956 / 9273) 0

@rorybyrne
Contributor Author

@greptile

@greptile-apps
Contributor

greptile-apps Bot commented Apr 22, 2026

Greptile Summary

This PR delivers three tightly coupled features: typed metadata.<slug>_v<major> tables generated from FieldDefinition[], a compound FilterExpr DSL (nested AND/OR/NOT + predicates) replacing the flat filter list, and synchronous dual-write inside RecordService so both records (JSONB) and the typed table land in one transaction. The three previously-flagged issues — DDL injection, TOCTOU race on ensure_table, and date-string ValueError surfacing as 500 — are all addressed.

  • P1: In 076_add_feature_tables_record_srn_fks.py, FK_NAME_TEMPLATE = "fk_features_{hook}_record_srn" has 23 chars of fixed overhead; hook names of 41–63 chars produce an FK name exceeding PostgreSQL's 63-char identifier limit, causing _safe_ident to throw ValueError and abort the migration.

Confidence Score: 4/5

Safe to merge after addressing the FK name overflow in the feature-FKs migration; all other findings are P2.

One P1 issue: the FK constraint name template can exceed PostgreSQL's 63-char identifier limit for hooks with names > 40 chars, causing _safe_ident to raise ValueError and abort the migration. The three previously-reported P1/P0 issues (DDL injection, TOCTOU race, date ValueError) are all addressed. Remaining findings are P2.

server/migrations/versions/076_add_feature_tables_record_srn_fks.py — FK name length overflow for long hook names

Important Files Changed

Filename Overview
server/migrations/versions/076_add_feature_tables_record_srn_fks.py Adds ON DELETE CASCADE FKs to feature tables; fk_name generated from FK_NAME_TEMPLATE can exceed 63-char PG identifier limit for hook names > 40 chars, causing _safe_ident to throw ValueError and abort migration (P1)
server/osa/infrastructure/persistence/adapter/discovery.py Core query compiler for compound FilterExpr → SQL; correctly handles Not/NEQ outer-join NULL semantics via coalesce/or-is-null; IN on outer-joined columns has inconsistent NULL treatment (P2)
server/osa/infrastructure/persistence/metadata_store.py New typed metadata DDL + DML store; addresses DDL injection (pg_ident validation), TOCTOU race (advisory lock), and date coercion errors (wrapped as ValidationError); additive-evolution contract is well-enforced
server/osa/domain/discovery/service/discovery.py Compound FilterExpr validation: tree-depth, predicate-count, cross-domain join limits, and operator-type compatibility are all enforced; allow_compound flag defaults to True making the staged gate a no-op (P2)
server/osa/domain/record/service/record.py Synchronous dual-write (records + typed metadata table) inside bulk_publish/publish_record; schema_id resolved once per convention batch, written atomically
server/osa/infrastructure/persistence/metadata_table.py New helper for dynamic metadata Table objects; mirrors feature_table pattern; PG identifier length checked at boundary; record_srn carries ON DELETE CASCADE FK
server/osa/domain/discovery/model/value.py FilterExpr discriminated union (And/Or/Not/Predicate) with Pydantic; new operators (NEQ, GT, LT, IN, IS_NULL); VALID_OPERATORS and JSON_TYPE_OPERATORS expanded; model_rebuild() called for forward references
server/migrations/versions/076_schemas_to_id.py Splits schemas.srn into (id, version) PK and conventions.schema_srn into (schema_id, schema_version); documented greenfield-only with no backfill; correctly fails on populated DBs at SET NOT NULL

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[POST /discovery/records\nRecordSearchRequest] -->|schema, filter, q| B[_parse_schema]
    B --> C[DiscoveryService.search_records]
    C --> D{schema_id?}
    D -->|yes| E[field_reader.get_fields_for_schema]
    D -->|no| F[skip typed table]
    E --> G[_validate_tree\ndepth / predicates / joins]
    G --> H[_validate_refs\nfield + operator checks]
    H --> I[DiscoveryReadStore.search_records]
    F --> I
    I --> J{schema_id?}
    J -->|yes| K[_metadata_catalog_for\nbuild_metadata_table]
    J -->|no| L[JSONB projection\nrecords.metadata]
    K --> M[_collect_feature_joins\nbuild_feature_table per hook]
    M --> N[_compile_filter_for_records\nAnd/Or/Not/Predicate → SQL]
    N --> O[INNER JOIN metadata table\n+ OUTER JOIN feature tables]
    L --> P[no typed table join\ncanonical JSONB path]
    O --> Q[keyset paginated result]
    P --> Q

    R[RecordService.bulk_publish] --> S[_resolve_schema_id\nconvention → schema_id]
    S --> T[record_repo.save_many\nINSERT records]
    T --> U[metadata_service.insert_many\ntyped table dual-write]
    U --> V{same transaction}
    T --- V
```

Reviews (5): Last reviewed commit: "feat: improve error handling for date/da…"

Comment on lines +34 to +48
```python
op.execute(
    """
    UPDATE records r
    SET
        schema_id = c.schema_id,
        schema_version = c.schema_version
    FROM conventions c
    WHERE c.srn = r.convention_srn
      AND r.schema_id IS NULL
    """
)

op.alter_column("records", "schema_id", nullable=False)
op.alter_column("records", "schema_version", nullable=False)
op.create_index("idx_records_schema_id", "records", ["schema_id"])
```

P1 Migration can fail on orphaned records

The backfill UPDATE records … FROM conventions WHERE c.srn = r.convention_srn leaves schema_id NULL for any record whose convention_srn no longer exists in the conventions table (e.g. records from deleted conventions). The immediately following op.alter_column("records", "schema_id", nullable=False) will then raise NOT NULL violation and abort the migration.

A safer approach is to add a pre-flight assertion or a fallback default so rows that fail the join don't silently remain NULL when the constraint is applied.
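
One shape such a pre-flight assertion could take, sketched against sqlite3 for brevity; the table names come from the migration, everything else (data, the check itself) is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE conventions (srn TEXT PRIMARY KEY, schema_id TEXT, schema_version TEXT);
CREATE TABLE records (srn TEXT PRIMARY KEY, convention_srn TEXT, schema_id TEXT);
INSERT INTO conventions VALUES ('srn:conv:1', 'pdb-structure', '1.0.0');
INSERT INTO records VALUES ('srn:rec:1', 'srn:conv:1', NULL);
INSERT INTO records VALUES ('srn:rec:2', 'srn:conv:gone', NULL);  -- orphan
""")

# Pre-flight: records whose convention no longer exists would stay NULL
# after the backfill and fail at SET NOT NULL; count them up front.
orphans = conn.execute("""
    SELECT COUNT(*) FROM records r
    WHERE NOT EXISTS (
        SELECT 1 FROM conventions c WHERE c.srn = r.convention_srn
    )
""").fetchone()[0]
if orphans:
    # In a real migration this would raise and abort with a clear message.
    problem = f"{orphans} record(s) reference deleted conventions"
```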

…d-only deployment

Remove backfill logic from schema migration scripts to support
greenfield deployments only. Migrations now fail with clear
constraint errors on populated databases, signaling data predates
the new schema structure.

feat: add SQL injection protection for metadata store DDL operations

Add _safe_ident function to validate PostgreSQL identifiers before
interpolation into raw DDL statements to prevent SQL injection
attacks through malicious column names.

feat: add advisory lock for concurrent metadata table creation

Implement pg_advisory_xact_lock to serialize concurrent ensure_table
calls for the same schema, preventing race conditions that could
cause DuplicateTable errors during simultaneous convention
registration.

fix: restore metadata schema creation in integration test cleanup

Re-create empty metadata and features schemas after DROP CASCADE
in test cleanup to maintain production migration invariants.
@rorybyrne
Contributor Author

@greptile

Comment on lines +471 to +473
```python
    feature_tables_table.c.feature_schema,
).where(feature_tables_table.c.hook_name.in_(hooks))
result = await self.session.execute(stmt)
```

P1 INNER JOIN silently excludes pre-migration records from schema-pinned queries

search_records uses a regular (inner) join when a schema is pinned:

```python
stmt = select(*select_cols).select_from(
    t.join(metadata_table, metadata_table.c.record_srn == t.c.srn)
)
```

Migration 076_add_records_schema_srn.py backfills records.schema_id for all existing records, which means callers can legitimately construct a valid schema_id for pre-existing data. However, since metadata.<slug>_v<major> rows are only written on dual-write publication (post-deploy), those records have no row in the typed table and are silently dropped by the inner join. A schema-pinned query against any pre-migration data returns an empty result set with no indication of why.

If this exclusion is intentional (fresh-ingest-only semantic), it should be documented in the migration and surfaced as a 0-count warning to callers. If historical records are meant to be reachable, the migration needs a backfill pass into the typed tables.
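
The exclusion is easy to reproduce in miniature, with sqlite3 standing in for PostgreSQL; table names follow the PR's naming, the data is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE records (srn TEXT PRIMARY KEY, metadata TEXT);
CREATE TABLE metadata_pdb_structure_v1 (record_srn TEXT PRIMARY KEY, pdb_id TEXT);
INSERT INTO records VALUES ('old', '{"pdb_id": "1OLD"}');  -- pre-migration: JSONB only
INSERT INTO records VALUES ('new', '{"pdb_id": "1NEW"}');
INSERT INTO metadata_pdb_structure_v1 VALUES ('new', '1NEW');  -- dual-written post-deploy
""")

# Schema-pinned path: inner join to the typed table.
pinned = conn.execute("""
    SELECT r.srn FROM records r
    JOIN metadata_pdb_structure_v1 m ON m.record_srn = r.srn
""").fetchall()
# 'old' vanishes from the schema-pinned path even though records.metadata has it
```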

Add regex validation to prevent SQL injection attacks when
interpolating hook names into DDL statements. Validates identifiers
match PostgreSQL naming conventions before use.

feat: add schema validation for discovery operations

Raise NotFoundError when pinned schema is not found instead of
silently falling back to unscoped queries that could return
cross-schema records.

docs: clarify field definition reader behavior

Update docstring to explain that callers must handle empty results
and raise NotFoundError themselves when treating unknown schemas
as error conditions.
@rorybyrne
Contributor Author

@greptile

Comment on lines +262 to +263
```python
for hook, ft in feature_joins.items():
    stmt = stmt.join(ft, ft.c.record_srn == t.c.srn, isouter=True)
```

P1 Not and NEQ on outer-joined feature columns produce silent incorrect results

Feature tables are outer-joined to records (line 263: isouter=True) so that records without a feature row still appear in unfiltered listings. But this breaks Not and NEQ predicates on feature columns at the SQL level:

  • Record with no feature row → column is NULL
  • NULL != 'foo' → NULL (falsy in WHERE) → record silently excluded
  • NOT (NULL = 'foo') → NOT NULL → NULL (falsy) → record silently excluded

A user querying NOT (features.quality.score eq 10.0) expects to receive every record that doesn't have that score — including records that have no quality feature row at all. Instead, they get only records that have a quality row with a different score.

The fix depends on desired semantics. For NEQ, generating (col <> val OR col IS NULL) restores the expected behavior. For Not, the inner predicate's column needs the same null-coalescing treatment before negation, or the feature join should switch to isouter=False only when the predicate is purely affirmative.
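
A minimal demonstration of the two NEQ semantics, with sqlite3 standing in for PostgreSQL; table and column names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE records (srn TEXT PRIMARY KEY);
CREATE TABLE features_quality (record_srn TEXT, score REAL);
INSERT INTO records VALUES ('r1'), ('r2'), ('r3');
INSERT INTO features_quality VALUES ('r1', 10.0), ('r2', 7.5);
-- r3 has no quality feature row at all
""")

def query(pred: str) -> list:
    """Run a schema-pinned-style query with the feature table outer-joined."""
    rows = conn.execute(f"""
        SELECT r.srn FROM records r
        LEFT JOIN features_quality f ON f.record_srn = r.srn
        WHERE {pred}
        ORDER BY r.srn
    """).fetchall()
    return [srn for (srn,) in rows]

naive = query("f.score <> 10.0")                     # drops r3: NULL <> 10.0 is NULL
fixed = query("f.score <> 10.0 OR f.score IS NULL")  # keeps r3
```

The naive predicate returns only `r2`; the null-coalescing form returns `r2` and `r3`, which is what a user asking "everything without score 10.0" expects.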

…cord exclusion

Wrap NOT operations with coalesce to treat NULL as FALSE before negation and modify NEQ operator to explicitly include NULL values, ensuring records with missing feature/metadata values are not silently dropped from query results.
@rorybyrne
Contributor Author

@greptile

Replace assert statements with proper exception handling to provide
clearer error messages when field reference types are unexpected,
improving debugging and error reporting.

docs: update record mapper comment to reflect synchronous metadata

Update comment to accurately describe that metadata projection is
written synchronously in the same transaction rather than
asynchronously.
Comment on lines +342 to +348
```python
def _column_type_sql(sa_type: Any) -> str:
    if isinstance(sa_type, sa.Text):
        return "text"
    if isinstance(sa_type, sa.DateTime):
        return "timestamp with time zone" if sa_type.timezone else "timestamp"
    if isinstance(sa_type, sa.Date):
        return "date"
```

P1 Bare ValueError from invalid date strings surfaces as 500

date.fromisoformat and datetime.fromisoformat raise ValueError on malformed input (e.g. "2026-99-99", "not-a-date"). That exception is not caught here, so a user submitting a record with a bad date string in a DATE-typed metadata field gets an unhandled 500 instead of a 400 ValidationError.

Suggested change

```python
# before
def _column_type_sql(sa_type: Any) -> str:
    if isinstance(sa_type, sa.Text):
        return "text"
    if isinstance(sa_type, sa.DateTime):
        return "timestamp with time zone" if sa_type.timezone else "timestamp"
    if isinstance(sa_type, sa.Date):
        return "date"
```

```python
# after
def _coerce_value(col: ColumnDef, value: Any) -> Any:
    if value is None:
        return None
    try:
        if col.json_type == "string" and col.format == "date":
            return value if isinstance(value, date) else date.fromisoformat(value)
        if col.json_type == "string" and col.format == "date-time":
            return value if isinstance(value, datetime) else datetime.fromisoformat(value)
    except (ValueError, TypeError) as exc:
        raise ValidationError(
            f"Field '{col.name}' has an invalid {col.format!r} value {value!r}: {exc}",
            field=col.name,
        ) from exc
    return value
```

…tore

Add ValidationError for malformed ISO date/datetime strings to return
400 status instead of 500, and include record SRN context in error
messages for better debugging
@rorybyrne
Contributor Author

@greptile

Comment on lines +51 to +52
```python
exists = conn.execute(_check_constraint(fk_name)).scalar()
if exists:
```

P1 FK constraint name exceeds 63-char PG identifier limit for long hook names

FK_NAME_TEMPLATE = "fk_features_{hook}_record_srn" has 23 chars of fixed overhead ("fk_features_" = 12, "_record_srn" = 11). The _PG_IDENT_RE regex allows hook names up to 63 chars, so the rendered FK name can reach 86 chars. _safe_ident(fk_name) will reject any name over 63 chars with a ValueError, aborting the migration for any hook whose name exceeds 40 characters.

```python
fk_name = _safe_ident(FK_NAME_TEMPLATE.format(hook=hook))
# e.g. hook = "computational_protein_structure_analysis_v2"  (43 chars)
# → "fk_features_computational_protein_structure_analysis_v2_record_srn"  (66 chars) → ValueError
```

Consider truncating or hashing the hook name when the full FK name would overflow, using a stable digest (the builtin hash() varies across processes under PYTHONHASHSEED, so derived identifier names would not be reproducible between runs):

```python
import hashlib

MAX_HOOK_IN_FK = 63 - len("fk_features__record_srn")  # = 40
digest = hashlib.sha1(hook.encode()).hexdigest()[:4]
safe_hook = hook if len(hook) <= MAX_HOOK_IN_FK else f"{hook[:35]}_{digest}"  # 35 + 1 + 4 = 40
fk_name = f"fk_features_{safe_hook}_record_srn"
```

Introduce HookName type to enforce 40-character limit on hook names,
ensuring derived identifiers like fk_features_{name}_record_srn stay
within PostgreSQL's 63-character identifier limit. Update HookDefinition
to use HookName instead of PgIdentifier while keeping ColumnDef unchanged.
@rorybyrne rorybyrne merged commit a747fc1 into main Apr 23, 2026
9 checks passed

Development

Successfully merging this pull request may close these issues:

feat: typed metadata tables + expressive REST query DSL