feat: typed metadata tables + compound discovery DSL (#76) #128
Conversation
Implement feature 076 with compound FilterExpr trees (And/Or/Not/Predicate) replacing flat Filter lists, typed field references (`metadata.field`, `features.hook.column`), and per-schema metadata tables for optimized queries. Add MetadataStore port with PostgreSQL implementation for schema-keyed DDL lifecycle, EnsureMetadataTable/InsertRecordMetadata event handlers, and discovery adapter support for cross-domain joins with configurable bounds.

test: add comprehensive integration and unit tests for metadata system
Add extensive test coverage for metadata table management, discovery filtering, and schema evolution, including:
- Cross-domain JOIN tests between records and features
- Typed metadata table creation and evolution validation
- Filter expression parsing and validation with AND/OR/NOT support
- Metadata service integration with PostgreSQL backend
- Schema evolution bounds checking and non-additive rejection
- Event handler tests for metadata table lifecycle
- Field reference resolution and cursor encoding tests
Replace URN-based schema references with short-form `(id, version)` pairs to simplify internal operations while reserving full URNs for federation edges. Key changes:
- Replace `SchemaSRN` with `SchemaId` in domain models and services
- Update database schema: `schemas.srn` → `schemas.id` + `schemas.version`
- Update `conventions.schema_srn` → `conventions.schema_id` + `conventions.schema_version`
- Update `records.schema_srn` → `records.schema_id` + `records.schema_version`
- Add `SchemaIdentifier` type for human-readable schema slugs
- Update metadata table catalog to use `schema_id` instead of `schema_identity`
- Require pinned schema for metadata filtering (remove JSONB fallback)
- Add schema ID validation and coercion in API layer
- Update discovery queries to work with short-form schema references

refactor: replace SchemaSRN with SchemaId across domain models
Replace SchemaSRN with a simplified SchemaId that contains only id and version, without the full URN structure. Update all references in models, services, tests, and handlers to use the new SchemaId format.
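For orientation, a value object of this shape could look roughly like the following — a minimal sketch assuming the `<id>@<semver>` string form described in the summary below; the PR's actual `SchemaId` definition is not shown on this page:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SchemaId:
    """Hypothetical short-form schema reference: "<id>@<semver>"."""
    id: str
    version: str

    @classmethod
    def parse(cls, raw: str) -> "SchemaId":
        # "pdb-structure@1.0.0" -> SchemaId(id="pdb-structure", version="1.0.0")
        schema_id, sep, version = raw.partition("@")
        if not sep or not schema_id or not version:
            raise ValueError(f"expected '<id>@<semver>', got {raw!r}")
        return cls(schema_id, version)

    def __str__(self) -> str:
        return f"{self.id}@{self.version}"
```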
Move metadata table creation and row insertion from event handlers to inline operations within ConventionService and RecordService to eliminate race conditions and ensure atomicity between canonical records and typed metadata tables.
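In outline, the inlined dual-write has this shape — a sketch using the `save_many`/`insert_many` names from the flowchart below, with the session/transaction details assumed rather than taken from the PR:

```python
async def bulk_publish(self, records: list["Record"]) -> None:
    # One transaction covers both writes: if the typed-table insert fails,
    # the canonical JSONB rows roll back too, so the two projections can
    # never disagree. (Sketch only; the PR's actual method bodies differ.)
    async with self.session.begin():
        await self.record_repo.save_many(records)         # canonical records (JSONB)
        await self.metadata_service.insert_many(records)  # typed metadata table
```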
@greptile
Greptile Summary

This PR delivers three tightly coupled features: typed per-schema metadata tables, a compound FilterExpr discovery DSL, and short-form `(id, version)` schema references.
Confidence Score: 4/5

Safe to merge after addressing the FK name overflow in the feature-FKs migration. One P1 issue: the FK constraint name template can exceed PostgreSQL's 63-char identifier limit for hooks with names > 40 chars, causing _safe_ident to raise ValueError and abort the migration. The three previously-reported P1/P0 issues (DDL injection, TOCTOU race, date ValueError) are all addressed. All remaining findings are P2.

server/migrations/versions/076_add_feature_tables_record_srn_fks.py — FK name length overflow for long hook names
| Filename | Overview |
|---|---|
| server/migrations/versions/076_add_feature_tables_record_srn_fks.py | Adds ON DELETE CASCADE FKs to feature tables; fk_name generated from FK_NAME_TEMPLATE can exceed 63-char PG identifier limit for hook names > 40 chars, causing _safe_ident to throw ValueError and abort migration (P1) |
| server/osa/infrastructure/persistence/adapter/discovery.py | Core query compiler for compound FilterExpr → SQL; correctly handles Not/NEQ outer-join NULL semantics via coalesce/or-is-null; IN on outer-joined columns has inconsistent NULL treatment (P2) |
| server/osa/infrastructure/persistence/metadata_store.py | New typed metadata DDL + DML store; addresses DDL injection (pg_ident validation), TOCTOU race (advisory lock), and date coercion errors (wrapped as ValidationError); additive-evolution contract is well-enforced |
| server/osa/domain/discovery/service/discovery.py | Compound FilterExpr validation: tree-depth, predicate-count, cross-domain join limits, and operator-type compatibility are all enforced; allow_compound flag defaults to True making the staged gate a no-op (P2) |
| server/osa/domain/record/service/record.py | Synchronous dual-write (records + typed metadata table) inside bulk_publish/publish_record; schema_id resolved once per convention batch, written atomically |
| server/osa/infrastructure/persistence/metadata_table.py | New helper for dynamic metadata Table objects; mirrors feature_table pattern; PG identifier length checked at boundary; record_srn carries ON DELETE CASCADE FK |
| server/osa/domain/discovery/model/value.py | FilterExpr discriminated union (And/Or/Not/Predicate) with Pydantic; new operators (NEQ, GT, LT, IN, IS_NULL); VALID_OPERATORS and JSON_TYPE_OPERATORS expanded; model_rebuild() called for forward references |
| server/migrations/versions/076_schemas_to_id.py | Splits schemas.srn into (id, version) PK and conventions.schema_srn into (schema_id, schema_version); documented greenfield-only with no backfill; correctly fails on populated DBs at SET NOT NULL |
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[POST /discovery/records\nRecordSearchRequest] -->|schema, filter, q| B[_parse_schema]
    B --> C[DiscoveryService.search_records]
    C --> D{schema_id?}
    D -->|yes| E[field_reader.get_fields_for_schema]
    D -->|no| F[skip typed table]
    E --> G[_validate_tree\ndepth / predicates / joins]
    G --> H[_validate_refs\nfield + operator checks]
    H --> I[DiscoveryReadStore.search_records]
    F --> I
    I --> J{schema_id?}
    J -->|yes| K[_metadata_catalog_for\nbuild_metadata_table]
    J -->|no| L[JSONB projection\nrecords.metadata]
    K --> M[_collect_feature_joins\nbuild_feature_table per hook]
    M --> N[_compile_filter_for_records\nAnd/Or/Not/Predicate → SQL]
    N --> O[INNER JOIN metadata table\n+ OUTER JOIN feature tables]
    L --> P[no typed table join\ncanonical JSONB path]
    O --> Q[keyset paginated result]
    P --> Q
    R[RecordService.bulk_publish] --> S[_resolve_schema_id\nconvention → schema_id]
    S --> T[record_repo.save_many\nINSERT records]
    T --> U[metadata_service.insert_many\ntyped table dual-write]
    U --> V{same transaction}
    T --- V
```
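To make the top path concrete, a schema-pinned search might carry a compound filter like the one below. The wire format here is an assumption (the field references and `eq` operator are taken from the review comments and test plan on this page), not the PR's documented serialization:

```python
# Hypothetical POST /discovery/records body exercising And/Not/Predicate.
search_request = {
    "schema": "pdb-structure@1.0.0",
    "filter": {
        "and": [
            {"field": "metadata.pdb_id", "op": "eq", "value": "1ABC"},
            {"not": {"field": "features.quality.score", "op": "eq", "value": 10.0}},
        ]
    },
}
```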
Reviews (5): Last reviewed commit: "feat: improve error handling for date/da..."
```python
op.execute(
    """
    UPDATE records r
    SET
        schema_id = c.schema_id,
        schema_version = c.schema_version
    FROM conventions c
    WHERE c.srn = r.convention_srn
      AND r.schema_id IS NULL
    """
)

op.alter_column("records", "schema_id", nullable=False)
op.alter_column("records", "schema_version", nullable=False)
op.create_index("idx_records_schema_id", "records", ["schema_id"])
```
Migration can fail on orphaned records
The backfill `UPDATE records … FROM conventions WHERE c.srn = r.convention_srn` leaves `schema_id` NULL for any record whose `convention_srn` no longer exists in the conventions table (e.g. records from deleted conventions). The immediately following `op.alter_column("records", "schema_id", nullable=False)` will then raise a NOT NULL violation and abort the migration.
A safer approach is to add a pre-flight assertion or a fallback default so rows that fail the join don't silently remain NULL when the constraint is applied.
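A minimal sketch of such a pre-flight assertion, assuming the migration's usual alembic/sqlalchemy imports (the query and error message are illustrative, not from the PR):

```python
import sqlalchemy as sa
from alembic import op

# Fail fast with a clear message if any record references a deleted
# convention, instead of hitting the NOT NULL violation mid-migration.
conn = op.get_bind()
orphans = conn.execute(
    sa.text(
        "SELECT count(*) FROM records r "
        "LEFT JOIN conventions c ON c.srn = r.convention_srn "
        "WHERE c.srn IS NULL"
    )
).scalar()
if orphans:
    raise RuntimeError(
        f"{orphans} record(s) reference missing conventions; "
        "resolve or delete them before applying the NOT NULL constraint"
    )
```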
…d-only deployment
Remove backfill logic from schema migration scripts to support greenfield deployments only. Migrations now fail with clear constraint errors on populated databases, signaling data predates the new schema structure.

feat: add SQL injection protection for metadata store DDL operations
Add _safe_ident function to validate PostgreSQL identifiers before interpolation into raw DDL statements to prevent SQL injection attacks through malicious column names.

feat: add advisory lock for concurrent metadata table creation
Implement pg_advisory_xact_lock to serialize concurrent ensure_table calls for the same schema, preventing race conditions that could cause DuplicateTable errors during simultaneous convention registration.

fix: restore metadata schema creation in integration test cleanup
Re-create empty metadata and features schemas after DROP CASCADE in test cleanup to maintain production migration invariants.
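The advisory-lock pattern from the third commit above looks roughly like this — a sketch where the lock-key derivation (crc32 of the schema id) is an assumption; only `pg_advisory_xact_lock` itself is named in the commit:

```python
import zlib
import sqlalchemy as sa

async def ensure_table_serialized(conn, schema_id: str) -> None:
    # Transaction-scoped advisory lock: concurrent ensure_table calls for the
    # same schema queue up here instead of racing the CREATE TABLE, and the
    # lock is released automatically at commit/rollback.
    lock_key = zlib.crc32(schema_id.encode())
    await conn.execute(
        sa.text("SELECT pg_advisory_xact_lock(:key)"), {"key": lock_key}
    )
    # ...check the metadata_tables catalog and CREATE TABLE if absent...
```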
@greptile
```python
    feature_tables_table.c.feature_schema,
).where(feature_tables_table.c.hook_name.in_(hooks))
result = await self.session.execute(stmt)
```
INNER JOIN silently excludes pre-migration records from schema-pinned queries
search_records uses a regular (inner) join when a schema is pinned:

```python
stmt = select(*select_cols).select_from(
    t.join(metadata_table, metadata_table.c.record_srn == t.c.srn)
)
```

Migration 076_add_records_schema_srn.py backfills records.schema_id for all existing records, which means callers can legitimately construct a valid schema_id for pre-existing data. However, since metadata.<slug>_v<major> rows are only written on dual-write publication (post-deploy), those records have no row in the typed table and are silently dropped by the inner join. A schema-pinned query against any pre-migration data returns an empty result set with no indication of why.

If this exclusion is intentional (fresh-ingest-only semantics), it should be documented in the migration and surfaced as a 0-count warning to callers. If historical records are meant to be reachable, the migration needs a backfill pass into the typed tables.
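If the backfill route is taken, one pass per typed table along these lines could work — a sketch where the column list is schema-specific and `ON CONFLICT` assumes `record_srn` is unique in the typed table:

```python
from alembic import op

# Hypothetical backfill of one typed table from the canonical JSONB column,
# following the PR's metadata.<slug>_v<major> naming scheme.
op.execute(
    """
    INSERT INTO metadata.pdb_structure_v1 (record_srn, pdb_id)
    SELECT r.srn, r.metadata->>'pdb_id'
    FROM records r
    WHERE r.schema_id = 'pdb-structure'
      AND r.schema_version = '1.0.0'
    ON CONFLICT (record_srn) DO NOTHING
    """
)
```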
Add regex validation to prevent SQL injection attacks when interpolating hook names into DDL statements. Validates identifiers match PostgreSQL naming conventions before use.

feat: add schema validation for discovery operations
Raise NotFoundError when pinned schema is not found instead of silently falling back to unscoped queries that could return cross-schema records.

docs: clarify field definition reader behavior
Update docstring to explain that callers must handle empty results and raise NotFoundError themselves when treating unknown schemas as error conditions.
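The guard described in the second commit plausibly reduces to a few lines — a sketch reusing `field_reader.get_fields_for_schema` from the flowchart; the exact error type and message are assumptions:

```python
# Inside DiscoveryService.search_records (sketch): treat an unknown pinned
# schema as an error rather than silently widening the query.
# NotFoundError is the codebase's domain error type; import path omitted.
fields = await self.field_reader.get_fields_for_schema(schema_id)
if not fields:
    raise NotFoundError(f"schema {schema_id} is not registered")
```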
@greptile
```python
for hook, ft in feature_joins.items():
    stmt = stmt.join(ft, ft.c.record_srn == t.c.srn, isouter=True)
```
Not and NEQ on outer-joined feature columns produce silently incorrect results
Feature tables are outer-joined to records (line 263: isouter=True) so that records without a feature row still appear in unfiltered listings. But this breaks Not and NEQ predicates on feature columns at the SQL level:
- Record with no feature row → column is NULL
- `NULL != 'foo'` → `NULL` (falsy in `WHERE`) → record silently excluded
- `NOT (NULL = 'foo')` → `NOT NULL` → `NULL` (falsy) → record silently excluded
A user querying NOT (features.quality.score eq 10.0) expects to receive every record that doesn't have that score — including records that have no quality feature row at all. Instead, they get only records that have a quality row with a different score.
The fix depends on desired semantics. For NEQ, generating (col <> val OR col IS NULL) restores the expected behavior. For Not, the inner predicate's column needs the same null-coalescing treatment before negation, or the feature join should switch to isouter=False only when the predicate is purely affirmative.
…cord exclusion
Wrap NOT operations with coalesce to treat NULL as FALSE before negation and modify NEQ operator to explicitly include NULL values, ensuring records with missing feature/metadata values are not silently dropped from query results.
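In SQLAlchemy terms, that fix has roughly this shape — a sketch with hypothetical helper names; the PR's actual compiler in discovery.py structures this differently:

```python
import sqlalchemy as sa

def compile_neq(col: sa.ColumnElement, value: object) -> sa.ColumnElement:
    # NEQ must also match rows where the outer join produced NULL, otherwise
    # records without a feature row are silently dropped.
    return sa.or_(col != value, col.is_(None))

def compile_not(inner: sa.ColumnElement) -> sa.ColumnElement:
    # Coalesce NULL to FALSE before negating, so NOT(NULL) yields TRUE
    # rather than NULL (which is falsy in WHERE).
    return sa.not_(sa.func.coalesce(inner, sa.false()))
```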
@greptile
Replace assert statements with proper exception handling to provide clearer error messages when field reference types are unexpected, improving debugging and error reporting.

docs: update record mapper comment to reflect synchronous metadata
Update comment to accurately describe that metadata projection is written synchronously in the same transaction rather than asynchronously.
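For illustration, the assert→exception change might look like this — hypothetical ref types modeled on the `metadata.field` / `features.hook.column` references named at the top of this PR:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetadataFieldRef:
    field: str

@dataclass(frozen=True)
class FeatureFieldRef:
    hook: str
    column: str

def resolve_ref(ref: object) -> str:
    if isinstance(ref, MetadataFieldRef):
        return f"metadata.{ref.field}"
    if isinstance(ref, FeatureFieldRef):
        return f"features.{ref.hook}.{ref.column}"
    # clearer than a bare assert: the error names the offending type
    raise TypeError(f"unexpected field reference type: {type(ref).__name__}")
```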
```python
def _column_type_sql(sa_type: Any) -> str:
    if isinstance(sa_type, sa.Text):
        return "text"
    if isinstance(sa_type, sa.DateTime):
        return "timestamp with time zone" if sa_type.timezone else "timestamp"
    if isinstance(sa_type, sa.Date):
        return "date"
```
Bare ValueError from invalid date strings surfaces as 500

date.fromisoformat and datetime.fromisoformat raise ValueError on malformed input (e.g. "2026-99-99", "not-a-date"). That exception is not caught here, so a user submitting a record with a bad date string in a DATE-typed metadata field gets an unhandled 500 instead of a 400 ValidationError.
Suggested change:

```python
def _column_type_sql(sa_type: Any) -> str:
    if isinstance(sa_type, sa.Text):
        return "text"
    if isinstance(sa_type, sa.DateTime):
        return "timestamp with time zone" if sa_type.timezone else "timestamp"
    if isinstance(sa_type, sa.Date):
        return "date"


def _coerce_value(col: ColumnDef, value: Any) -> Any:
    if value is None:
        return None
    try:
        if col.json_type == "string" and col.format == "date":
            return value if isinstance(value, date) else date.fromisoformat(value)
        if col.json_type == "string" and col.format == "date-time":
            return value if isinstance(value, datetime) else datetime.fromisoformat(value)
    except (ValueError, TypeError) as exc:
        raise ValidationError(
            f"Field '{col.name}' has an invalid {col.format!r} value {value!r}: {exc}",
            field=col.name,
        ) from exc
    return value
```
…tore
Add ValidationError for malformed ISO date/datetime strings to return 400 status instead of 500, and include record SRN context in error messages for better debugging.
@greptile
```python
exists = conn.execute(_check_constraint(fk_name)).scalar()
if exists:
```
FK constraint name exceeds 63-char PG identifier limit for long hook names
FK_NAME_TEMPLATE = "fk_features_{hook}_record_srn" has 23 chars of fixed overhead ("fk_features_" = 12, "_record_srn" = 11). The _PG_IDENT_RE regex allows hook names up to 63 chars, so the rendered FK name can reach 86 chars. _safe_ident(fk_name) will reject any name over 63 chars with a ValueError, aborting the migration for any hook whose name exceeds 40 characters.
```python
fk_name = _safe_ident(FK_NAME_TEMPLATE.format(hook=hook))
# e.g. hook = "computational_protein_structure_analysis_v2" (43 chars)
# → "fk_features_computational_protein_structure_analysis_v2_record_srn" (66 chars) → ValueError
```

Consider truncating or hashing the hook name when the full FK name would overflow:

```python
import hashlib

MAX_HOOK_IN_FK = 63 - len("fk_features__record_srn")  # = 40
# use a stable digest for the suffix: the builtin hash() is salted per
# process, which would make the generated FK name non-deterministic
digest = hashlib.sha1(hook.encode()).hexdigest()[:4]
safe_hook = hook if len(hook) <= MAX_HOOK_IN_FK else hook[:35] + "_" + digest
fk_name = f"fk_features_{safe_hook}_record_srn"
```

Introduce HookName type to enforce 40-character limit on hook names,
ensuring derived identifiers like fk_features_{name}_record_srn stay
within PostgreSQL's 63-character identifier limit. Update HookDefinition
to use HookName instead of PgIdentifier while keeping ColumnDef unchanged.
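A sketch of what such a type could look like, assuming Pydantic v2's `Annotated`/`StringConstraints` style (the PR's actual definition is not shown on this page):

```python
from typing import Annotated
from pydantic import StringConstraints

# Hypothetical HookName: PostgreSQL identifier rules capped at 40 chars, so
# "fk_features_" (12) + hook + "_record_srn" (11) never exceeds 63 chars.
HookName = Annotated[str, StringConstraints(pattern=r"^[a-z_][a-z0-9_]{0,39}$")]
```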
Closes #76.
Summary
- Typed per-schema metadata tables (`metadata.<slug>_v<major>`), generated from `FieldDefinition[]` using the same dynamic-DDL pattern as feature tables.
- Compound `FilterExpr` DSL for `/discovery/*` (nested AND/OR/NOT + predicates), replacing the flat filter list.
- Short-form schema IDs (`<id>@<semver>`, e.g. `pdb-structure@1.0.0`), no server-generated UUIDs.
- Synchronous dual-write in `RecordService` — both `records` (JSONB, canonical for cross-schema reads) and `metadata.<slug>_v<major>` (typed, canonical for schema-pinned discovery) land in one transaction.

What's in this branch
Three commits:

- `8167cea` feat: add compound filter expressions and typed metadata tables — the original feature. `FilterExpr` discriminated union, `build_metadata_table()`, `metadata_tables` catalog, typed `/discovery/records` + `/discovery/features/{hook}` compilation, schema-pinned and unscoped paths.
- `a356770` refactor: replace schema SRN with short-form schema ID throughout system — internal types use `SchemaId` (`"<id>@<semver>"`); `SchemaSRN` reserved for federation edges. Schemas now require a user-supplied slug at create time (matched to `^[a-z][a-z0-9-]{2,63}$`). SDK's `MetadataSchema` subclasses enforce `__schema_id__` at class-definition time. `api_naming.py` seam added for future API↔storage divergence.
- `7041682` refactor: replace async metadata projection with synchronous dual-write — `RecordService.bulk_publish`/`publish_record` write to both `records` and the typed table in one transaction. `ConventionService.create_convention` creates the typed table inline (no race window). Deletes `EnsureMetadataTable`, `InsertRecordMetadata`, `InsertBatchMetadata` event handlers. PG type constraints become the metadata validator.

Architectural notes
- `GET /records/{srn}`, unscoped discovery listings, and federation exports all read from `records.metadata`. Typed tables serve schema-pinned queries. Both are written atomically so they can't disagree.
- A pinned schema is required for `metadata.<field>` predicates, a non-default sort, or free-text `q`. Plain listings (no filter, default sort, no `q`) may omit it.

Test plan
- Metadata system unit/integration tests (`insert_many`, dual-write, synchronous table creation, rollback-on-malformed-data)
- `ruff check` and `ty check` clean
- `osa_pdb`: fresh ingest populates both `records` and `metadata.pdb_structure_v1`
- `getRecordByPdbId` returns the expected record for newly ingested PDB IDs
- Schema-pinned discovery with `schema=pdb-structure@1.0.0` and filter by `metadata.pdb_id`

Migration
Migrations in this branch (076_*):

- `076_add_metadata_schema_and_catalog` — `metadata` PG schema + `metadata_tables` catalog + pgvector enable + FK on `features.*`
- `076_schemas_to_id` — splits `schemas.srn` into `(id, version)` PK, splits `conventions.schema_srn` into `(schema_id, schema_version)`
- `076_add_records_schema_srn` — adds `records.schema_id` + `schema_version` with backfill