feat!: data access v3#1349

Merged
ekremney merged 23 commits into main from v3-postgrest-data-access on Feb 16, 2026

@ekremney
Member

@ekremney ekremney commented Feb 13, 2026

V3 Data Access Migration: DynamoDB/ElectroDB -> Postgres/PostgREST

BREAKING CHANGE

Summary

This PR moves @adobe/spacecat-shared-data-access to a v3 Postgres/PostgREST-backed data layer while preserving the external data-access API shape for consumers.

It removes DynamoDB-backed integration coverage and replaces it with a PostgREST integration harness using mysticat-data-service.

Why

  • V2 is DynamoDB/ElectroDB-based.
  • V3 target datastore is Postgres via PostgREST.
  • Consumers should keep calling the same data-access APIs (collection/model methods), while transport/storage moves to PostgREST.

What Changed

Core data-access migration

  • Added PostgREST service/client wiring via @supabase/postgrest-js.
  • Reworked base collection/model plumbing for PostgREST query/CRUD paths.
  • Added DB/model field conversion utilities in src/util/postgrest.utils.js:
    • camelCase -> snake_case for writes/filters
    • snake_case -> camelCase for reads
    • explicit handling for id aliasing (e.g. siteId <-> id)
  • Added/updated unit tests for:
    • PostgREST utility mapping
    • base collection PostgREST behavior
    • service initialization and configuration
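
A minimal sketch of the field conversion described above, assuming plain-object records. The helper names (`camelToSnake`, `snakeToCamel`, `toDbRecord`, `fromDbRecord`) are illustrative and are not claimed to match the actual exports of src/util/postgrest.utils.js.

```javascript
// Hypothetical helpers for illustration; the real utilities may differ
// in name and behavior.

// camelCase -> snake_case; a trailing acronym stays one segment
// ("baseURL" -> "base_url").
function camelToSnake(name) {
  return name
    .replace(/([a-z0-9])([A-Z])/g, '$1_$2')
    .replace(/([A-Z]+)([A-Z][a-z])/g, '$1_$2')
    .toLowerCase();
}

// snake_case -> camelCase ("audited_at" -> "auditedAt").
function snakeToCamel(name) {
  return name.replace(/_([a-z0-9])/g, (_, char) => char.toUpperCase());
}

// Model -> DB row, used for writes and filters.
// `idField` is the model-side alias of the primary key, e.g. "siteId".
function toDbRecord(model, idField) {
  return Object.fromEntries(
    Object.entries(model).map(([key, value]) => [
      key === idField ? 'id' : camelToSnake(key),
      value,
    ]),
  );
}

// DB row -> model, used for reads.
function fromDbRecord(row, idField) {
  return Object.fromEntries(
    Object.entries(row).map(([column, value]) => [
      column === 'id' ? idField : snakeToCamel(column),
      value,
    ]),
  );
}
```

For example, `toDbRecord({ siteId: 's1', deliveryType: 'aem_edge' }, 'siteId')` yields a row keyed by `id` and `delivery_type`, and `fromDbRecord` reverses that mapping.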

V3 special-case behavior

  • KeyEvent: deprecated in v3, collection methods throw a deprecation error.
  • LatestAudit: no dedicated table in Postgres; implemented as virtual behavior derived from Audit query/grouping logic.
  • Configuration: remains S3-backed (unchanged behavior).
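
The "virtual" LatestAudit behavior can be pictured as a group-by over Audit rows that keeps the newest row per (site, audit type) pair. The sketch below is an assumption about the general shape only: `groupLatest` is a stand-in name, and the field names follow the camelCase model fields mentioned in this PR.

```javascript
// Hypothetical in-memory grouping; not the package's actual implementation.
function groupLatest(audits) {
  const latest = new Map();
  for (const audit of audits) {
    const key = `${audit.siteId}#${audit.auditType}`;
    const current = latest.get(key);
    // ISO-8601 UTC timestamps in a uniform format compare correctly as strings.
    if (!current || audit.auditedAt > current.auditedAt) {
      latest.set(key, audit);
    }
  }
  return [...latest.values()];
}
```

Note that this approach needs every candidate audit row in memory before it can pick the latest one, so its cost grows with the number of rows fetched.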

Integration tests replacement

  • Removed legacy DynamoDB IT suite under test/it.
  • Added PostgREST IT harness:
    • test/it/postgrest/docker-compose.yml
    • test/it/fixtures.js global setup/teardown to:
      1. start Postgres
      2. run dbmate up using mysticat-data-service
      3. seed tenant SQL fixtures
      4. start PostgREST from mysticat-data-service
  • Added new IT smoke tests:
    • camel/snake field translation validation (base_url -> baseURL)
    • LatestAudit virtual behavior
    • KeyEvent deprecation behavior
  • Added anonymized tenant seed fixture:
    • test/it/seed/tenants/01_tenant_alpha.sql

Image/Runtime Notes

  • IT compose defaults to latest known release tag from mysticat-data-service:
    • 682033462621.dkr.ecr.us-east-1.amazonaws.com/mysticat-data-service:v1.7.1
  • Override with env var if needed:
    • MYSTICAT_DATA_SERVICE_IMAGE=<your-image:tag>

Validation

  • Unit tests updated for new PostgREST behavior.
  • Integration tests (npm run test:it) pass locally when Docker can access required images.

Operational Notes

  • Pulling the default data-service image requires ECR auth in local environments.
  • For local runs:
    • aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 682033462621.dkr.ecr.us-east-1.amazonaws.com

Breaking/Behavioral Notes

  • This is v3-oriented work; no ElectroDB compatibility layer is kept.
  • KeyEvent APIs now intentionally error as deprecated.
  • LatestAudit is now query-derived from audits instead of persisted as its own table.

Follow-ups (if needed)

  • Add ECR login preflight check in IT bootstrap for clearer failures.
  • Expand PostgREST IT coverage beyond smoke tests as v3 rollout stabilizes.

@ekremney ekremney requested a review from solaris007 as a code owner February 13, 2026 15:49
@ekremney ekremney force-pushed the v3-postgrest-data-access branch from 576935d to 52d8ebd Compare February 13, 2026 15:54
@github-actions

This PR will trigger a minor release when merged.

@ekremney ekremney changed the title V3 postgrest data access feat! v3 postgrest data access Feb 13, 2026
@ekremney ekremney changed the title feat! v3 postgrest data access feat!: v3 postgrest data access Feb 13, 2026
@ekremney ekremney changed the title feat!: v3 postgrest data access feat!: data access v3 Feb 13, 2026
@solaris007
Member

solaris007 commented Feb 13, 2026

Hey @ekremney :)

Context applied: v3 is a temporary bridge layer (v4 will be PostgREST-native). Aurora is VPC-internal with Vault-managed credentials. No auth middleware exists by design. Proper indexes exist on the audits table.

Note: Several findings from our initial review were addressed by commits pushed after a205b71d (DynamoDB dep removal, expanded IT coverage, parameterized ECR image). The findings below reflect current state at b23a2d1.


1. CRITICAL: LatestAudit fetchAllPages bypasses existing database indexes

File: latest-audit.collection.js - #allAuditsByKeys

LatestAuditCollection calls auditCollection.allByIndexKeys() with fetchAllPages: true, pulling every audit row through PostgREST pagination, deserializing into JS objects, then grouping in-memory via #groupLatest. The audits table has indexes designed for exactly this query pattern (idx_audits_site_type_time on (site_id, audit_type, audited_at DESC)), but they are completely bypassed.

This is not a gradual degradation - it works until the table crosses a size threshold, then OOMs or times out.

Remediation: Create a latest_audits view in mysticat-data-service:

CREATE VIEW latest_audits AS
SELECT DISTINCT ON (site_id, audit_type) *
FROM audits
ORDER BY site_id, audit_type, audited_at DESC;

The existing indexes fully support this. The team already has materialized views (brand_presence) showing this is a known pattern.
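
If such a view were created and exposed through PostgREST, the collection could read one row per (site, audit type) in a single request. The sketch below assumes `client` is a `PostgrestClient` from @supabase/postgrest-js (`from`, `select`, and `eq` are its query-builder methods); `queryLatestAudits` is a hypothetical helper, and the `latest_audits` view is only the one proposed above.

```javascript
// Sketch only: assumes the proposed latest_audits view exists and is
// exposed by PostgREST. `client` is expected to be a PostgrestClient.
function queryLatestAudits(client, { siteId, auditType } = {}) {
  let query = client.from('latest_audits').select('*');
  if (siteId) query = query.eq('site_id', siteId);
  if (auditType) query = query.eq('audit_type', auditType);
  // The builder is thenable: awaiting the returned value sends the request.
  return query;
}
```

A caller would `await queryLatestAudits(client, { siteId })`; the deduplication then happens in Postgres rather than in the Node process.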

Recommendation: Block the merge, or commit to a fast-follow tracked in a Jira ticket, before v3 hits production traffic.


2. IMPORTANT: N+1 HTTP patterns in batch operations

Three methods use Promise.all(keys.map(...)) issuing N individual PostgREST HTTP requests where batch operations are available:

  • batchGetByKeys (base.collection.js) - N individual findByIndexKeys calls. PostgREST supports .in() filter for single-request batch (already used correctly in removeByIds).
  • _saveMany - N individual updateByKeys calls. PostgREST supports bulk PATCH.
  • removeByIndexKeys - N individual DELETE queries.

Works fine with small key sets in testing. Degrades linearly with 100+ keys in production (connection pool pressure on Aurora, latency multiplication).

Recommendation: Non-blocking, but address before production scale.
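
To illustrate the single-request alternative, here is a rough sketch that fetches many rows by id with one `.in()` filter. `batchGetByIds` is a hypothetical helper, `client` is assumed to be a `PostgrestClient`, and the `id` column name is an assumption.

```javascript
// Hypothetical helper: one request for N ids instead of N requests.
// `.in(column, values)` is the postgrest-js filter for "column IN (...)".
function batchGetByIds(client, table, ids) {
  // De-duplicate so repeated keys do not inflate the filter.
  const uniqueIds = [...new Set(ids)];
  return client.from(table).select('*').in('id', uniqueIds);
}
```

Because filters travel in the request URL for reads, very large key sets may still need to be split into chunks; that part is left out of this sketch.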


3. IMPORTANT: No null guard on postgrestService

File: base.collection.js constructor

postgrestService is assigned without validation. If undefined due to config issue or init failure, collections instantiate silently and the first query throws an unhelpful TypeError deep in business logic.

Remediation: Trivial fix - if (!postgrestService) throw new DataAccessError('postgrestService is required') in the constructor.
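
A minimal sketch of that guard follows. `DataAccessError` and the constructor signature here are stand-ins so the example runs on its own; the real classes live in the package and may take different arguments.

```javascript
// Stand-in error type for illustration only.
class DataAccessError extends Error {}

class BaseCollection {
  constructor(postgrestService, log = console) {
    // Fail fast at construction time instead of on the first query.
    if (!postgrestService) {
      throw new DataAccessError('postgrestService is required');
    }
    this.postgrestService = postgrestService;
    this.log = log;
  }
}
```

With this in place, a misconfigured service surfaces as one clear error at startup rather than as a TypeError inside an unrelated call path.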


4. MEDIUM: applyWhere silently drops unsupported operators

File: postgrest.utils.js - applyWhere

Only eq and contains operators are supported. If a caller passes any other operator type, the function silently returns the unfiltered query (confirmed by test case for { type: 'unknown' }). The caller receives ALL records believing they are filtered. This is a data correctness risk.

Remediation: Throw or log a warning for unsupported operator types instead of silently returning unfiltered results.
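
One way to make the failure explicit is sketched below. The condition shape `{ type, field, value }` is an assumption inferred from the `{ type: 'unknown' }` test case mentioned above; `eq` and `contains` are postgrest-js filter methods.

```javascript
// Hypothetical strict variant: unsupported operators throw instead of
// returning the unfiltered query.
function applyWhere(query, condition) {
  if (!condition) return query;
  const { type, field, value } = condition;
  switch (type) {
    case 'eq':
      return query.eq(field, value);
    case 'contains':
      return query.contains(field, value);
    default:
      throw new Error(`Unsupported where operator: ${type}`);
  }
}
```

Throwing keeps a caller from treating an unfiltered result set as a filtered one.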


5. INFORMATIONAL: Dual-path code with electrodb removed

The if (this.entity) branches (~10 points in base.collection.js) are still present, but electrodb was removed from dependencies in b73bfa6. The ElectroValidationError import was replaced with a duck-type name check. Is this intentional dead code awaiting cleanup, or should it be stripped in this PR?


6. INFORMATIONAL: Semantic-release mismatch

PR title is feat!: data access v3 (breaking = major release), but the bot comment says "minor release". May need a config check.


Summary

| # | Severity | Finding | Blocking? |
|---|----------|---------|-----------|
| 1 | CRITICAL | LatestAudit full table scan | Yes - fix or fast-follow |
| 2 | IMPORTANT | N+1 batch operations | Non-blocking, track |
| 3 | IMPORTANT | No null guard on postgrestService | Non-blocking, easy fix |
| 4 | MEDIUM | applyWhere silent operator drop | Non-blocking, track |
| 5 | INFO | Dual-path code + electrodb removed | Clarification needed |
| 6 | INFO | Semantic-release version mismatch | Check config |

Overall this is a well-structured bridge layer. The DynamoDB dep cleanup, expanded IT coverage, and parameterized ECR image in the recent commits are great. The single blocking concern is the LatestAudit full table scan (#1) with a straightforward DB-side fix.

@ekremney
Member Author

@solaris007 Re: #1349 (comment)

Thanks for the review. Several of the “this will break” points were directionally right, and the concrete regressions are now addressed in this branch.

Since that comment, commit caa3e792 landed with:

  • explicit constructor guard for missing PostgREST service (postgrestService is required),
  • applyWhere now fails on unsupported operators (no silent unfiltered fallback),
  • reduced N+1 behavior in common batch paths:
    • batchGetByKeys now uses single .in() for homogeneous single-field keys (with safe fallback),
    • removeByIndexKeys now uses bulk .delete().in() for the same pattern,
  • updated/added unit coverage for all above.
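
The "homogeneous single-field keys" condition can be read as: every key object has exactly one field, and it is the same field for all keys. A rough sketch of that check and the resulting strategy choice follows; `singleSharedField` and `planBatchGet` are hypothetical names, not functions from the commit.

```javascript
// Returns the shared field name when every key has exactly one field and
// all keys use the same field; otherwise returns null.
function singleSharedField(keys) {
  if (!Array.isArray(keys) || keys.length === 0) return null;
  const [first] = Object.keys(keys[0]);
  const homogeneous = keys.every((key) => {
    const fields = Object.keys(key);
    return fields.length === 1 && fields[0] === first;
  });
  return homogeneous ? first : null;
}

// Chooses a single IN query when possible, else the per-key fallback.
function planBatchGet(keys) {
  const field = singleSharedField(keys);
  return field
    ? { strategy: 'in', field, values: keys.map((key) => key[field]) }
    : { strategy: 'per-key', keys };
}
```

Composite or mixed keys fall through to the per-key path, which preserves correctness at the cost of one request per key.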

Validation:

  • full integration suite is green: 680 passing.

Still intentionally out-of-scope for this PR:

  • LatestAudit full-scan fix will be done in a separate mysticat-data-service PR via DB-side view/RPC and then wired here,
  • _saveMany bulk PATCH optimization can be a follow-up hardening item.

So yes, useful review, and yes, the high-signal actionable items are now fixed with tests and passing IT.

Member

@solaris007 solaris007 left a comment

Reviewed the PostgREST migration approach. Found a critical ordering bug with composite keys, an unsafe localhost default, and a duplicate sort clause issue. See inline comments for details.

Comment thread packages/spacecat-shared-data-access/src/models/base/base.collection.js Outdated
Comment thread packages/spacecat-shared-data-access/src/index.js Outdated
Comment thread packages/spacecat-shared-data-access/src/models/base/base.collection.js Outdated
solaris007 added a commit that referenced this pull request Feb 15, 2026
…ocker-compose

Update the default image reference from local mysticat-data-service:test
to the ECR-hosted 682033462621.dkr.ecr.us-east-1.amazonaws.com/mysticat-data-service:v1.8.0,
matching PR #1349 and ensuring consistency with CI.
@ekremney
Member Author

Closing due to duplicate effort: #1351

@ekremney ekremney closed this Feb 16, 2026
@solaris007 solaris007 reopened this Feb 16, 2026
@solaris007
Member

This is what makes this PR viable again: #1351 (comment)

@ekremney ekremney requested a review from solaris007 February 16, 2026 11:45
@ekremney
Member Author

@solaris007 could you please re-review?

@solaris007 solaris007 added the enhancement New feature or request label Feb 16, 2026
Member

@solaris007 solaris007 left a comment

Hey @ekremney

Approving conditionally - the architecture and migration approach are solid, but the items below should be addressed before merge.

Previous Review Issues - All 3 Fixed

  1. Composite key ordering - Fixed. #getOrderFields() now returns all index key fields.
  2. Localhost default - Fixed. Throws new Error('POSTGREST_URL is required') if env var missing.
  3. Duplicate ORDER BY - Fixed. Tiebreaker checks if id field is already in the order list.
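
The ordering fixes in items 1 and 3 can be sketched as follows: order by every index key field, then append the id column as a tiebreaker only if it is not already in the list. `buildOrderFields` is a hypothetical name, and the `id` default is an assumption; the PR's own method is `#getOrderFields()`.

```javascript
// Hypothetical sketch of a deterministic, duplicate-free ORDER BY list.
function buildOrderFields(indexKeyFields, idField = 'id') {
  // Keep all index key fields, preserving order and dropping repeats.
  const fields = [...new Set(indexKeyFields)];
  // Append the tiebreaker only when it is not already an order field.
  if (!fields.includes(idField)) fields.push(idField);
  return fields;
}
```

A stable tiebreaker matters for pagination: without it, rows with equal sort keys can shift between pages.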

Critical

C1. Presigned S3 URLs with AWS credentials in seed SQL
test/it/seed/tenants/01_tenant_alpha.sql contains presigned S3 URLs pointing to spacecat-prod-scraper.s3.us-east-1.amazonaws.com with embedded X-Amz-Credential, X-Amz-Security-Token, and X-Amz-Signature values. Even though these are expired STS credentials, they leak production bucket names, IAM credential patterns, and internal path structure into the public repo. Replace with placeholder URLs like https://example.com/screenshot.png.

C2. all-collections-methods-coverage.test.js swallows all errors
The coverage test wraps every method call in try/catch and only asserts expect(error).to.be.instanceOf(Error). Any method that throws for ANY reason (bug, missing table, connection error) is treated as a pass. This test cannot distinguish between "works correctly" and "completely broken" - it provides false confidence.

C3. Triggers never re-enabled after seeding
seed.js calls setPostgresTriggersEnabled(false) before seeding but never calls setPostgresTriggersEnabled(true) afterward. Integration tests run against a database with all triggers disabled, potentially masking bugs related to cascading updates, computed columns, or trigger-based timestamps.
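
For C3, a `try`/`finally` wrapper is one way to guarantee the triggers come back on, even when seeding fails. In this sketch the toggle and the seed routine are injected as parameters so it runs standalone; the toggle's real signature in seed.js is not shown in this thread and is assumed to take a boolean.

```javascript
// Hypothetical wrapper: `setTriggersEnabled` stands in for
// setPostgresTriggersEnabled, `seedFn` for the actual seeding routine.
async function seedWithTriggersDisabled(setTriggersEnabled, seedFn) {
  await setTriggersEnabled(false);
  try {
    await seedFn();
  } finally {
    // Runs on success and on failure, so tests never start with
    // triggers still disabled.
    await setTriggersEnabled(true);
  }
}
```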


Important

I1. applyWhere only supports eq and contains operators
ElectroDB supported richer where callbacks. Consumers using complex filters will either get no filtering or throw. Needs either documentation of supported operators or expansion.

I2. Patcher save() reverts updatedAt after persisting
In-memory model has stale updatedAt after save. If consumer code reads getUpdatedAt() after calling save(), they get the old timestamp, not the one persisted to the database.

I3. normalizeModelValue silently converts [null] to undefined
Undocumented transformation. Needs a comment explaining which Postgres/PostgREST behavior it compensates for. Also unclear what happens with [null, null] or mixed arrays.

I4. LatestAuditCollection loads ALL audits into memory for single-key queries
When only siteId or auditType is provided, #allAuditsByKeys uses fetchAllPages: true, loading potentially thousands of audits just to find the latest one. Performance risk for high-volume sites. Consider a PostgREST DISTINCT ON approach or server-side limit.

I5. ScrapeJob options null safety in computed attribute setters
optEnableJavascript and optHideConsentBanner setters access options[...] without a null guard. If options is undefined (allowed since it has no default), this throws TypeError at runtime. Fix: options?.[ScrapeJob.ScrapeOptions.ENABLE_JAVASCRIPT].

I6. LatestAudit sort order may differ from v2
allByIndexKeys sorts ascending (a.localeCompare(b)) on auditedAt. If v2 returned descending (most recent first), this is a behavioral change that could break consumers. Verify downstream expectations.

I7. _saveMany issues individual HTTP requests for updates
Unlike batch creates (single .insert() call), updates go one-by-one via Promise.all. For large batches this means N separate HTTP requests. PostgREST supports bulk UPSERT which could be leveraged.

I8. Static mutable state in EntityRegistry and LoggerRegistry
No cleanup/reset mechanism. Entity registrations persist across test cases. Multiple createDataAccess() calls with different loggers - last one wins globally.

I9. Removed DynamoDB IT tests covered entity-specific behaviors not fully replaced
Old per-entity tests (ApiKey hash validation, AsyncJob status transitions, ImportJob URL tracking, etc.) are replaced by a generic method coverage test that cannot verify correctness (see C2).
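
As a small illustration of the null-safety fix suggested in I5, optional chaining lets an option lookup return `undefined` instead of throwing when `options` itself is missing. The option keys and the `readScrapeOption` helper below are assumptions for illustration, not the package's actual ScrapeJob schema.

```javascript
// Illustrative option keys; the real values live on ScrapeJob.ScrapeOptions.
const ScrapeOptions = Object.freeze({
  ENABLE_JAVASCRIPT: 'enableJavascript',
  HIDE_CONSENT_BANNER: 'hideConsentBanner',
});

// Hypothetical accessor: safe when `options` is null or undefined.
function readScrapeOption(options, key, fallback = undefined) {
  // `??` only falls back on null/undefined, so an explicit `false` survives.
  return options?.[key] ?? fallback;
}
```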


Suggestions

  • S1: camelToSnake/snakeToCamel roundtrip is lossy for acronyms (baseURL -> base_url -> baseUrl). Schema-registered fields are safe via explicit maps, but fallback conversion for unknown fields will silently mis-map.
  • S2: Docker ECR image dependency makes IT tests non-portable for external contributors. Document the ECR login requirement prominently.
  • S3: Hardcoded TEST_IDS in helpers.js must stay in sync with seed SQL - fragile coupling with no validation.
  • S4: DEFAULT_PAGE_SIZE = 1000 may be too large for PostgREST/Postgres memory. Consider 100-250 or per-entity config.
  • S5: No request timeout, retry, or connection keep-alive config for the PostgREST client.
  • S6: KeyEvent deprecation error message doesn't tell consumers what to do instead. Consider actionable guidance in the message.
  • S7: SiteEnrollment create does a full table scan via allBySiteId for dedup - consider a unique constraint at the DB level.
  • S8: isMissingDbFieldError in seed.js does brittle string matching on PostgREST error messages - consider checking error codes instead.
  • S9: Add a smoke test that verifies all TEST_IDS exist in seed data for faster debugging when Docker environment is misconfigured.
  • S10: FixEntitySuggestion getId() behavior with composite keys should be verified - fixEntitySuggestionId is required: false + postgrestIgnore: true, which means getId() may always return undefined.
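
The acronym problem in S1 is easy to reproduce. The converters below are simplified stand-ins rather than the package's own code, and `COLUMN_TO_FIELD` is a hypothetical explicit map of the kind S1 says protects schema-registered fields.

```javascript
// Simplified converters for illustration.
function camelToSnake(name) {
  return name.replace(/([a-z0-9])([A-Z])/g, '$1_$2').toLowerCase();
}

function snakeToCamel(name) {
  return name.replace(/_([a-z0-9])/g, (_, char) => char.toUpperCase());
}

// Hypothetical explicit map for fields whose casing cannot be derived.
const COLUMN_TO_FIELD = new Map([['base_url', 'baseURL']]);

// Prefer the explicit map; fall back to mechanical conversion.
function fieldForColumn(column) {
  return COLUMN_TO_FIELD.get(column) ?? snakeToCamel(column);
}
```

Here `snakeToCamel(camelToSnake('baseURL'))` produces `baseUrl`, while `fieldForColumn('base_url')` restores `baseURL`; any acronym field missing from the map would take the lossy path.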

@ekremney
Member Author

Addressed the requested follow-ups in commit ff52726.

  • I6: LatestAudit grouped results now default to descending order (v2 parity), with unit coverage.
  • I9: Added targeted PostgREST IT regression coverage (entity-parity-regressions.test.js) for ApiKey hashed lookup, AsyncJob validation behavior, and ImportJob<->ImportUrl relation traversal.
  • S7: SiteEnrollment dedup now uses direct composite-key lookup (siteId + entitlementId) instead of scanning allBySiteId.
  • S8: Seed missing-field detection now checks PostgREST/Postgres error codes (PGRST204, 42703) with message fallback.
  • S9: Added seed integrity smoke test (seed-integrity.test.js) to verify all helper TEST_IDS exist.
  • S10: FixEntitySuggestion getId() now returns explicit id when present, else deterministic synthetic composite id.
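
The S8 change can be pictured as a code-first check with a message fallback. This is a sketch under assumptions: the error object is assumed to carry `code` and `message` properties, and the fallback pattern is illustrative rather than taken from seed.js.

```javascript
// PGRST204: PostgREST's code for a column it cannot find.
// 42703: Postgres "undefined_column".
const MISSING_FIELD_CODES = new Set(['PGRST204', '42703']);

function isMissingDbFieldError(error) {
  if (!error) return false;
  if (MISSING_FIELD_CODES.has(error.code)) return true;
  // Fallback for errors without a code; the pattern is an assumption.
  return /column .* does not exist/i.test(String(error.message ?? ''));
}
```

Checking codes first keeps the detection stable if the wording of the error messages changes between PostgREST or Postgres versions.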

Also includes previously discussed review fixes (C1/C2/C3/I1/I2/I3/I4/I5/I7/I8/S6) in the same commit set.

Validation run: npm run lint and npm test both pass locally.
PostgREST IT now authenticates to private ECR successfully via the spacecat-dev profile, but the suite currently fails in all-collections-methods-coverage due to argument-validation regressions unrelated to image pull auth.

Member

@solaris007 solaris007 left a comment

Celebrate

@ekremney ekremney merged commit 6db6b79 into main Feb 16, 2026
7 checks passed
@ekremney ekremney deleted the v3-postgrest-data-access branch February 16, 2026 15:45
solaris007 pushed a commit that referenced this pull request Feb 16, 2026
## [@adobe/spacecat-shared-data-access-v3.0.0](https://github.com/adobe/spacecat-shared/compare/@adobe/spacecat-shared-data-access-v2.109.0...@adobe/spacecat-shared-data-access-v3.0.0) (2026-02-16)

### ⚠ BREAKING CHANGES

* **data-access:** data-access v3 migrates from DynamoDB/ElectroDB to Postgres/PostgREST

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* data access v3 (#1349)

### Features

* data access v3 ([#1349](#1349)) ([6db6b79](6db6b79))
* **data-access:** v3 Postgres/PostgREST migration ([b79725f](b79725f))
@solaris007
Member

🎉 This PR is included in version @adobe/spacecat-shared-data-access-v3.0.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
