
Conversation

Collaborator

@johnewart johnewart commented Oct 17, 2025

Ticket ENG-1589

Description Of Changes

Adds a PostgreSQL migration to create the new consent document table and its partitions.

Code Changes

The migration described above; it also copies alembic.ini to the project root so you can run Alembic without changing directories (this is also "normally" where Alembic puts the configuration file; we may want to move it permanently later, but for now it is just a copy).

Both copies were updated with a migration file name template that is date-based instead of revision-only, which makes it easier to reason about the history and linear progression of changes to the database.
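
For reference, a date-based Alembic file_template roughly along these lines would produce names like xx_2025_10_17_1603-<rev>_<slug>.py (a sketch only; the exact value in this PR's alembic.ini may differ):

```ini
# Sketch of the relevant alembic.ini setting (illustrative, not necessarily the PR's exact value)
[alembic]
# Date-prefixed, sortable migration filenames: xx_YYYY_MM_DD_HHMM-<rev>_<slug>.py
file_template = xx_%%(year)d_%%(month).2d_%%(day).2d_%%(hour).2d%%(minute).2d-%%(rev)s_%%(slug)s
```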

Steps to Confirm

No manual steps needed

Pre-Merge Checklist

  • Issue requirements met
  • All CI pipelines succeeded
  • CHANGELOG.md updated
    • Add a db-migration label to the entry if your change includes a DB migration
    • Add a high-risk label to the entry if your change includes a high-risk change (i.e. potential for performance impact or unexpected regression) that should be flagged
    • Updates unreleased work already in Changelog, no new entry necessary
  • Followup issues:
    • Followup issues created
    • No followup issues
  • Database migrations:
    • Ensure that your downrev is up to date with the latest revision on main
    • Ensure that your downgrade() migration is correct and works
      • If a downgrade migration is not possible for this change, please call this out in the PR description!
    • No migrations
  • Documentation:
    • Documentation complete, PR opened in fidesdocs
    • Documentation issue created in fidesdocs
    • If there are any new client scopes created as part of the pull request, remember to update public-facing documentation that references our scope registry
    • No documentation updates required

…artitioning; copies the alembic.ini to the root so you can use alembic from the root of the project (where it typically goes).
@johnewart johnewart requested a review from galvana October 17, 2025 23:43
@johnewart johnewart requested a review from a team as a code owner October 17, 2025 23:43

vercel bot commented Oct 17, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

2 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
fides-plus-nightly Ignored Ignored Preview Oct 24, 2025 3:34am
fides-privacy-center Ignored Ignored Oct 24, 2025 3:34am

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Summary

This PR introduces a PostgreSQL partitioned table for storing consent documents (v3) and updates the Alembic configuration to use date-based migration file naming for better historical tracking.

Key Changes:

  • Creates consent_documents table with LIST partitioning on is_latest boolean field
  • Adds two partitions: consent_documents_current (for latest records) and consent_documents_historic (for historical records)
  • Copies alembic.ini to project root with date-based file template (xx_YYYY_MM_DD_HHMM-revision-slug)
  • Updates existing src/fides/api/alembic/alembic.ini with same date-based template

Issues Found:

  • Missing primary key constraint on consent_documents table (critical - table won't function properly)
  • Boolean column is_latest should be non-nullable with default to avoid three-state logic
  • JSONB column uses incorrect syntax (postgresql.json.JSONB vs postgresql.JSONB(astext_type=sa.Text()))
  • created_at uses wrong default syntax (default= instead of server_default=)

The alembic.ini configuration changes are good and improve developer experience, but the migration has several critical issues that must be fixed before merging.

Confidence Score: 1/5

  • This PR has critical database schema issues that will cause runtime failures
  • Score reflects multiple critical issues: missing primary key constraint (will cause table operations to fail), incorrect default syntax for created_at timestamp (won't work), nullable boolean used for partitioning (creates ambiguity), and inconsistent JSONB syntax. These are not stylistic concerns but functional errors that will prevent the migration from working correctly.
  • src/fides/api/alembic/migrations/versions/xx_2025_10_17_1603-5093e92e2a5a_add_consent_data_v3_to_the_database.py requires immediate attention to fix critical schema issues

Important Files Changed

File Analysis

Filename | Score | Overview
src/fides/api/alembic/migrations/versions/xx_2025_10_17_1603-5093e92e2a5a_add_consent_data_v3_to_the_database.py | 2/5 | Creates partitioned consent_documents table with JSONB search data, TEXT record data, and boolean is_latest field - missing primary key, nullable constraints on critical fields, and proper JSONB syntax
alembic.ini | 5/5 | New root-level alembic configuration with date-based migration template for easier reasoning about migration history
src/fides/api/alembic/alembic.ini | 5/5 | Updated with date-based migration file template matching the root alembic.ini configuration

Sequence Diagram

sequenceDiagram
    participant Dev as Developer
    participant Alembic as Alembic CLI
    participant Migration as Migration Script
    participant DB as PostgreSQL Database
    
    Dev->>Alembic: alembic upgrade head
    Alembic->>Migration: Execute upgrade()
    Migration->>DB: CREATE TABLE consent_documents
    Note over DB: Partitioned by LIST(is_latest)
    Migration->>DB: CREATE TABLE consent_documents_current
    Note over DB: PARTITION FOR VALUES IN (true)
    Migration->>DB: CREATE TABLE consent_documents_historic
    Note over DB: PARTITION FOR VALUES IN (false)
    DB-->>Migration: Tables created
    Migration-->>Alembic: Migration complete
    Alembic-->>Dev: Database upgraded
    
    Note over Dev,DB: Rollback scenario
    Dev->>Alembic: alembic downgrade -1
    Alembic->>Migration: Execute downgrade()
    Migration->>DB: DROP TABLE consent_documents
    Note over DB: Cascades to all partitions
    DB-->>Migration: Tables dropped
    Migration-->>Alembic: Downgrade complete
    Alembic-->>Dev: Database downgraded
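For context, the partition setup the diagram describes would look roughly like this in an Alembic migration (a sketch using the table and partition names listed above, not the PR's exact code; the parent table itself is created with postgresql_partition_by="LIST (is_latest)" as shown in the snippet further down):

```python
# Sketch only: approximates the partition DDL summarized in the diagram above.
from alembic import op


def upgrade():
    # Parent table is created separately with postgresql_partition_by="LIST (is_latest)".
    op.execute(
        "CREATE TABLE consent_documents_current "
        "PARTITION OF consent_documents FOR VALUES IN (true)"
    )
    op.execute(
        "CREATE TABLE consent_documents_historic "
        "PARTITION OF consent_documents FOR VALUES IN (false)"
    )


def downgrade():
    # Dropping the partitioned parent drops both partitions with it.
    op.execute("DROP TABLE IF EXISTS consent_documents")
```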

3 files reviewed, 5 comments


Comment on lines 21 to 35
op.create_table(
    "consent_documents",
    sa.Column("search_data", postgresql.json.JSONB),
    sa.Column("record_data", postgresql.TEXT),
    sa.Column("is_latest", postgresql.BOOLEAN),
    sa.Column(
        "created_at",
        postgresql.TIMESTAMP(timezone=True),
        nullable=False,
        default=sa.func.now,
    ),
    sa.Column("updated_at", postgresql.TIMESTAMP(timezone=True), nullable=True),
    sa.Column("consented_at", postgresql.TIMESTAMP(timezone=True), nullable=True),
    postgresql_partition_by="LIST (is_latest)",
)
Contributor

logic: Table missing primary key. All tables need a primary key column with PrimaryKeyConstraint.

Add an id column:

sa.Column("id", sa.String(length=255), nullable=False),

And add before the partition clause:

sa.PrimaryKeyConstraint("id"),

Collaborator Author

They don't technically need a primary key column; one is only required if we are going to use SQLAlchemy's ORM for these records (object identity / change tracking requires a unique identifier). I'm not sure we want the overhead of the ORM and its tracking here, though I haven't measured how much overhead that would actually be in our case. I'm happy to add it - thoughts?

Contributor

You raise a good point. If you're not using SQLAlchemy's ORM for this table and just doing raw queries, you technically don't need a primary key. However, there are a few considerations beyond ORM usage:

  1. PostgreSQL partitioning - While not strictly required, having a primary key (or at least a unique constraint) can help with partition pruning and query performance
  2. Future flexibility - If you ever need to update/delete specific records, having an identifier makes that much easier
  3. Indexing - Without any unique constraint, there's no index on this table at all, which could impact query performance

Since you mentioned you haven't tested the overhead yet, here are your options:

Option 1: Add the primary key (if using ORM or want the benefits above)

Suggested change
op.create_table(
    "consent_documents",
    sa.Column("search_data", postgresql.json.JSONB),
    sa.Column("record_data", postgresql.TEXT),
    sa.Column("is_latest", postgresql.BOOLEAN),
    sa.Column(
        "created_at",
        postgresql.TIMESTAMP(timezone=True),
        nullable=False,
        default=sa.func.now,
    ),
    sa.Column("updated_at", postgresql.TIMESTAMP(timezone=True), nullable=True),
    sa.Column("consented_at", postgresql.TIMESTAMP(timezone=True), nullable=True),
    postgresql_partition_by="LIST (is_latest)",
)

op.create_table(
    "consent_documents",
    sa.Column("id", sa.String(length=255), nullable=False),
    sa.Column("search_data", postgresql.json.JSONB),
    sa.Column("record_data", postgresql.TEXT),
    sa.Column("is_latest", postgresql.BOOLEAN),
    sa.Column(
        "created_at",
        postgresql.TIMESTAMP(timezone=True),
        nullable=False,
        server_default=sa.text("now()"),
    ),
    sa.Column("updated_at", postgresql.TIMESTAMP(timezone=True), nullable=True),
    sa.Column("consented_at", postgresql.TIMESTAMP(timezone=True), nullable=True),
    sa.PrimaryKeyConstraint("id"),
    postgresql_partition_by="LIST (is_latest)",
)

Option 2: Skip the primary key but add an index on commonly queried fields in search_data or created_at for query performance

What's your access pattern going to be for this table? That might help determine whether the PK is worth it.
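
If Option 2 were chosen, the index-only route could look something like this (a sketch; index names and choice of columns are hypothetical and would depend on the actual access pattern):

```python
# Sketch only: hypothetical indexes if the primary key were skipped (Option 2).
from alembic import op


def upgrade():
    # GIN index for key/containment lookups inside the JSONB search_data column
    op.create_index(
        "ix_consent_documents_search_data",
        "consent_documents",
        ["search_data"],
        postgresql_using="gin",
    )
    # B-tree index for time-range queries on created_at
    op.create_index(
        "ix_consent_documents_created_at",
        "consent_documents",
        ["created_at"],
    )
```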

Collaborator Author

The indexes will be created, updated, or removed later as the fields being indexed will be chosen either by configuration or during runtime.

Contributor

Hmm, would it be a painful operation to add a PK down the line? E.g. if we want some other table to have an FK to one of these records.
Also, I think we almost always use SQLAlchemy's ORM; I can't think of a table that doesn't use it. Do we think the performance advantage of not using it will be noticeable?

Contributor

Good points. Let me address both:

On adding a PK later: It would be painful, yes. For partitioned tables, PostgreSQL requires that any primary key (or unique constraint) include all of the partition key columns. So if you add a PK later, you'd need to either:

  1. Include is_latest in a composite PK (e.g., PrimaryKeyConstraint("id", "is_latest")), or
  2. Drop and recreate the entire table structure including all partitions

On ORM usage: You're right that Fides consistently uses the ORM across the codebase. Looking at the models, even high-volume tables like AuditLog, PrivacyRequest, and various consent-related tables all use the ORM. The overhead is generally negligible for most operations, and you get:

  • Type safety
  • Relationship handling
  • Session management
  • Query builder benefits

Given these two considerations and the consistency with the rest of the codebase, I'd recommend adding the PK now. Here's the pattern from similar partitioned tables in the codebase:

Suggested change
op.create_table(
    "consent_documents",
    sa.Column("search_data", postgresql.json.JSONB),
    sa.Column("record_data", postgresql.TEXT),
    sa.Column("is_latest", postgresql.BOOLEAN),
    sa.Column(
        "created_at",
        postgresql.TIMESTAMP(timezone=True),
        nullable=False,
        default=sa.func.now,
    ),
    sa.Column("updated_at", postgresql.TIMESTAMP(timezone=True), nullable=True),
    sa.Column("consented_at", postgresql.TIMESTAMP(timezone=True), nullable=True),
    postgresql_partition_by="LIST (is_latest)",
)

op.create_table(
    "privacy_preferences",
    sa.Column("id", sa.String(length=255), nullable=False),
    sa.Column("search_data", postgresql.json.JSONB),
    sa.Column("record_data", postgresql.TEXT),
    sa.Column("is_latest", postgresql.BOOLEAN, nullable=False, server_default="f"),
    sa.Column(
        "created_at",
        postgresql.TIMESTAMP(timezone=True),
        nullable=False,
        server_default=sa.text("now()"),
    ),
    sa.Column("updated_at", postgresql.TIMESTAMP(timezone=True), nullable=True),
    sa.Column("collected_at", postgresql.TIMESTAMP(timezone=True), nullable=True),
    sa.PrimaryKeyConstraint("id", "is_latest"),
    postgresql_partition_by="LIST (is_latest)",
)

Note the composite PK includes is_latest since it's the partition key. You'll generate IDs at the application level (probably UUIDs).


codecov bot commented Oct 18, 2025

Codecov Report

❌ Patch coverage is 46.66667% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.38%. Comparing base (4240eb4) to head (04e995c).
⚠️ Report is 11 commits behind head on main.

Files with missing lines | Patch % | Lines
src/fides/api/db/database.py | 46.66% | 8 Missing ⚠️

❌ Your patch status has failed because the patch coverage (46.66%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6790      +/-   ##
==========================================
- Coverage   87.39%   87.38%   -0.02%     
==========================================
  Files         518      518              
  Lines       33821    33835      +14     
  Branches     3892     3895       +3     
==========================================
+ Hits        29559    29566       +7     
- Misses       3407     3414       +7     
  Partials      855      855              

☔ View full report in Codecov by Sentry.

@johnewart johnewart requested a review from erosselli October 20, 2025 18:29
        default=sa.func.now,
    ),
    sa.Column("updated_at", postgresql.TIMESTAMP(timezone=True), nullable=True),
    sa.Column("consented_at", postgresql.TIMESTAMP(timezone=True), nullable=True),
Contributor

I think we had discussed calling this collected_at, since consented_at kind of implies opt-in consent.


def upgrade():
    op.create_table(
        "consent_documents",
Contributor

I think our domain name for "consent" is actually "privacy_preferences", so maybe we should use that instead?

Collaborator Author

Sure, I will rename it to privacy_preferences if that makes more sense!

depends_on = None


def upgrade():
Contributor

do we need a PK for the table at all ?

Collaborator Author

No, see above conversation with the bot, which argues with itself 😆

alembic.ini Outdated
Contributor

I know it's better to have it top-level but maybe instead of duplicating the alembic file we can leave it where it is for now and then push a separate PR that moves it to the top?

Collaborator Author

That's fine 😄 I worry that moving it may have unintended consequences, so I will open a PR with the duplicate copies first and then a follow-up that removes the old one. That way it's easier to undo while still keeping migration files named in a sane way.

…093e92e2a5a_add_consent_data_v3_to_the_database.py


Switch to `server_default` over `default` to be consistent with other uses

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

vercel bot commented Oct 20, 2025

Deployment failed with the following error:

Failed to create deployment for team_zNyOMht1YW4aGAS2AIEa1sZ7 in project prj_RHcBVflyGsFyx07EZ93ZBBnf7e1O: FetchError: request to https://76.76.21.112/v13/now/deployments?ownerId=team_zNyOMht1YW4aGAS2AIEa1sZ7&projectId=prj_RHcBVflyGsFyx07EZ93ZBBnf7e1O&skipAutoDetectionConfirmation=1&teamId=team_zNyOMht1YW4aGAS2AIEa1sZ7&traceCarrier=%7B%22ot-baggage-webhookAt%22%3A%221760996230468%22%2C%22ot-baggage-senderUsername%22%3A%22gh.johnewart%22%2C%22baggage%22%3A%22webhookAt%3D1760996230468%2CsenderUsername%3Dgh.johnewart%22%2C%22x-datadog-trace-id%22%3A%222219022528182712345%22%2C%22x-datadog-parent-id%22%3A%225863888292101682785%22%2C%22x-datadog-sampling-priority%22%3A%222%22%2C%22x-datadog-tags%22%3A%22_dd.p.tid%3D68f6ab8600000000%2C_dd.p.dm%3D-3%22%2C%22traceparent%22%3A%2200-68f6ab86000000001ecb8d3b84eb7819-5160b7555d4af661-01%22%2C%22tracestate%22%3A%22dd%3Dt.tid%3A68f6ab8600000000%3Bt.dm%3A-3%3Bs%3A2%3Bp%3A5160b7555d4af661%22%7D failed, reason: socket hang up

johnewart and others added 3 commits October 20, 2025 14:48
…093e92e2a5a_add_consent_data_v3_to_the_database.py


Force `is_latest` to be not null and default to `f` if not specified.

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
…093e92e2a5a_add_consent_data_v3_to_the_database.py


Removing un-needed `pass`

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
… but leave the template name change (will make a separate PR with the alembic.ini in the new location)
Contributor

@erosselli erosselli left a comment

Approving with one question, mostly to decide whether we want to stick to our current pattern for PKs or have a good reason to deviate from it. Let's not forget the changelog entry too :) and to update the PR title so it's easy to search for later :)

Comment on lines 24 to 26
# Use a big integer for the primary key to ensure we have enough space for all the records
# and also force the ID to be generated, never allowing it to be overridden without expressly using `OVERRIDING SYSTEM VALUE`
sa.Column("id", BigInteger, Identity(start=1, increment=1, always=True), primary_key=True),
Contributor

This makes sense to me. I do want to call out that we don't have this pattern anywhere else in our codebase; our PKs are usually <prefix>_<uuid> (e.g. sta_e14dbde3-e664-4a91-947e-139919630a30, pri_f192b815-fb03-4bfb-bd34-4b27ba5bd168). I don't feel too strongly about adhering to this, but will tag @galvana just in case he has opinions.

Contributor

If we're still creating a SQLAlchemy model for this, which I think we are, then I would let our base class handle IDs. That way it's consistent, and we'll still have enough identifiers. But I'm open to changes if we have a good reason for it.

Collaborator Author

My only consideration would be if we have hundreds of millions (or more) of records with an extra string; it's small in comparison to the document, certainly, but not insignificant, and the index will be much bigger as well.

Collaborator Author

(Hence the 64-bit fixed-size index)
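
Rough back-of-envelope math behind that concern (illustrative figures, not measured numbers from this PR):

```python
# Illustrative sizing only: assumed row count and key widths, not measurements.
rows = 500_000_000                 # hypothetical record count
bigint_key = 8                     # bytes per BIGINT key value
prefixed_uuid_key = 4 + 36         # "pri_" prefix + 36-char UUID text, ~40 bytes
extra = rows * (prefixed_uuid_key - bigint_key)
print(f"~{extra / 1024**3:.0f} GiB of additional key data")  # ~15 GiB, before index/tuple overhead
```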

Contributor

I just tested this and SQLAlchemy doesn't care about the data type, just that an ID is present. Let's keep this how it is

Collaborator Author

Yeah, SQLAlchemy just needs a stable ID for its internal 'this record has changed and is dirty' bookkeeping
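
A minimal sketch of that point (class and table names are illustrative; the real model would follow Fides' own base classes and the final column names):

```python
# Sketch only: any primary key satisfies the ORM's identity-map/dirty-tracking needs;
# the key type (BIGINT identity vs. prefixed-UUID string) is purely a storage choice.
from sqlalchemy import BigInteger, Boolean, Column, Identity, Text
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class ConsentDocument(Base):  # illustrative name; the table was renamed during review
    __tablename__ = "consent_documents"

    id = Column(BigInteger, Identity(always=True), primary_key=True)
    search_data = Column(JSONB)
    record_data = Column(Text)
    is_latest = Column(Boolean, nullable=False, server_default="f")
```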

Contributor

@galvana galvana left a comment

Let's settle on the ID approach, then we can merge.

johnewart and others added 3 commits October 23, 2025 11:22
They will be stored in preferences instead

Co-authored-by: Adrian Galvan <adrian@ethyca.com>
Use underscore instead of dash for filename

Co-authored-by: Adrian Galvan <adrian@ethyca.com>
@johnewart johnewart enabled auto-merge October 23, 2025 19:29
@johnewart johnewart added this pull request to the merge queue Oct 24, 2025
Merged via the queue into main with commit 7617d38 Oct 24, 2025
67 of 68 checks passed
@johnewart johnewart deleted the johnewart/ENG-1589 branch October 24, 2025 04:44