Adds migration for the new consent_documents table #6790
Conversation
…artitioning; copies the alembic.ini to the root so you can use alembic from the root of the project (where it typically goes).
Greptile Overview
Summary
This PR introduces a PostgreSQL partitioned table for storing consent documents (v3) and updates the Alembic configuration to use date-based migration file naming for better historical tracking.
Key Changes:
- Creates `consent_documents` table with LIST partitioning on the `is_latest` boolean field
- Adds two partitions: `consent_documents_current` (for latest records) and `consent_documents_historic` (for historical records)
- Copies `alembic.ini` to the project root with a date-based file template (`xx_YYYY_MM_DD_HHMM-revision-slug`)
- Updates the existing `src/fides/api/alembic/alembic.ini` with the same date-based template

Issues Found:
- Missing primary key constraint on the `consent_documents` table (critical: the table won't function properly)
- Boolean column `is_latest` should be non-nullable with a default to avoid three-state logic
- JSONB column uses inconsistent syntax (`postgresql.json.JSONB` vs `postgresql.JSONB(astext_type=sa.Text())`)
- `created_at` uses the wrong default syntax (`default=` instead of `server_default=`)

The alembic.ini configuration changes are good and improve developer experience, but the migration has several critical issues that must be fixed before merging.
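The nullable `is_latest` issue is worth spelling out: with `LIST (is_latest)` partitions defined only for true and false, a NULL value has no partition to land in. A minimal Python sketch of the routing behavior (the partition names mirror the migration; the routing function itself is an illustration, not PostgreSQL's implementation):

```python
# Sketch: LIST-partition routing on a nullable boolean (illustrative only).
# Partitions exist for TRUE and FALSE; with no DEFAULT partition, a NULL
# partition key makes the INSERT fail in PostgreSQL.

def route_partition(is_latest):
    if is_latest is True:
        return "consent_documents_current"
    if is_latest is False:
        return "consent_documents_historic"
    # Postgres raises an error like "no partition of relation found for row"
    raise ValueError("no partition for NULL is_latest")

print(route_partition(True))   # routes to the current partition
print(route_partition(False))  # routes to the historic partition
```

Making the column `nullable=False` with a server default removes the NULL case entirely.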
Confidence Score: 1/5
- This PR has critical database schema issues that will cause runtime failures
- Score reflects multiple critical issues: missing primary key constraint (will cause table operations to fail), incorrect default syntax for created_at timestamp (won't work), nullable boolean used for partitioning (creates ambiguity), and inconsistent JSONB syntax. These are not stylistic concerns but functional errors that will prevent the migration from working correctly.
- src/fides/api/alembic/migrations/versions/xx_2025_10_17_1603-5093e92e2a5a_add_consent_data_v3_to_the_database.py requires immediate attention to fix critical schema issues
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/fides/api/alembic/migrations/versions/xx_2025_10_17_1603-5093e92e2a5a_add_consent_data_v3_to_the_database.py | 2/5 | Creates partitioned consent_documents table with JSONB search data, TEXT record data, and boolean is_latest field - missing primary key, nullable constraints on critical fields, and proper JSONB syntax |
| alembic.ini | 5/5 | New root-level alembic configuration with date-based migration template for easier reasoning about migration history |
| src/fides/api/alembic/alembic.ini | 5/5 | Updated with date-based migration file template matching the root alembic.ini configuration |
Sequence Diagram
```mermaid
sequenceDiagram
    participant Dev as Developer
    participant Alembic as Alembic CLI
    participant Migration as Migration Script
    participant DB as PostgreSQL Database
    Dev->>Alembic: alembic upgrade head
    Alembic->>Migration: Execute upgrade()
    Migration->>DB: CREATE TABLE consent_documents
    Note over DB: Partitioned by LIST(is_latest)
    Migration->>DB: CREATE TABLE consent_documents_current
    Note over DB: PARTITION FOR VALUES IN (true)
    Migration->>DB: CREATE TABLE consent_documents_historic
    Note over DB: PARTITION FOR VALUES IN (false)
    DB-->>Migration: Tables created
    Migration-->>Alembic: Migration complete
    Alembic-->>Dev: Database upgraded
    Note over Dev,DB: Rollback scenario
    Dev->>Alembic: alembic downgrade -1
    Alembic->>Migration: Execute downgrade()
    Migration->>DB: DROP TABLE consent_documents
    Note over DB: Cascades to all partitions
    DB-->>Migration: Tables dropped
    Migration-->>Alembic: Downgrade complete
    Alembic-->>Dev: Database downgraded
```
3 files reviewed, 5 comments
```python
op.create_table(
    "consent_documents",
    sa.Column("search_data", postgresql.json.JSONB),
    sa.Column("record_data", postgresql.TEXT),
    sa.Column("is_latest", postgresql.BOOLEAN),
    sa.Column(
        "created_at",
        postgresql.TIMESTAMP(timezone=True),
        nullable=False,
        default=sa.func.now,
    ),
    sa.Column("updated_at", postgresql.TIMESTAMP(timezone=True), nullable=True),
    sa.Column("consented_at", postgresql.TIMESTAMP(timezone=True), nullable=True),
    postgresql_partition_by="LIST (is_latest)",
)
```
logic: Table missing primary key. All tables need a primary key column with `PrimaryKeyConstraint`.
Add an id column:

```python
sa.Column("id", sa.String(length=255), nullable=False),
```

And add before the partition clause:

```python
sa.PrimaryKeyConstraint("id"),
```
They don't technically need a primary key column; one is only required if we are going to be using SA's ORM for these records (object identity / change tracking does require a unique identifier). I am not sure we want the overhead of an ORM and tracking here, and I haven't tested how much overhead there would be in our case, but I am happy to add it - thoughts?
You raise a good point. If you're not using SQLAlchemy's ORM for this table and just doing raw queries, you technically don't need a primary key. However, there are a few considerations beyond ORM usage:
- PostgreSQL partitioning - While not strictly required, having a primary key (or at least a unique constraint) can help with partition pruning and query performance
- Future flexibility - If you ever need to update/delete specific records, having an identifier makes that much easier
- Indexing - Without any unique constraint, there's no index on this table at all, which could impact query performance
Since you mentioned you haven't tested the overhead yet, here are your options:
Option 1: Add the primary key (if using ORM or want the benefits above)
```python
op.create_table(
    "consent_documents",
    sa.Column("id", sa.String(length=255), nullable=False),
    sa.Column("search_data", postgresql.json.JSONB),
    sa.Column("record_data", postgresql.TEXT),
    sa.Column("is_latest", postgresql.BOOLEAN),
    sa.Column(
        "created_at",
        postgresql.TIMESTAMP(timezone=True),
        nullable=False,
        server_default=sa.text("now()"),
    ),
    sa.Column("updated_at", postgresql.TIMESTAMP(timezone=True), nullable=True),
    sa.Column("consented_at", postgresql.TIMESTAMP(timezone=True), nullable=True),
    sa.PrimaryKeyConstraint("id"),
    postgresql_partition_by="LIST (is_latest)",
)
```
Option 2: Skip the primary key but add an index on commonly queried fields in search_data or created_at for query performance
What's your access pattern going to be for this table? That might help determine whether the PK is worth it.
The indexes will be created, updated, or removed later as the fields being indexed will be chosen either by configuration or during runtime.
hmm would it be a painful operation to add a PK down the line? e.g. if we want some other table to have a FK to one of these records.
Also I think we almost always use SQLAlchemy's ORM; I'm not sure I can think of a table that doesn't use it -- do we think the performance advantage of not using it will be noticeable?
Good points. Let me address both:
On adding a PK later: It would be painful, yes. PostgreSQL requires that the partition key columns be included in any primary key on a partitioned table. So if you add a PK later, you'd need to either:
- Include `is_latest` in a composite PK (e.g., `PrimaryKeyConstraint("id", "is_latest")`), or
- Drop and recreate the entire table structure including all partitions
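One consequence of the composite-key option, sketched in plain Python with a hypothetical id (set membership here just models the constraint): uniqueness is enforced per `(id, is_latest)` pair, so the same `id` may exist once as latest and once as historic.

```python
# Model the composite PK ("id", "is_latest") as set membership.
rows = set()

def insert(row_id: str, is_latest: bool) -> None:
    key = (row_id, is_latest)
    if key in rows:
        raise ValueError(f"duplicate key: {key}")
    rows.add(key)

insert("pri_abc", True)   # latest version
insert("pri_abc", False)  # historic copy of the same logical record: allowed
```

Whether that semantics is acceptable depends on how "latest" rows are moved to the historic partition on update.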
On ORM usage: You're right that Fides consistently uses the ORM across the codebase. Looking at the models, even high-volume tables like AuditLog, PrivacyRequest, and various consent-related tables all use the ORM. The overhead is generally negligible for most operations, and you get:
- Type safety
- Relationship handling
- Session management
- Query builder benefits
Given these two considerations and the consistency with the rest of the codebase, I'd recommend adding the PK now. Here's the pattern from similar partitioned tables in the codebase:
```python
op.create_table(
    "privacy_preferences",
    sa.Column("id", sa.String(length=255), nullable=False),
    sa.Column("search_data", postgresql.json.JSONB),
    sa.Column("record_data", postgresql.TEXT),
    sa.Column("is_latest", postgresql.BOOLEAN, nullable=False, server_default="f"),
    sa.Column(
        "created_at",
        postgresql.TIMESTAMP(timezone=True),
        nullable=False,
        server_default=sa.text("now()"),
    ),
    sa.Column("updated_at", postgresql.TIMESTAMP(timezone=True), nullable=True),
    sa.Column("collected_at", postgresql.TIMESTAMP(timezone=True), nullable=True),
    sa.PrimaryKeyConstraint("id", "is_latest"),
    postgresql_partition_by="LIST (is_latest)",
)
```
Note the composite PK includes is_latest since it's the partition key. You'll generate IDs at the application level (probably UUIDs).
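A minimal sketch of application-level ID generation using the `<prefix>_<uuid>` convention mentioned later in this thread (the `pri` prefix is just an example, not the final choice):

```python
import uuid

def generate_id(prefix: str) -> str:
    """Generate a prefixed UUID id, e.g. 'pri_f192b815-fb03-...'."""
    return f"{prefix}_{uuid.uuid4()}"

new_id = generate_id("pri")
print(new_id)
```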
Codecov Report

❌ Your patch status has failed because the patch coverage (46.66%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

```
@@            Coverage Diff             @@
##             main    #6790      +/-   ##
==========================================
- Coverage   87.39%   87.38%    -0.02%
==========================================
  Files         518      518
  Lines       33821    33835      +14
  Branches     3892     3895       +3
==========================================
+ Hits        29559    29566       +7
- Misses       3407     3414       +7
  Partials      855      855
```

View full report in Codecov by Sentry.
```python
        default=sa.func.now,
    ),
    sa.Column("updated_at", postgresql.TIMESTAMP(timezone=True), nullable=True),
    sa.Column("consented_at", postgresql.TIMESTAMP(timezone=True), nullable=True),
```
I think we had discussed calling this collected_at since consented_at kind of implies opt-in consent
```python
def upgrade():
    op.create_table(
        "consent_documents",
```
I think our domain name for "consent" is actually "privacy_preferences" , so maybe we should use that instead?
Sure, I will rename it to privacy_preferences if that makes more sense!
```python
depends_on = None


def upgrade():
```
do we need a PK for the table at all?
No, see above conversation with the bot, which argues with itself 😆
alembic.ini
Outdated
I know it's better to have it top-level but maybe instead of duplicating the alembic file we can leave it where it is for now and then push a separate PR that moves it to the top?
That's fine 😄 I worry that moving it may have unintended consequences, so I will open a PR with the duplicate versions first and then one removing it so it's easier to undo and still have migration files named in a sane way
…093e92e2a5a_add_consent_data_v3_to_the_database.py Switch to `server_default` over `default` to be consistent with other uses Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Deployment failed with the following error:
…093e92e2a5a_add_consent_data_v3_to_the_database.py Force `is_latest` to be not null and default to `f` if not specified. Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
…093e92e2a5a_add_consent_data_v3_to_the_database.py Removing un-needed `pass` Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
… but leave the template name change (will make a separate PR with the alembic.ini in the new location)
Approving with one question, mostly to decide if we want to stick to our current pattern for PKs or if we have good reasons to deviate from it. Let's not forget the changelog entry too :) and updating the PR title for future easy searches :)
```python
# Use a big integer for the primary key to ensure we have enough space for all the records
# and also force the ID to be generated, never allowing it to be overridden without expressly using `OVERRIDING SYSTEM VALUE`
sa.Column("id", BigInteger, Identity(start=1, increment=1, always=True), primary_key=True),
```
this makes sense to me. I do want to call out that we don't have this pattern anywhere else in our codebase; our PKs are usually `<prefix>_<uuid>` (e.g. `sta_e14dbde3-e664-4a91-947e-139919630a30`, `pri_f192b815-fb03-4bfb-bd34-4b27ba5bd168`). I don't feel too strongly about adhering to this but will tag @galvana just in case he has opinions
If were still creating a SQLAlchemy model for this, which I think we are, then I would let our base class handle IDs. That way it's consistent, and we'll still have enough identifiers. But open to changes if we have a good reason for it.
My only consideration would be if we have hundreds of millions (or more) of records with an extra string; it's small in comparison to the document, certainly, but not insignificant, and the index will be much bigger as well.
(Hence the 64-bit fixed-size index)
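The size concern can be made concrete with some back-of-envelope arithmetic (the row count is hypothetical and the per-key sizes are simplified; real index entries carry additional per-tuple overhead):

```python
rows = 500_000_000  # hypothetical record count

bigint_bytes = 8  # fixed-size 64-bit integer key
string_id = "pri_f192b815-fb03-4bfb-bd34-4b27ba5bd168"
string_bytes = len(string_id) + 1  # text payload plus a simplified length byte

print(rows * bigint_bytes / 1e9, "GB of bigint key data")
print(rows * string_bytes / 1e9, "GB of prefixed-uuid key data")
```

Roughly a 5x difference in raw key bytes under these assumptions, before index overhead.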
I just tested this and SQLAlchemy doesn't care about the data type, just that an ID is present. Let's keep this how it is
Yeah, SQLAlchemy just needs a stable ID for its internal 'this record has changed and is dirty' bookkeeping
galvana
left a comment
Let's settle on the ID approach, then we can merge
They will be stored in preferences instead Co-authored-by: Adrian Galvan <adrian@ethyca.com>
Use underscore instead of dash for filename Co-authored-by: Adrian Galvan <adrian@ethyca.com>
Ticket ENG-1589
Description Of Changes
Adds PostgreSQL migration to create the new consent document table and its partitions.
Code Changes
The migration as mentioned above, but also copies the
alembic.inito the root so you can use it without moving around (also, this is "normally" where alembic puts the configuration file, we might want to move it but I just copied it for now)Changed both the copies with an updated migration file name template to be date-based instead of just the revision - makes it easier to reason about the history and linear progression of changes to the database.
Steps to Confirm
No manual steps needed
Pre-Merge Checklist
- `CHANGELOG.md` updated
- Migrations up to date with `main`
- `downgrade()` migration is correct and works