@kaxil
Member

@kaxil kaxil commented Oct 22, 2024

Closes #43161. Part of AIP-72.

As part of the ongoing work for AIP-72: Task Execution Interface, we are migrating the task_instance table to use a UUID primary key. This change is being made to simplify task instance identification, especially when communicating between the executor and workers.

Currently, the primary key of task_instance is a composite key consisting of dag_id, task_id, run_id, and map_index as shown below. This migration introduces a UUID v7 column (id) as the new primary key.

__tablename__ = "task_instance"
task_id = Column(StringID(), primary_key=True, nullable=False)
dag_id = Column(StringID(), primary_key=True, nullable=False)
run_id = Column(StringID(), primary_key=True, nullable=False)
map_index = Column(Integer, primary_key=True, nullable=False, server_default=text("-1"))

Why UUID v7?

UUID v7 was chosen because it sorts by creation time: its leading 48 bits are a Unix timestamp in milliseconds. For existing records, the UUID v7 is generated from queued_dttm, start_date, or the current timestamp.

(Image from this blog post: https://www.toomanyafterthoughts.com/uuids-are-bad-for-database-index-performance-uuid7)
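To make the temporal-sorting property concrete, here is a minimal stdlib-only sketch of building a UUID v7 from an existing timestamp, roughly what has to happen for existing rows. The helper name `uuid7_from_datetime` is hypothetical; this is not the per-database function used by the migration, only an illustration of the bit layout (48-bit millisecond timestamp, version, random bits, variant).

```python
import os
import uuid
from datetime import datetime, timezone


def uuid7_from_datetime(ts: datetime) -> uuid.UUID:
    """Build a UUID v7 whose timestamp field comes from ``ts`` (hypothetical helper)."""
    # 48-bit Unix timestamp in milliseconds occupies the most significant bits.
    unix_ms = int(ts.timestamp() * 1000) & ((1 << 48) - 1)
    # 80 random bits; 74 of them end up in the UUID (12 for rand_a, 62 for rand_b).
    rand = int.from_bytes(os.urandom(10), "big")
    rand_a = rand >> 68              # top 12 bits
    rand_b = rand & ((1 << 62) - 1)  # low 62 bits
    value = (
        (unix_ms << 80)   # timestamp
        | (0x7 << 76)     # version 7
        | (rand_a << 64)
        | (0b10 << 62)    # RFC variant
        | rand_b
    )
    return uuid.UUID(int=value)


# Stand-ins for queued_dttm / start_date values of two existing rows.
older = uuid7_from_datetime(datetime(2024, 10, 22, tzinfo=timezone.utc))
newer = uuid7_from_datetime(datetime(2024, 10, 28, tzinfo=timezone.utc))
```

Because the timestamp sits in the leading bits, `older` sorts before `newer` both as a UUID and as a string, which is what keeps index inserts roughly append-only.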

Changes

  1. Migrated Primary Key to UUID v7

    • Replaced the composite primary key (dag_id, task_id, run_id, map_index) with a UUID v7 id field, ensuring temporal sorting and simplified task instance identification.
  2. Database-Specific UUID v7 Functions for DB Migrations

    • Added UUID v7 functions for each database, used only in the DB migration, since generating UUIDs in Python and then storing them in the DB can be slow:
      • PostgreSQL: Uses pgcrypto for generation with fallback.
      • MySQL: Custom deterministic UUID v7 function.
      • SQLite: Utilizes uuid6 Python package.
  3. Updated Constraints and Indexes

    • Added UniqueConstraint on (dag_id, task_id, run_id, map_index) for compatibility.
    • Modified foreign key constraints for the new primary key, handling downgrades to restore previous constraints.
  4. Model and API Adjustments

    • Updated the TaskInstance model to use UUID v7 as the primary key via the uuid6 library, which provides uuid7! 😄
    • Adjusted REST API, views, and queries to support UUID-based lookups.
    • Modified tests for compatibility with the new primary key.
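The constraint change in item 3 can be sketched with the stdlib `sqlite3` module. This is an illustrative toy table, not the Airflow schema or migration: the column types are simplified, and `uuid.uuid4()` is only a placeholder for the UUID v7 that the real model generates.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE task_instance (
        id TEXT PRIMARY KEY,                  -- new surrogate key
        dag_id TEXT NOT NULL,
        task_id TEXT NOT NULL,
        run_id TEXT NOT NULL,
        map_index INTEGER NOT NULL DEFAULT -1,
        UNIQUE (dag_id, task_id, run_id, map_index)  -- old composite key
    )
    """
)


def insert_ti(dag_id: str, task_id: str, run_id: str, map_index: int = -1) -> str:
    """Insert a row with a fresh id and return that id."""
    ti_id = str(uuid.uuid4())  # placeholder for a UUID v7
    conn.execute(
        "INSERT INTO task_instance (id, dag_id, task_id, run_id, map_index) "
        "VALUES (?, ?, ?, ?, ?)",
        (ti_id, dag_id, task_id, run_id, map_index),
    )
    return ti_id


first_id = insert_ti("example_dag", "example_task", "run_1")

# A second row with the same composite fields gets a different id,
# but the unique constraint still rejects it.
try:
    insert_ti("example_dag", "example_task", "run_1")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM task_instance").fetchone()[0]
```

The unique constraint keeps the old composite fields usable as a lookup key even though they are no longer the primary key.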

Issues identified

After updating the primary key on the TaskInstance model to a UUID7 id field (from the composite primary key of ["dag_id", "task_id", "run_id", "map_index"]), I have now run into issues in testing and session management:

  1. session.merge() Compatibility

    session.merge() operates strictly on primary keys, as described in the SQLAlchemy docs, so it no longer uses ["dag_id", "task_id", "run_id", "map_index"] (now only a unique constraint) to identify existing TaskInstance records. This breaks code that relied on session.merge() to either update or insert TaskInstance records: merge can no longer locate the intended record by those fields.

    Example:

    dagrun_1 = dag.create_dagrun(
        run_type=DagRunType.BACKFILL_JOB,
        state=DagRunState.RUNNING,
        start_date=DEFAULT_DATE,
        execution_date=DEFAULT_DATE,
        data_interval=(DEFAULT_DATE, DEFAULT_DATE),
        **triggered_by_kwargs,  # type: ignore
    )
    session.merge(dagrun_1)
    task_instance_1 = TI(t_1, run_id=dagrun_1.run_id, state=ti_state_begin)
    task_instance_1.job_id = 123
    session.merge(task_instance_1)
    session.commit()

    A possible workaround would be to follow session.merge() calls with refresh_from_db() to ensure the instance has the latest state:

    task_instance_1 = session.merge(TaskInstance(task=t_1, run_id=dagrun_1.run_id, state=ti_state_begin))
    task_instance_1.refresh_from_db()  # Update instance with current DB state
  2. session.get() usage with TaskInstanceKey

    Similarly, session.get() and other primary key-based session operations now rely solely on the id UUID rather than the composite fields. This results in errors when accessing or sorting records based on the previous composite fields, as session.get() cannot match identifiers to TaskInstance records by the unique constraint.

    Current code in main:

    session.get(TaskInstance, ti1.key.primary).try_number += 1

    I have been replacing those TaskInstanceKey references in database calls with ti.id, for example:

    session.get(TaskInstance, ti1.id).try_number += 1

  3. Airflow REST API calls that filter TIs

    Similar to (2) above, some of our REST API calls filter TI records by the current primary key, as below:

    if run_id and not session.get(
        TI, {"task_id": task_id, "dag_id": dag_id, "run_id": run_id, "map_index": -1}

    I am changing those to session.scalars(select(TI).where(...)).one_or_none() queries. For example, the above code is changed to the following in this PR:

    select_stmt = select(TI).where(
        TI.dag_id == dag_id, TI.task_id == task_id, TI.run_id == run_id, TI.map_index == -1
    )
    if run_id and not session.scalars(select_stmt).one_or_none():
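Why primary-key-based merging stops matching existing rows can be shown with a toy model in plain Python. Everything here is hypothetical (`ToyTI`, `merge_by_primary_key`, the dict standing in for a table); it imitates only the "look up by primary key, then update or insert" behaviour and is not SQLAlchemy's implementation. `uuid.uuid4()` is a placeholder for UUID v7.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ToyTI:
    dag_id: str
    task_id: str
    run_id: str
    map_index: int = -1
    state: str = "none"
    # Every new object gets a fresh id, as with a default-generated UUID.
    id: uuid.UUID = field(default_factory=uuid.uuid4)

    @property
    def composite_key(self) -> tuple:
        return (self.dag_id, self.task_id, self.run_id, self.map_index)


def merge_by_primary_key(table: dict, obj: ToyTI, pk) -> str:
    """Toy merge: update the row with the same primary key, else insert."""
    key = pk(obj)
    result = "updated" if key in table else "inserted"
    table[key] = obj
    return result


# Old world: the composite fields are the primary key.
old_table: dict = {}
old_first = merge_by_primary_key(old_table, ToyTI("d", "t", "r"), lambda ti: ti.composite_key)
old_second = merge_by_primary_key(
    old_table, ToyTI("d", "t", "r", state="running"), lambda ti: ti.composite_key
)

# New world: id is the primary key, so a freshly constructed object never matches.
new_table: dict = {}
new_first = merge_by_primary_key(new_table, ToyTI("d", "t", "r"), lambda ti: ti.id)
new_second = merge_by_primary_key(
    new_table, ToyTI("d", "t", "r", state="running"), lambda ti: ti.id
)
```

In the second case the toy table ends up with two entries for the same composite fields; in a real database the unique constraint would reject the second insert instead.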

TODOs

  • Check whether the DB user has permission to run CREATE EXTENSION on Postgres, since we install pgcrypto
  • Convert the raw SQL for dropping the primary key (CASCADE) into an Alembic operation. This will allow capturing foreign key constraints and managing them in a downgrade scenario. MySQL does not support dropping foreign keys using CASCADE with primary keys anyway.
  • Ensure foreign key constraints are re-added during downgrades after the primary key has been changed back to the composite key.
  • SQLite Support: Add SQLite support for UUID v7 generation using the uuid6 Python package.
  • Update the TaskInstance model file to add the id column

Future Work

@kaxil kaxil requested a review from ashb October 22, 2024 01:26
@kaxil kaxil added this to the Airflow 3.0.0 milestone Oct 22, 2024
kaxil added a commit to astronomer/airflow that referenced this pull request Oct 22, 2024
While working on apache#43243, I was following https://github.com/apache/airflow/blob/main/contributing-docs/13_metadata_database_updates.rst and ran into an error with the pre-commit hook.

When running the revision command as follows:

```
root@f1f78138ad78:/opt/airflow/airflow# alembic revision -m "New revision"
  Generating /opt/airflow/airflow/migrations/versions/01b38be821e9_new_revision.py ...  done
```

It creates a file as follows:

```python
"""New revision

Revision ID: cd7be1ae8b80
Revises: 05234396c6fc
Create Date: 2024-10-22 01:44:17.873864

"""

import sqlalchemy as sa
from alembic import op

# revision identifiers, used by Alembic.
revision = 'cd7be1ae8b80'
down_revision = '05234396c6fc'
branch_labels = None
depends_on = None
```

Notice single quotes in `revision` & `down_revision`.

Now if I run just that single pre-commit hook (`update-migration-references`), it fails:

```
❯ pre-commit run "update-migration-references" --all-files
Update migration ref doc.................................................Failed
- hook id: update-migration-references
- exit code: 1

Using 'uv' to install Airflow

Using airflow version from current sources

Updating migration reference for airflow
Making sure airflow version updated
Making sure there's no mismatching revision numbers
Traceback (most recent call last):
  File "/opt/airflow/scripts/in_container/run_migration_reference.py", line 246, in <module>
    correct_mismatching_revision_nums(revisions=revisions)
  File "/opt/airflow/scripts/in_container/run_migration_reference.py", line 230, in correct_mismatching_revision_nums
    new_content = content.replace(revision_id_match.group(1), revision_match.group(1), 1)
AttributeError: 'NoneType' object has no attribute 'group'
Error 1 returned

If you see strange stacktraces above, run `breeze ci-image build --python 3.9` and try again.
```

This is not generally a problem, because `ruff` runs first, fails, and converts the quotes to double quotes. But rather than relying on that, we fix it at the source.
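The failure mode in the traceback can be reproduced in isolation. The patterns below are assumptions for illustration only; the actual regular expression in `run_migration_reference.py` is not shown in this message. The sketch shows how a double-quote-only pattern returns `None` on the generated file, and how a quote-agnostic pattern would match.

```python
import re

# Content as generated by `alembic revision`, with single quotes.
generated = "revision = 'cd7be1ae8b80'\ndown_revision = '05234396c6fc'\n"

# Hypothetical strict pattern: only matches double-quoted values.
DOUBLE_ONLY = re.compile(r'^revision = "(\w+)"', re.MULTILINE)
# Hypothetical lenient pattern: accepts either quote style.
EITHER_QUOTE = re.compile(r"""^revision = ["'](\w+)["']""", re.MULTILINE)

strict_match = DOUBLE_ONLY.search(generated)
lenient_match = EITHER_QUOTE.search(generated)

# Calling .group() on a failed match raises the error seen in the traceback.
try:
    strict_match.group(1)
    error = ""
except AttributeError as exc:
    error = str(exc)

revision_id = lenient_match.group(1)
```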
@ashb
Member

ashb commented Oct 22, 2024

A neat trick to avoid the need for pgcrypto extension: https://brandur.org/fragments/secure-bytes-without-pgcrypto

Though uuid_send() fn it uses doesn't exist on Pg10, it only comes from Pg11 onwards. However I do think it's reasonable to say 12 or 13 is the min required Pg version for Airflow 3 (Postgres only supports v13 onwards).

Oh, Pg 12 is officially all we support anyway, so I think using the uuid_send() function approach could work fine.

kaxil added a commit to astronomer/airflow that referenced this pull request Oct 22, 2024
Lee-W pushed a commit that referenced this pull request Oct 22, 2024
harjeevanmaan pushed a commit to harjeevanmaan/airflow that referenced this pull request Oct 23, 2024
PaulKobow7536 pushed a commit to PaulKobow7536/airflow that referenced this pull request Oct 24, 2024
@kaxil kaxil force-pushed the add-ti-uuid branch 10 times, most recently from 6b121ca to a25a775 Compare October 26, 2024 21:46
@kaxil
Member Author

kaxil commented Oct 27, 2024

@kaxil kaxil force-pushed the add-ti-uuid branch 2 times, most recently from 672047f to 3c61ba7 Compare October 27, 2024 01:57
@kaxil kaxil added the legacy api Whether legacy API changes should be allowed in PR label Oct 27, 2024
@kaxil kaxil closed this Oct 27, 2024
@kaxil kaxil reopened this Oct 27, 2024
@kaxil kaxil force-pushed the add-ti-uuid branch 2 times, most recently from 5e3682a to 92fc37b Compare October 28, 2024 00:31
@kaxil kaxil marked this pull request as ready for review October 28, 2024 00:31
@ashb
Member

ashb commented Oct 28, 2024

I do think long-term/by release of 3.0 TaskInstanceHistory should also have a uuid.

Member

@ashb ashb left a comment

Yeah, our use of session.merge() in Airflow is v. bad :(

@potiuk
Member

potiuk commented Oct 28, 2024

Yeah, our use of session.merge() in Airflow is v. bad :(

Agree. This is one of the reasons I wanted to avoid all those db-schema changes in the initial multi-team proposal and proposed the team-id prefix: I was afraid we would get similar ripple effects that end up rewriting a big part of our database code. That seems to have been a pretty justified worry.

@potiuk
Member

potiuk commented Oct 28, 2024

BTW, one way to help with this: we could potentially generate the unique id by hashing the remaining fields. With an appropriate hashing algorithm we would have a very low probability of collision (and maybe we could implement a mechanism to detect collisions and handle them the way hash maps do). That would be another way to approach it, and the merge code could be simpler and still use session.merge().

But this one also comes with its own set of difficulties, and collision handling is likely going to be complex. But I thought it's worth mentioning here as an option.
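A minimal sketch of the hashing idea, using the standard library's name-based `uuid.uuid5` (SHA-1) as one possible hashing scheme. The comment above does not name an algorithm, so this choice, the namespace URL, and the helper name `hashed_ti_id` are all assumptions for illustration.

```python
import uuid

# Hypothetical namespace for task instance ids.
TI_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/task_instance")


def hashed_ti_id(dag_id: str, task_id: str, run_id: str, map_index: int = -1) -> uuid.UUID:
    """Derive a deterministic id from the composite fields."""
    # Join with a separator that is unlikely to appear in the fields,
    # so ("a", "bc") and ("ab", "c") do not produce the same name.
    name = "\x1f".join([dag_id, task_id, run_id, str(map_index)])
    return uuid.uuid5(TI_NAMESPACE, name)


id_a = hashed_ti_id("example_dag", "example_task", "run_1")
id_b = hashed_ti_id("example_dag", "example_task", "run_1")
id_c = hashed_ti_id("example_dag", "example_task", "run_1", map_index=0)
```

The same fields always map to the same id, so a primary-key merge would find the existing row. The ids carry no creation-time ordering, though, and adding a field to the key changes every derived id.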

@potiuk
Member

potiuk commented Oct 28, 2024

One problem with generating the unique hash from other fields is that it will be even more difficult to add new fields. So getting rid of session.merge() everywhere, and replacing it with our own way of merging objects and database entries (retrieving the unique id from the DB based on the unique fields before merging), is likely the better long-term approach.
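The "retrieve the id by the unique fields, then merge" approach might look roughly like this. The sketch uses the stdlib `sqlite3` module and a simplified table rather than SQLAlchemy and the real model; `save_ti` is a hypothetical helper, and `uuid.uuid4()` is a placeholder for UUID v7.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE task_instance (
        id TEXT PRIMARY KEY,
        dag_id TEXT NOT NULL,
        task_id TEXT NOT NULL,
        run_id TEXT NOT NULL,
        map_index INTEGER NOT NULL DEFAULT -1,
        state TEXT,
        UNIQUE (dag_id, task_id, run_id, map_index)
    )
    """
)


def save_ti(dag_id: str, task_id: str, run_id: str, map_index: int = -1, state=None) -> str:
    """Update the row matching the unique fields, or insert a new one."""
    row = conn.execute(
        "SELECT id FROM task_instance "
        "WHERE dag_id = ? AND task_id = ? AND run_id = ? AND map_index = ?",
        (dag_id, task_id, run_id, map_index),
    ).fetchone()
    if row is not None:
        # Existing record: reuse its id instead of generating a new one.
        conn.execute("UPDATE task_instance SET state = ? WHERE id = ?", (state, row[0]))
        return row[0]
    new_id = str(uuid.uuid4())  # placeholder for a UUID v7
    conn.execute(
        "INSERT INTO task_instance (id, dag_id, task_id, run_id, map_index, state) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (new_id, dag_id, task_id, run_id, map_index, state),
    )
    return new_id


first_id = save_ti("example_dag", "example_task", "run_1", state="queued")
second_id = save_ti("example_dag", "example_task", "run_1", state="running")
rows = conn.execute("SELECT id, state FROM task_instance").fetchall()
```

Both calls resolve to the same row, so the second call updates the state instead of violating the unique constraint.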

@kaxil kaxil merged commit ce0c1c0 into apache:main Oct 28, 2024
1 check passed
@kaxil kaxil deleted the add-ti-uuid branch October 28, 2024 12:40
@kaxil
Member Author

kaxil commented Oct 28, 2024

@potiuk Yeah, I considered hashing, but it is bad for database indexing and, as a result, for querying, since hashed ids have no temporal ordering. That is on top of the collision-handling complexity.

UUID v7 is explicitly designed to support distributed databases with high insert rates due to its temporal ordering.

BTW, one way to help with this: we could potentially generate the unique id by hashing the remaining fields. With an appropriate hashing algorithm we would have a very low probability of collision (and maybe we could implement a mechanism to detect collisions and handle them the way hash maps do). That would be another way to approach it, and the merge code could be simpler and still use session.merge().

But this one also comes with its own set of difficulties, and collision handling is likely going to be complex. But I thought it's worth mentioning here as an option.

@kaxil
Member Author

kaxil commented Oct 28, 2024

Yeah, our use of session.merge() in Airflow is v. bad :(

Indeed

@potiuk
Member

potiuk commented Oct 28, 2024

UUID v7 is explicitly designed to support distributed databases with high insert rates due to its temporal ordering.

Yep, that settles it as well. We would have to implement our hashing in a similar way (if possible at all).

kaxil added a commit to astronomer/airflow that referenced this pull request Nov 4, 2024
apache#43243 added Task Instance "id" as primary key. This PR passes the same API to API responses.
kaxil added a commit that referenced this pull request Nov 4, 2024
#43243 added Task Instance "id" as primary key. This PR passes the same API to API responses.
ellisms pushed a commit to ellisms/airflow that referenced this pull request Nov 13, 2024
ellisms pushed a commit to ellisms/airflow that referenced this pull request Nov 13, 2024
* Migrate `TaskInstance` to UUID v7 primary key

closes apache#43161 part of [AIP-72](https://github.com/orgs/apache/projects/405).

As part of the ongoing work for [AIP-72: Task Execution Interface](https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-72+Task+Execution+Interface+aka+Task+SDK), we are migrating the task_instance table to use a UUID primary key. This change is being made to simplify task instance identification, especially when communicating between the executor and workers.

Currently, the primary key of task_instance is a composite key consisting of `dag_id, task_id, run_id, and map_index` as shown below. This migration introduces a **UUID v7** column (`id`) as the new primary key.

https://github.com/apache/airflow/blob/b4269f33c7151e6d61e07333003ec1e219285b07/airflow/models/taskinstance.py#L1815-L1819

The UUID v7 format was chosen because of its improved temporal sorting capabilities. For existing records, UUID v7 will be generated using either the queued_dttm, start_date, or the current timestamp.

<img width="792" alt="image" src="https://github.com/user-attachments/assets/ba68c9ae-4f9d-4cd2-8504-1b671d23ef6c">

(From [this blog post](https://www.toomanyafterthoughts.com/uuids-are-bad-for-database-index-performance-uuid7).)

1. **Migrated Primary Key to UUID v7**
   - Replaced the composite primary key (`dag_id`, `task_id`, `run_id`, `map_index`) with a UUID v7 `id` field, ensuring temporal sorting and simplified task instance identification.

2. **Database-Specific UUID v7 Functions**
   - Added UUID v7 functions for each database:
      - **PostgreSQL**: Uses `pgcrypto` for generation with fallback.
      - **MySQL**: Custom deterministic UUID v7 function.
      - **SQLite**: Utilizes `uuid6` Python package.

3. **Updated Constraints and Indexes**
   - Added `UniqueConstraint` on (`dag_id`, `task_id`, `run_id`, `map_index`) for compatibility.
   - Modified foreign key constraints for the new primary key, handling downgrades to restore previous constraints.

4. **Model and API Adjustments**
   - Updated the `TaskInstance` model to use UUID v7 as the primary key via the [`uuid6`](https://pypi.org/project/uuid6/) library, which provides uuid7! 😄
   - Adjusted REST API, views, and queries to support UUID-based lookups.
   - Modified tests for compatibility with the new primary key.
ellisms pushed a commit to ellisms/airflow that referenced this pull request Nov 13, 2024
apache#43243 added Task Instance "id" as primary key. This PR passes the same API to API responses.
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request May 27, 2025

GitOrigin-RevId: ae90a8f49194ff112dee752a4c99677ee01cf46b
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request May 27, 2025
apache/airflow#43243 added Task Instance "id" as primary key. This PR passes the same API to API responses.

GitOrigin-RevId: 8b0ac8d5302623d5ec0e19d108f337f8916c1109
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Sep 22, 2025
While working on this apache/airflow#43243, I was following https://github.com/apache/airflow/blob/main/contributing-docs/13_metadata_database_updates.rst and I ran into an error with  pre-commit hook.

When running the revision command as follows:

```
root@f1f78138ad78:/opt/airflow/airflow# alembic revision -m "New revision"
  Generating /opt/airflow/airflow/migrations/versions/01b38be821e9_new_revision.py ...  done
```

It creates a file as follows:

```python
"""New revision

Revision ID: cd7be1ae8b80
Revises: 05234396c6fc
Create Date: 2024-10-22 01:44:17.873864

"""

import sqlalchemy as sa
from alembic import op

# revision identifiers, used by Alembic.
revision = 'cd7be1ae8b80'
down_revision = '05234396c6fc'
branch_labels = None
depends_on = None
```

Notice the single quotes in `revision` and `down_revision`.

Now if I run just that single pre-commit hook (`update-migration-references`), it fails:

```
❯ pre-commit run "update-migration-references" --all-files
Update migration ref doc.................................................Failed
- hook id: update-migration-references
- exit code: 1

Using 'uv' to install Airflow

Using airflow version from current sources

Updating migration reference for airflow
Making sure airflow version updated
Making sure there's no mismatching revision numbers
Traceback (most recent call last):
  File "/opt/airflow/scripts/in_container/run_migration_reference.py", line 246, in <module>
    correct_mismatching_revision_nums(revisions=revisions)
  File "/opt/airflow/scripts/in_container/run_migration_reference.py", line 230, in correct_mismatching_revision_nums
    new_content = content.replace(revision_id_match.group(1), revision_match.group(1), 1)
AttributeError: 'NoneType' object has no attribute 'group'
Error 1 returned

If you see strange stacktraces above, run `breeze ci-image build --python 3.9` and try again.
```

This is not usually a problem, because `ruff` fails first and converts the single quotes to double quotes. Rather than rely on that, this change fixes the quoting at the source.
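
To illustrate the failure mode in the traceback, here is a minimal sketch. The actual regular expression used by `run_migration_reference.py` is not shown in this conversation, so `STRICT_REVISION` is an assumed, hypothetical pattern that only accepts double quotes, and `TOLERANT_REVISION` and `find_revision` are likewise illustrative names, not code from the script.

```python
import re

# Hypothetical pattern: assumes the hook only recognises double-quoted values.
STRICT_REVISION = re.compile(r'^revision = "([0-9a-f]+)"$', re.MULTILINE)

# Illustrative alternative: accepts either quote style and requires the
# closing quote to match the opening one via the \1 backreference.
TOLERANT_REVISION = re.compile(r"^revision = (['\"])([0-9a-f]+)\1$", re.MULTILINE)

# The identifiers as `alembic revision` wrote them (single quotes).
GENERATED = """\
revision = 'cd7be1ae8b80'
down_revision = '05234396c6fc'
"""


def find_revision(content, pattern, group):
    """Return the captured revision id, or None when the pattern does not match."""
    match = pattern.search(content)
    if match is None:
        # Calling match.group(...) here would raise the AttributeError
        # seen in the traceback, so guard against it explicitly.
        return None
    return match.group(group)


strict_result = find_revision(GENERATED, STRICT_REVISION, 1)
tolerant_result = find_revision(GENERATED, TOLERANT_REVISION, 2)
```

With single-quoted input, the strict pattern finds nothing, which is the situation in which an unguarded `.group(1)` call fails. Once the generated file uses double quotes, both patterns match.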

GitOrigin-RevId: ae90a8f49194ff112dee752a4c99677ee01cf46b

Labels

- area:db-migrations: PRs with DB migration
- kind:documentation
- legacy api: Whether legacy API changes should be allowed in PR

Development

Successfully merging this pull request may close these issues.

Make Task Instance primary key be a UUID
