Add dataset model#24613

Merged
dstandish merged 21 commits into apache:main from astronomer:add-dataset-model
Jun 27, 2022
Conversation

@dstandish
Contributor

@dstandish dstandish commented Jun 23, 2022

Add dataset model for AIP-48

@ashb
Member

ashb commented Jun 23, 2022

Question: Do we want this model to be used directly in DAGs by authors, or do we want a different class to be used there?

Main reason we might want to avoid using this:

  • Loading SQLA and all of the models is slow (Airflow does this right now at "boot" anyway, but I am working on making that not be the case)
  • The import path (airflow.models.dataset) though we can fix that with a lazy import in airflow/__init__.py like we do DAG and XComArg.

@dstandish
Contributor Author

Question: Do we want this model to be used directly in DAGs by authors, or do we want a different class to be used there?

Main reason we might want to avoid using this:

  • Loading SQLA and all of the models is slow (Airflow does this right now at "boot" anyway, but I am working on making that not be the case)
  • The import path (airflow.models.dataset) though we can fix that with a lazy import in airflow/__init__.py like we do DAG and XComArg.

I don't have a strong opinion about it, but I don't think the speed of the import right now is anywhere close to "problematic" for the dag author experience. And yeah, we can add that import. But you tell me, what do you think? Would you add Dataset and DatasetModel, like is done with DAG now? Where would you locate the dataset class?

@dstandish dstandish marked this pull request as ready for review June 23, 2022 16:53
Member

@kaxil kaxil left a comment

Minor comments but lgtm

@kaxil
Member

kaxil commented Jun 23, 2022

Question: Do we want this model to be used directly in DAGs by authors, or do we want a different class to be used there?
Main reason we might want to avoid using this:

  • Loading SQLA and all of the models is slow (Airflow does this right now at "boot" anyway, but I am working on making that not be the case)
  • The import path (airflow.models.dataset) though we can fix that with a lazy import in airflow/__init__.py like we do DAG and XComArg.

I don't have a strong opinion about it, but I don't think the speed of the import right now is anywhere close to "problematic" for the dag author experience. And yeah, we can add that import. But you tell me, what do you think? Would you add Dataset and DatasetModel, like is done with DAG now? Where would you locate the dataset class?

I would slightly favour a separate DatasetModel and Dataset, so Dataset becomes an extensible class and DatasetModel just stores the info about the class. That way users don't need to care about SQLAlchemy stuff when extending it.
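As an illustrative sketch of that split (class names follow the discussion, but the fields and the `from_public` helper are hypothetical, not the eventual Airflow implementation): the user-facing class is a plain, extensible Python object, while the ORM class, which in real code would subclass the SQLAlchemy declarative `Base` with `Column` definitions, just mirrors its data.

```python
from typing import Optional


class Dataset:
    """User-facing, extensible class: a URI plus optional metadata."""

    def __init__(self, uri: str, extra: Optional[dict] = None):
        self.uri = uri
        self.extra = extra


class DatasetModel:
    """Stand-in for the ORM class; in real code this would be a
    SQLAlchemy declarative model rather than a plain object."""

    def __init__(self, uri: str, extra: Optional[dict] = None):
        self.uri = uri
        self.extra = extra

    @classmethod
    def from_public(cls, dataset: Dataset) -> "DatasetModel":
        # Build the DB-facing record from the user-facing object, so DAG
        # authors never touch SQLAlchemy when subclassing Dataset.
        return cls(uri=dataset.uri, extra=dataset.extra)
```

The point of the split is the one kaxil raises: users extend `Dataset` freely, and only scheduler internals deal with the mapped class.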

@dstandish
Contributor Author

I would slightly favour a separate DatasetModel and Dataset, so Dataset becomes an extensible class and DatasetModel just stores the info about the class. That way users don't need to care about SQLAlchemy stuff when extending it.

Sure, like I say, I don't have a strong feeling. Where would you put it? In the same module? Currently the dataset class doesn't do anything anyway, so maybe I'll wait until there's actually something implemented before splitting.

@uranusjr
Member

We put DAG and DagModel in the same module, so the same can be done here.

@dstandish
Contributor Author

We put DAG and DagModel in the same module, so the same can be done here.

Yeah, I know about DagModel @uranusjr, but @ashb's comment indicated to me that putting it in a different location was one motivation for using a separate class:

Main reason we might want to avoid using this:
Loading SQLA and all of the models is slow (Airflow does this right now at "boot" anyway, but I am working on making that not be the case)
The import path (airflow.models.dataset) though we can fix that with a lazy import in airflow/__init__.py like we do DAG and XComArg.

But I might be misunderstanding that.

@ashb
Member

ashb commented Jun 24, 2022

You understood correctly @dstandish, but that said, we don't have a place for those extra classes to live. So for now, follow DAG/DagModel (though I hate the Model suffix) so we can get this merged and unblock the rest of the work, and keep them both in the same package. Later (possibly as part of AIP-44 https://wiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Internal+API) we can move the user-facing class somewhere else.

@dstandish
Contributor Author

I went with length=3000. We could push it to 3072 (since we're using ascii collation on MySQL and that's the max), but 🤷
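For context on where 3072 comes from: InnoDB's maximum index key size is 3072 bytes, and with an ascii collation each character occupies exactly one byte, so the whole byte budget is available in characters (under utf8mb4, at 4 bytes per character, it would shrink to 768). A quick check of that arithmetic:

```python
# InnoDB's maximum index key length in bytes (with large index prefixes).
INNODB_MAX_KEY_BYTES = 3072

# Bytes per character under the two collations discussed in this thread.
ASCII_BYTES_PER_CHAR = 1
UTF8MB4_BYTES_PER_CHAR = 4

# Maximum indexable VARCHAR length in characters for each collation.
max_chars_ascii = INNODB_MAX_KEY_BYTES // ASCII_BYTES_PER_CHAR      # 3072
max_chars_utf8mb4 = INNODB_MAX_KEY_BYTES // UTF8MB4_BYTES_PER_CHAR  # 768
```

So length=3000 leaves a little headroom under the ascii ceiling, and 3072 would be the hard limit.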

@dstandish dstandish force-pushed the add-dataset-model branch from d731931 to 268c590 Compare June 24, 2022 15:37
@dstandish dstandish force-pushed the add-dataset-model branch from 268c590 to d7f2fd9 Compare June 24, 2022 15:59
@github-actions

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Jun 24, 2022
@apache apache deleted a comment from skyboi1233 Jun 24, 2022
@dstandish dstandish force-pushed the add-dataset-model branch from 0cc7615 to 474b485 Compare June 27, 2022 17:16
@dstandish
Contributor Author

Alright, time to put this one out of its misery.

@dstandish dstandish merged commit dee5ba3 into apache:main Jun 27, 2022
@dstandish dstandish deleted the add-dataset-model branch June 27, 2022 20:39
@jedcunningham jedcunningham added AIP-48 changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) labels Jun 27, 2022
@jedcunningham jedcunningham added this to the Airflow 2.4.0 milestone Sep 15, 2022
mpeteuil added a commit to mpeteuil/airflow that referenced this pull request Feb 21, 2024
Currently DAGs accept a
[`Collection["Dataset"]`](https://github.com/apache/airflow/blob/0c02ead4d8a527cbf0a916b6344f255c520e637f/airflow/models/dag.py#L171)
as an option for the `schedule`, but that collection cannot be a `set`
because Datasets are not a hashable type. The interesting thing is that
[the `DatasetModel` is actually already
hashable](https://github.com/apache/airflow/blob/dec78ab3f140f35e507de825327652ec24d03522/airflow/models/dataset.py#L93-L100),
so this introduces a bit of duplication since it's the same
implementation. However, Airflow users are primarily interfacing with
`Dataset`, not `DatasetModel` so I think it makes sense for `Dataset` to
be hashable. I'm not sure how to square the duplication or what `__eq__`
and `__hash__` provide for `DatasetModel` though.

There was discussion on the original PR that created the `Dataset`
(apache#24613) about whether to create
two classes or one. In that discussion @kaxil mentioned:

> I would slightly favour a separate `DatasetModel` and `Dataset` so
> `Dataset` becomes an extensible class, and `DatasetModel` just stores
> the info about the class. So users don't need to care about SQLAlchemy
> stuff when extending it.

That first PR created `Dataset` as both the SQLAlchemy model and the
user-facing class, though. It wasn't until later on
(apache#25727) that the `DatasetModel`
got broken out from `Dataset` and one became two. That provides a bit of
background on why they both exist for anyone reading this who is
curious.
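The hashability change the commit message describes can be sketched as follows: hash and compare on the `uri`, mirroring what the message says `DatasetModel` already does. This is a minimal illustration, not the exact Airflow code.

```python
from typing import Optional


class Dataset:
    def __init__(self, uri: str, extra: Optional[dict] = None):
        self.uri = uri
        self.extra = extra

    def __eq__(self, other):
        # Two datasets are the same if they point at the same URI.
        if isinstance(other, Dataset):
            return self.uri == other.uri
        return NotImplemented

    def __hash__(self):
        # Consistent with __eq__: equal URIs hash equally, so Dataset
        # instances can live in sets and be used as dict keys.
        return hash(self.uri)
```

With this in place, `{Dataset("a"), Dataset("a")}` collapses to one element, which is exactly what passing a `set` as the `schedule` collection requires.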
potiuk pushed a commit that referenced this pull request Feb 21, 2024
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Jul 19, 2024
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Sep 20, 2024
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Nov 9, 2024
@eladkal eladkal added area:data-aware-scheduling assets, datasets, AIP-48 and removed AIP-48 labels Mar 25, 2025
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request May 5, 2025
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request May 26, 2025
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Sep 21, 2025
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Oct 19, 2025

Labels

area:data-aware-scheduling — assets, datasets, AIP-48
changelog:skip — Changes that should be skipped from the changelog (CI, tests, etc..)
full tests needed — We need to run full set of tests for this PR to merge
kind:documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants