Add dataset model#24613

Merged
dstandish merged 21 commits into apache:main from astronomer:add-dataset-model
Jun 27, 2022
Conversation

@dstandish
Contributor

@dstandish dstandish commented Jun 23, 2022

Add dataset model for AIP-48

@ashb
Member

ashb commented Jun 23, 2022

Question: Do we want this model to be used directly in DAGs by authors, or do we want a different class to be used there?

Main reason we might want to avoid using this:

  • Loading SQLA and all of the models is slow (Airflow does this right now at "boot" anyway, but I am working on making that not be the case)
  • The import path (airflow.models.dataset) though we can fix that with a lazy import in airflow/__init__.py like we do DAG and XComArg.

@dstandish
Contributor Author

Question: Do we want this model to be used directly in DAGs by authors, or do we want a different class to be used there?

Main reason we might want to avoid using this:

  • Loading SQLA and all of the models is slow (Airflow does this right now at "boot" anyway, but I am working on making that not be the case)
  • The import path (airflow.models.dataset) though we can fix that with a lazy import in airflow/__init__.py like we do DAG and XComArg.

I don't have a strong opinion about it, but I don't think the speed of the import right now is anywhere close to "problematic" for the dag author experience. And yeah, we can add that import. But you tell me, what do you think? Would you add Dataset and DatasetModel, like is done with DAG now? Where would you locate the dataset class?

@dstandish dstandish marked this pull request as ready for review June 23, 2022 16:53
Member

@kaxil kaxil left a comment

Minor comments but lgtm

@kaxil
Member

kaxil commented Jun 23, 2022

Question: Do we want this model to be used directly in DAGs by authors, or do we want a different class to be used there?
Main reason we might want to avoid using this:

  • Loading SQLA and all of the models is slow (Airflow does this right now at "boot" anyway, but I am working on making that not be the case)
  • The import path (airflow.models.dataset) though we can fix that with a lazy import in airflow/__init__.py like we do DAG and XComArg.

I don't have a strong opinion about it, but I don't think the speed of the import right now is anywhere close to "problematic" for the dag author experience. And yeah, we can add that import. But you tell me, what do you think? Would you add Dataset and DatasetModel, like is done with DAG now? Where would you locate the dataset class?

I would slightly favour a separate DatasetModel and Dataset, so Dataset becomes an extensible class and DatasetModel just stores the info about the class. That way users don't need to care about SQLAlchemy stuff when extending it.
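As an illustrative sketch of that split (class names follow the discussion, but the fields and the `from_public` helper are hypothetical, not the eventual Airflow implementation): the user-facing class is a plain, extensible Python object, while the ORM class, which in real code would subclass the SQLAlchemy declarative `Base` with `Column` definitions, just mirrors its data.

```python
from typing import Optional


class Dataset:
    """User-facing, extensible class: a URI plus optional metadata."""

    def __init__(self, uri: str, extra: Optional[dict] = None):
        self.uri = uri
        self.extra = extra


class DatasetModel:
    """Stand-in for the ORM class; in real code this would be a
    SQLAlchemy declarative model rather than a plain object."""

    def __init__(self, uri: str, extra: Optional[dict] = None):
        self.uri = uri
        self.extra = extra

    @classmethod
    def from_public(cls, dataset: Dataset) -> "DatasetModel":
        # Build the DB-facing record from the user-facing object, so DAG
        # authors never touch SQLAlchemy when subclassing Dataset.
        return cls(uri=dataset.uri, extra=dataset.extra)
```

The point of the split is the one kaxil raises: users extend `Dataset` freely, and only scheduler internals deal with the mapped class.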

@dstandish
Contributor Author

I would slightly favour a separate DatasetModel and Dataset, so Dataset becomes an extensible class and DatasetModel just stores the info about the class. That way users don't need to care about SQLAlchemy stuff when extending it.

Sure, like I say, I don't have a strong feeling. Where would you put it? In the same module? Currently the dataset class doesn't do anything anyway, so maybe I'll wait until there's actually something implemented before splitting.

@uranusjr
Member

We put DAG and DagModel in the same module, so the same can be done here.

@dstandish
Contributor Author

We put DAG and DagModel in the same module, so the same can be done here.

Yeah, I know about DagModel @uranusjr, but @ashb's comment indicated to me that putting it in a different location was one motivation for using a separate class:

Main reason we might want to avoid using this:
Loading SQLA and all of the models is slow (Airflow does this right now at "boot" anyway, but I am working on making that not be the case)
The import path (airflow.models.dataset) though we can fix that with a lazy import in airflow/__init__.py like we do DAG and XComArg.

But I might be misunderstanding that.

@ashb
Member

ashb commented Jun 24, 2022

You understood correctly @dstandish, but that said, we don't have a place for those extra classes to live. So for now, follow DAG/DagModel (though I hate the Model suffix) so we can get this merged and unblock the rest of the work, and keep them both in the same package. Later (possibly as part of AIP-44 https://wiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Internal+API) we can move the user-facing class somewhere else.

@dstandish
Contributor Author

I went with length=3000. We could push it to 3072 (since we're using ascii collation on MySQL and that's the max), but 🤷
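For context on where 3072 comes from: InnoDB's maximum index key size is 3072 bytes, and with an ascii collation each character occupies exactly one byte, so the whole byte budget is available in characters (under utf8mb4, at 4 bytes per character, it would shrink to 768). A quick check of that arithmetic:

```python
# InnoDB's maximum index key length in bytes (with large index prefixes).
INNODB_MAX_KEY_BYTES = 3072

# Bytes per character under the two collations discussed in this thread.
ASCII_BYTES_PER_CHAR = 1
UTF8MB4_BYTES_PER_CHAR = 4

# Maximum indexable VARCHAR length in characters for each collation.
max_chars_ascii = INNODB_MAX_KEY_BYTES // ASCII_BYTES_PER_CHAR      # 3072
max_chars_utf8mb4 = INNODB_MAX_KEY_BYTES // UTF8MB4_BYTES_PER_CHAR  # 768
```

So length=3000 leaves a little headroom under the ascii ceiling, and 3072 would be the hard limit.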

@dstandish dstandish force-pushed the add-dataset-model branch from d731931 to 268c590 Compare June 24, 2022 15:37
@dstandish dstandish force-pushed the add-dataset-model branch from 268c590 to d7f2fd9 Compare June 24, 2022 15:59
@github-actions

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Jun 24, 2022
@apache apache deleted a comment from skyboi1233 Jun 24, 2022
@dstandish dstandish force-pushed the add-dataset-model branch from 0cc7615 to 474b485 Compare June 27, 2022 17:16
@dstandish
Contributor Author

Alright, time to put this one out of its misery.

@dstandish dstandish merged commit dee5ba3 into apache:main Jun 27, 2022
@dstandish dstandish deleted the add-dataset-model branch June 27, 2022 20:39
@jedcunningham jedcunningham added AIP-48 changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) labels Jun 27, 2022
@jedcunningham jedcunningham added this to the Airflow 2.4.0 milestone Sep 15, 2022
mpeteuil added a commit to mpeteuil/airflow that referenced this pull request Feb 21, 2024
Currently DAGs accept a
[`Collection["Dataset"]`](https://github.com/apache/airflow/blob/0c02ead4d8a527cbf0a916b6344f255c520e637f/airflow/models/dag.py#L171)
as an option for the `schedule`, but that collection cannot be a `set`
because Datasets are not a hashable type. The interesting thing is that
[the `DatasetModel` is actually already
hashable](https://github.com/apache/airflow/blob/dec78ab3f140f35e507de825327652ec24d03522/airflow/models/dataset.py#L93-L100),
so this introduces a bit of duplication since it's the same
implementation. However, Airflow users are primarily interfacing with
`Dataset`, not `DatasetModel` so I think it makes sense for `Dataset` to
be hashable. I'm not sure how to square the duplication or what `__eq__`
and `__hash__` provide for `DatasetModel` though.

There was discussion on the original PR that created the `Dataset`
(apache#24613) about whether to create
two classes or one. In that discussion @kaxil mentioned:

> I would slightly favour a separate `DatasetModel` and `Dataset` so
> `Dataset` becomes an extensible class, and `DatasetModel` just stores
> the info about the class. So users don't need to care about SQLAlchemy
> stuff when extending it.

That first PR created `Dataset` as both the SQLAlchemy model and the
user-facing class, though. It wasn't until later on
(apache#25727) that the `DatasetModel`
got broken out from `Dataset` and one became two. That provides a bit of
background on why they both exist for anyone reading this who is
curious.
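The hashability change the commit message describes can be sketched as follows: hash and compare on the `uri`, mirroring what the message says `DatasetModel` already does. This is a minimal illustration, not the exact Airflow code.

```python
from typing import Optional


class Dataset:
    def __init__(self, uri: str, extra: Optional[dict] = None):
        self.uri = uri
        self.extra = extra

    def __eq__(self, other):
        # Two datasets are the same if they point at the same URI.
        if isinstance(other, Dataset):
            return self.uri == other.uri
        return NotImplemented

    def __hash__(self):
        # Consistent with __eq__: equal URIs hash equally, so Dataset
        # instances can live in sets and be used as dict keys.
        return hash(self.uri)
```

With this in place, `{Dataset("a"), Dataset("a")}` collapses to one element, which is exactly what passing a `set` as the `schedule` collection requires.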
potiuk pushed a commit that referenced this pull request Feb 21, 2024
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Jul 19, 2024
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Sep 20, 2024
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Nov 9, 2024
@eladkal eladkal added area:data-aware-scheduling assets, datasets, AIP-48 and removed AIP-48 labels Mar 25, 2025
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request May 5, 2025
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request May 26, 2025
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Sep 21, 2025
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Oct 19, 2025

Labels

area:data-aware-scheduling — assets, datasets, AIP-48
changelog:skip — Changes that should be skipped from the changelog (CI, tests, etc..)
full tests needed — We need to run full set of tests for this PR to merge
kind:documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants