Skip to content

Separate "public" Dataset class from SQLA model#25727

Merged
ashb merged 4 commits intoapache:mainfrom
astronomer:dataset-model-not-user-facing
Aug 16, 2022
Merged

Separate "public" Dataset class from SQLA model#25727
ashb merged 4 commits intoapache:mainfrom
astronomer:dataset-model-not-user-facing

Conversation

@ashb
Copy link
Member

@ashb ashb commented Aug 15, 2022

Importing all of SQLA can be very heavy weight (slowing import times),
and it can also make it harder if libraries want to build on top of the
Dataset concept.

Not to mention that the interal-API/db-isolation AIP should mean we
don't want to import any SQLA models directly in user code.

Importing all of SQLA can be very heavy weight (slowing import times),
and it can also make it harder if libraries want to build on top of the
Dataset concept.

Not to mention that the interal-API/db-isolation AIP should mean we
don't want to import any SQLA models directly in user code.
Copy link
Member

@kaxil kaxil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some tests are failing:

==== API postgres: 3 failures ====

tests/api_connexion/endpoints/test_dag_run_endpoint.py::test__get_upstream_dataset_events_no_prior: sqlalchemy.orm.exc.UnmappedInstanceError: Class 'airflow.datasets.Dataset' is not mapped
tests/api_connexion/endpoints/test_dag_run_endpoint.py::test__get_upstream_dataset_events_with_prior: sqlalchemy.orm.exc.UnmappedInstanceError: Class 'airflow.datasets.Dataset' is not mapped
tests/api_connexion/schemas/test_dataset_schema.py::TestDatasetSchema::test_serialize: ValueError: Use of List object with `schedule` param is only supported for List[Dataset].

==== Core postgres: 3 failures ====

tests/models/test_taskinstance.py::TestTaskInstance::test_outlet_datasets: TypeError: unsupported operand type(s) for |: '_GenericAlias' and 'NoneType'
tests/models/test_taskinstance.py::TestTaskInstance::test_outlet_datasets_failed: TypeError: unsupported operand type(s) for |: '_GenericAlias' and 'NoneType'
tests/models/test_taskinstance.py::TestTaskInstance::test_outlet_datasets_skipped: TypeError: unsupported operand type(s) for |: '_GenericAlias' and 'NoneType'

==== Other postgres: 3 failures ====

tests/lineage/test_lineage.py::TestLineage::test_lineage: AttributeError: 'dict' object has no attribute '_context'
tests/lineage/test_lineage.py::TestLineage::test_lineage_render: AttributeError: 'dict' object has no attribute '_context'
tests/lineage/test_lineage.py::TestLineage::test_lineage_is_sent_to_backend: AttributeError: 'dict' object has no attribute '_context'

==== Providers postgres: 1 failure ====

tests/providers/papermill/operators/test_papermill.py::TestPapermillOperator::test_execute: AttributeError: 'dict' object has no attribute '_context'

==== WWW postgres: 1 failure ====

tests/www/views/test_views_grid.py::test_next_run_datasets: TypeError: __init__() got an unexpected keyword argument 'id'

@ashb
Copy link
Member Author

ashb commented Aug 15, 2022

Overzealous refactoring ;) will fix those in the morning

@ashb ashb merged commit b90707e into apache:main Aug 16, 2022
@ashb ashb deleted the dataset-model-not-user-facing branch August 16, 2022 12:59
@jedcunningham jedcunningham added changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) AIP-48 labels Sep 12, 2022
@ephraimbuddy ephraimbuddy added this to the Airflow 2.4.0 milestone Sep 14, 2022
mpeteuil added a commit to mpeteuil/airflow that referenced this pull request Feb 21, 2024
Currently DAGs accept a
[`Collection["Dataset"]`](https://github.com/apache/airflow/blob/0c02ead4d8a527cbf0a916b6344f255c520e637f/airflow/models/dag.py#L171)
as an option for the `schedule`, but that collection cannot be a `set`
because Datasets are not a hashable type. The interesting thing is that
[the `DatasetModel` is actually already
hashable](https://github.com/apache/airflow/blob/dec78ab3f140f35e507de825327652ec24d03522/airflow/models/dataset.py#L93-L100),
so this introduces a bit of duplication since it's the same
implementation. However, Airflow users are primarily interfacing with
`Dataset`, not `DatasetModel` so I think it makes sense for `Dataset` to
be hashable. I'm not sure how to square the duplication or what `__eq__`
and `__hash__` provide for `DatasetModel` though.

There was discussion on the original PR that created the `Dataset`
(apache#24613) about whether to create
two classes or one. In that discussion @kaxil mentioned:

> I would slightly favour a separate `DatasetModel` and `Dataset` so
`Dataset` becomes an extensible class, and `DatasetModel` just stores
the info about the class. So users don't need to care about SQLAlchmey
stuff when extending it.

That first PR created the `Dataset` model as both SQLAlchemy and user
space class though. It wasn't until later on
(apache#25727) that the `DatasetModel`
got broken out from `Dataset` and one became two. That provides a bit of
background on why they both exist for anyone reading this who is
curious.
potiuk pushed a commit that referenced this pull request Feb 21, 2024
Currently DAGs accept a
[`Collection["Dataset"]`](https://github.com/apache/airflow/blob/0c02ead4d8a527cbf0a916b6344f255c520e637f/airflow/models/dag.py#L171)
as an option for the `schedule`, but that collection cannot be a `set`
because Datasets are not a hashable type. The interesting thing is that
[the `DatasetModel` is actually already
hashable](https://github.com/apache/airflow/blob/dec78ab3f140f35e507de825327652ec24d03522/airflow/models/dataset.py#L93-L100),
so this introduces a bit of duplication since it's the same
implementation. However, Airflow users are primarily interfacing with
`Dataset`, not `DatasetModel` so I think it makes sense for `Dataset` to
be hashable. I'm not sure how to square the duplication or what `__eq__`
and `__hash__` provide for `DatasetModel` though.

There was discussion on the original PR that created the `Dataset`
(#24613) about whether to create
two classes or one. In that discussion @kaxil mentioned:

> I would slightly favour a separate `DatasetModel` and `Dataset` so
`Dataset` becomes an extensible class, and `DatasetModel` just stores
the info about the class. So users don't need to care about SQLAlchmey
stuff when extending it.

That first PR created the `Dataset` model as both SQLAlchemy and user
space class though. It wasn't until later on
(#25727) that the `DatasetModel`
got broken out from `Dataset` and one became two. That provides a bit of
background on why they both exist for anyone reading this who is
curious.
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Jul 19, 2024
Currently DAGs accept a
[`Collection["Dataset"]`](https://github.com/apache/airflow/blob/0c02ead4d8a527cbf0a916b6344f255c520e637f/airflow/models/dag.py#L171)
as an option for the `schedule`, but that collection cannot be a `set`
because Datasets are not a hashable type. The interesting thing is that
[the `DatasetModel` is actually already
hashable](https://github.com/apache/airflow/blob/dec78ab3f140f35e507de825327652ec24d03522/airflow/models/dataset.py#L93-L100),
so this introduces a bit of duplication since it's the same
implementation. However, Airflow users are primarily interfacing with
`Dataset`, not `DatasetModel` so I think it makes sense for `Dataset` to
be hashable. I'm not sure how to square the duplication or what `__eq__`
and `__hash__` provide for `DatasetModel` though.

There was discussion on the original PR that created the `Dataset`
(apache/airflow#24613) about whether to create
two classes or one. In that discussion @kaxil mentioned:

> I would slightly favour a separate `DatasetModel` and `Dataset` so
`Dataset` becomes an extensible class, and `DatasetModel` just stores
the info about the class. So users don't need to care about SQLAlchmey
stuff when extending it.

That first PR created the `Dataset` model as both SQLAlchemy and user
space class though. It wasn't until later on
(apache/airflow#25727) that the `DatasetModel`
got broken out from `Dataset` and one became two. That provides a bit of
background on why they both exist for anyone reading this who is
curious.

GitOrigin-RevId: d5d6f4b1885a99f5a5e3063dbd54e1f32700c1df
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Sep 20, 2024
Currently DAGs accept a
[`Collection["Dataset"]`](https://github.com/apache/airflow/blob/0c02ead4d8a527cbf0a916b6344f255c520e637f/airflow/models/dag.py#L171)
as an option for the `schedule`, but that collection cannot be a `set`
because Datasets are not a hashable type. The interesting thing is that
[the `DatasetModel` is actually already
hashable](https://github.com/apache/airflow/blob/dec78ab3f140f35e507de825327652ec24d03522/airflow/models/dataset.py#L93-L100),
so this introduces a bit of duplication since it's the same
implementation. However, Airflow users are primarily interfacing with
`Dataset`, not `DatasetModel` so I think it makes sense for `Dataset` to
be hashable. I'm not sure how to square the duplication or what `__eq__`
and `__hash__` provide for `DatasetModel` though.

There was discussion on the original PR that created the `Dataset`
(apache/airflow#24613) about whether to create
two classes or one. In that discussion @kaxil mentioned:

> I would slightly favour a separate `DatasetModel` and `Dataset` so
`Dataset` becomes an extensible class, and `DatasetModel` just stores
the info about the class. So users don't need to care about SQLAlchmey
stuff when extending it.

That first PR created the `Dataset` model as both SQLAlchemy and user
space class though. It wasn't until later on
(apache/airflow#25727) that the `DatasetModel`
got broken out from `Dataset` and one became two. That provides a bit of
background on why they both exist for anyone reading this who is
curious.

GitOrigin-RevId: d5d6f4b1885a99f5a5e3063dbd54e1f32700c1df
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Nov 9, 2024
Currently DAGs accept a
[`Collection["Dataset"]`](https://github.com/apache/airflow/blob/0c02ead4d8a527cbf0a916b6344f255c520e637f/airflow/models/dag.py#L171)
as an option for the `schedule`, but that collection cannot be a `set`
because Datasets are not a hashable type. The interesting thing is that
[the `DatasetModel` is actually already
hashable](https://github.com/apache/airflow/blob/dec78ab3f140f35e507de825327652ec24d03522/airflow/models/dataset.py#L93-L100),
so this introduces a bit of duplication since it's the same
implementation. However, Airflow users are primarily interfacing with
`Dataset`, not `DatasetModel` so I think it makes sense for `Dataset` to
be hashable. I'm not sure how to square the duplication or what `__eq__`
and `__hash__` provide for `DatasetModel` though.

There was discussion on the original PR that created the `Dataset`
(apache/airflow#24613) about whether to create
two classes or one. In that discussion @kaxil mentioned:

> I would slightly favour a separate `DatasetModel` and `Dataset` so
`Dataset` becomes an extensible class, and `DatasetModel` just stores
the info about the class. So users don't need to care about SQLAlchmey
stuff when extending it.

That first PR created the `Dataset` model as both SQLAlchemy and user
space class though. It wasn't until later on
(apache/airflow#25727) that the `DatasetModel`
got broken out from `Dataset` and one became two. That provides a bit of
background on why they both exist for anyone reading this who is
curious.

GitOrigin-RevId: d5d6f4b1885a99f5a5e3063dbd54e1f32700c1df
@eladkal eladkal added area:data-aware-scheduling assets, datasets, AIP-48 and removed AIP-48 labels Mar 25, 2025
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request May 5, 2025
Currently DAGs accept a
[`Collection["Dataset"]`](https://github.com/apache/airflow/blob/0c02ead4d8a527cbf0a916b6344f255c520e637f/airflow/models/dag.py#L171)
as an option for the `schedule`, but that collection cannot be a `set`
because Datasets are not a hashable type. The interesting thing is that
[the `DatasetModel` is actually already
hashable](https://github.com/apache/airflow/blob/dec78ab3f140f35e507de825327652ec24d03522/airflow/models/dataset.py#L93-L100),
so this introduces a bit of duplication since it's the same
implementation. However, Airflow users are primarily interfacing with
`Dataset`, not `DatasetModel` so I think it makes sense for `Dataset` to
be hashable. I'm not sure how to square the duplication or what `__eq__`
and `__hash__` provide for `DatasetModel` though.

There was discussion on the original PR that created the `Dataset`
(apache/airflow#24613) about whether to create
two classes or one. In that discussion @kaxil mentioned:

> I would slightly favour a separate `DatasetModel` and `Dataset` so
`Dataset` becomes an extensible class, and `DatasetModel` just stores
the info about the class. So users don't need to care about SQLAlchmey
stuff when extending it.

That first PR created the `Dataset` model as both SQLAlchemy and user
space class though. It wasn't until later on
(apache/airflow#25727) that the `DatasetModel`
got broken out from `Dataset` and one became two. That provides a bit of
background on why they both exist for anyone reading this who is
curious.

GitOrigin-RevId: d5d6f4b1885a99f5a5e3063dbd54e1f32700c1df
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request May 26, 2025
Currently DAGs accept a
[`Collection["Dataset"]`](https://github.com/apache/airflow/blob/0c02ead4d8a527cbf0a916b6344f255c520e637f/airflow/models/dag.py#L171)
as an option for the `schedule`, but that collection cannot be a `set`
because Datasets are not a hashable type. The interesting thing is that
[the `DatasetModel` is actually already
hashable](https://github.com/apache/airflow/blob/dec78ab3f140f35e507de825327652ec24d03522/airflow/models/dataset.py#L93-L100),
so this introduces a bit of duplication since it's the same
implementation. However, Airflow users are primarily interfacing with
`Dataset`, not `DatasetModel` so I think it makes sense for `Dataset` to
be hashable. I'm not sure how to square the duplication or what `__eq__`
and `__hash__` provide for `DatasetModel` though.

There was discussion on the original PR that created the `Dataset`
(apache/airflow#24613) about whether to create
two classes or one. In that discussion @kaxil mentioned:

> I would slightly favour a separate `DatasetModel` and `Dataset` so
`Dataset` becomes an extensible class, and `DatasetModel` just stores
the info about the class. So users don't need to care about SQLAlchmey
stuff when extending it.

That first PR created the `Dataset` model as both SQLAlchemy and user
space class though. It wasn't until later on
(apache/airflow#25727) that the `DatasetModel`
got broken out from `Dataset` and one became two. That provides a bit of
background on why they both exist for anyone reading this who is
curious.

GitOrigin-RevId: d5d6f4b1885a99f5a5e3063dbd54e1f32700c1df
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Sep 21, 2025
Currently DAGs accept a
[`Collection["Dataset"]`](https://github.com/apache/airflow/blob/0c02ead4d8a527cbf0a916b6344f255c520e637f/airflow/models/dag.py#L171)
as an option for the `schedule`, but that collection cannot be a `set`
because Datasets are not a hashable type. The interesting thing is that
[the `DatasetModel` is actually already
hashable](https://github.com/apache/airflow/blob/dec78ab3f140f35e507de825327652ec24d03522/airflow/models/dataset.py#L93-L100),
so this introduces a bit of duplication since it's the same
implementation. However, Airflow users are primarily interfacing with
`Dataset`, not `DatasetModel` so I think it makes sense for `Dataset` to
be hashable. I'm not sure how to square the duplication or what `__eq__`
and `__hash__` provide for `DatasetModel` though.

There was discussion on the original PR that created the `Dataset`
(apache/airflow#24613) about whether to create
two classes or one. In that discussion @kaxil mentioned:

> I would slightly favour a separate `DatasetModel` and `Dataset` so
`Dataset` becomes an extensible class, and `DatasetModel` just stores
the info about the class. So users don't need to care about SQLAlchmey
stuff when extending it.

That first PR created the `Dataset` model as both SQLAlchemy and user
space class though. It wasn't until later on
(apache/airflow#25727) that the `DatasetModel`
got broken out from `Dataset` and one became two. That provides a bit of
background on why they both exist for anyone reading this who is
curious.

GitOrigin-RevId: d5d6f4b1885a99f5a5e3063dbd54e1f32700c1df
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Oct 19, 2025
Currently DAGs accept a
[`Collection["Dataset"]`](https://github.com/apache/airflow/blob/0c02ead4d8a527cbf0a916b6344f255c520e637f/airflow/models/dag.py#L171)
as an option for the `schedule`, but that collection cannot be a `set`
because Datasets are not a hashable type. The interesting thing is that
[the `DatasetModel` is actually already
hashable](https://github.com/apache/airflow/blob/dec78ab3f140f35e507de825327652ec24d03522/airflow/models/dataset.py#L93-L100),
so this introduces a bit of duplication since it's the same
implementation. However, Airflow users are primarily interfacing with
`Dataset`, not `DatasetModel` so I think it makes sense for `Dataset` to
be hashable. I'm not sure how to square the duplication or what `__eq__`
and `__hash__` provide for `DatasetModel` though.

There was discussion on the original PR that created the `Dataset`
(apache/airflow#24613) about whether to create
two classes or one. In that discussion @kaxil mentioned:

> I would slightly favour a separate `DatasetModel` and `Dataset` so
`Dataset` becomes an extensible class, and `DatasetModel` just stores
the info about the class. So users don't need to care about SQLAlchmey
stuff when extending it.

That first PR created the `Dataset` model as both SQLAlchemy and user
space class though. It wasn't until later on
(apache/airflow#25727) that the `DatasetModel`
got broken out from `Dataset` and one became two. That provides a bit of
background on why they both exist for anyone reading this who is
curious.

GitOrigin-RevId: d5d6f4b1885a99f5a5e3063dbd54e1f32700c1df
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:API Airflow's REST/HTTP API area:data-aware-scheduling assets, datasets, AIP-48 area:lineage area:serialization area:webserver Webserver related Issues changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants