Skip to content

Make Datasets hashable#37465

Merged
potiuk merged 1 commit intoapache:mainfrom
mpeteuil:hashable-datasets
Feb 21, 2024
Merged

Make Datasets hashable#37465
potiuk merged 1 commit intoapache:mainfrom
mpeteuil:hashable-datasets

Conversation

@mpeteuil
Copy link
Contributor

@mpeteuil mpeteuil commented Feb 15, 2024

Currently DAGs accept a Collection["Dataset"] as an option for the schedule, but that collection cannot be a set because Datasets are not a hashable type. The interesting thing is that the DatasetModel is actually already hashable, so this introduces a bit of duplication since it's mostly the same implementation. However, Airflow users are primarily interfacing with Dataset, not DatasetModel so I think it makes sense for Dataset to be hashable. I'm not sure how to square the duplication or what __eq__ and __hash__ provide for DatasetModel though.

There was discussion on the original PR that created the Dataset (#24613) about whether to create two classes or one. In that discussion @kaxil mentioned:

I would slightly favour a separate DatasetModel and Dataset so Dataset becomes an extensible class, and DatasetModel just stores the info about the class. So users don't need to care about SQLAlchmey stuff when extending it.

That first PR created the Dataset model as both SQLAlchemy and user space class though. It wasn't until later on (#25727) that the DatasetModel got broken out from Dataset and one became two. That provides a bit of background on why they both exist for anyone reading this who is curious.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@potiuk
Copy link
Member

potiuk commented Feb 16, 2024

LGTM. But I would prefer Others to also take a look - maybe there is a good reason for not having it hashable :). And thanks for the thorough research on the past !

BTW. Static checks need fixing on that one (I heartily recommend installing pre-commit and largely stop caring about those failures).

Currently DAGs accept a
[`Collection["Dataset"]`](https://github.com/apache/airflow/blob/0c02ead4d8a527cbf0a916b6344f255c520e637f/airflow/models/dag.py#L171)
as an option for the `schedule`, but that collection cannot be a `set`
because Datasets are not a hashable type. The interesting thing is that
[the `DatasetModel` is actually already
hashable](https://github.com/apache/airflow/blob/dec78ab3f140f35e507de825327652ec24d03522/airflow/models/dataset.py#L93-L100),
so this introduces a bit of duplication since it's the same
implementation. However, Airflow users are primarily interfacing with
`Dataset`, not `DatasetModel` so I think it makes sense for `Dataset` to
be hashable. I'm not sure how to square the duplication or what `__eq__`
and `__hash__` provide for `DatasetModel` though.

There was discussion on the original PR that created the `Dataset`
(apache#24613) about whether to create
two classes or one. In that discussion @kaxil mentioned:

> I would slightly favour a separate `DatasetModel` and `Dataset` so
`Dataset` becomes an extensible class, and `DatasetModel` just stores
the info about the class. So users don't need to care about SQLAlchmey
stuff when extending it.

That first PR created the `Dataset` model as both SQLAlchemy and user
space class though. It wasn't until later on
(apache#25727) that the `DatasetModel`
got broken out from `Dataset` and one became two. That provides a bit of
background on why they both exist for anyone reading this who is
curious.
@potiuk potiuk merged commit d5d6f4b into apache:main Feb 21, 2024
@ephraimbuddy ephraimbuddy added this to the Airflow 2.9.0 milestone Mar 6, 2024
@ephraimbuddy ephraimbuddy added the type:new-feature Changelog: New Features label Mar 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

type:new-feature Changelog: New Features

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants