Question: Do we want this model to be used directly in DAGs by authors, or do we want a different class to be used there? Main reason we might want to avoid using this:
I don't have a strong opinion about it, but I don't think the speed of the import right now is at all close to "problematic" for the DAG author experience. And yeah, we can add that import. But you tell me, what do you think? Would you add Dataset and DatasetModel, like is done with DAG now? Where would you locate the dataset class?
I would slightly favour a separate `DatasetModel` and `Dataset` so `Dataset` becomes an extensible class, and `DatasetModel` just stores the info about the class. So users don't need to care about SQLAlchemy stuff when extending it.
Sure, like I say, I don't have a strong feeling. Where would you put it? In the same module? Currently the dataset class doesn't do anything anyway, so maybe I'll wait until there's actually something implemented before splitting.
We put `DagModel` …
Yeah, I know about `DagModel` @uranusjr, but @ashb's comment indicated to me that putting it in a different location was one motivation for using a separate class, though I might be misunderstanding that.
You understood correctly @dstandish, but that said, we don't have a place for those extra classes to live, so for now follow DAG/DagModel (though I hate the …
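For anyone unfamiliar with the DAG/DagModel pattern being referenced, here is a rough sketch of the split: a plain user-facing class that DAG authors import, plus a separate SQLAlchemy-mapped model that persists it. The class names follow Airflow's, but the bodies (and the `Base`/table details) are illustrative assumptions, not the PR's actual code.

```python
from typing import Optional

from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Dataset:
    """What DAG authors import and extend; no SQLAlchemy involved."""

    def __init__(self, uri: str, extra: Optional[dict] = None):
        self.uri = uri
        self.extra = extra


class DatasetModel(Base):
    """ORM twin that stores the dataset record, like DagModel for DAG."""

    __tablename__ = "dataset"

    id = Column(Integer, primary_key=True, autoincrement=True)
    uri = Column(String(3000), nullable=False, unique=True)
```

Keeping the ORM class out of the user-facing one means extending `Dataset` in a DAG file never drags SQLAlchemy mapping machinery into author code.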
I went with length=3000. We could push it to 3072 (since we're using an ASCII collation on MySQL, and that's the max) but 🤷
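To make the size math concrete, here is a hypothetical SQLAlchemy column along the lines being discussed; the MySQL variant pins an ASCII collation so each character costs one byte, keeping an index on the column under InnoDB's 3072-byte key length limit. The column name and collation string are assumptions, not the PR's exact code.

```python
from sqlalchemy import Column, String

# VARCHAR(3000) everywhere; on MySQL, force a one-byte-per-character
# ASCII collation so a unique index on the column fits within InnoDB's
# 3072-byte maximum key length (3000 chars * 1 byte < 3072 bytes).
uri = Column(
    String(length=3000).with_variant(
        String(length=3000, collation="ascii_general_ci"), "mysql"
    ),
    nullable=False,
)
```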
Force-pushed d731931 to 268c590, then 268c590 to d7f2fd9.
The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly, please rebase it onto the latest main at your convenience, or amend the last commit of the PR, and push it with `--force-with-lease`.
Force-pushed 0cc7615 to 474b485.
Alright, time to put this one out of its misery.
Currently DAGs accept a [`Collection["Dataset"]`](https://github.com/apache/airflow/blob/0c02ead4d8a527cbf0a916b6344f255c520e637f/airflow/models/dag.py#L171) as an option for `schedule`, but that collection cannot be a `set` because `Dataset` is not a hashable type. The interesting thing is that [`DatasetModel` is actually already hashable](https://github.com/apache/airflow/blob/dec78ab3f140f35e507de825327652ec24d03522/airflow/models/dataset.py#L93-L100), so making `Dataset` hashable too introduces a bit of duplication, since it's the same implementation. However, Airflow users primarily interface with `Dataset`, not `DatasetModel`, so I think it makes sense for `Dataset` to be hashable. I'm not sure how to square the duplication, or what `__eq__` and `__hash__` actually buy `DatasetModel`, though.

There was discussion on the original PR that created `Dataset` (apache/airflow#24613) about whether to create two classes or one. In that discussion @kaxil mentioned:

> I would slightly favour a separate `DatasetModel` and `Dataset` so `Dataset` becomes an extensible class, and `DatasetModel` just stores the info about the class. So users don't need to care about SQLAlchemy stuff when extending it.

That first PR created `Dataset` as both the SQLAlchemy model and the user-space class, though. It wasn't until later (apache/airflow#25727) that `DatasetModel` was broken out of `Dataset` and one class became two. That provides a bit of background on why they both exist, for anyone reading this who is curious.
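A minimal sketch of the change the comment above describes, assuming `Dataset` keys identity on its `uri` exactly as the linked `DatasetModel` implementation does; the class body is illustrative, not the PR's verbatim code.

```python
from typing import Optional


class Dataset:
    """User-facing dataset; hashable so a DAG's ``schedule`` can be a set."""

    def __init__(self, uri: str, extra: Optional[dict] = None):
        self.uri = uri
        self.extra = extra

    def __eq__(self, other):
        # Identity is the URI alone, mirroring DatasetModel's existing
        # __eq__/__hash__ pair (hence the duplication noted above).
        if isinstance(other, self.__class__):
            return self.uri == other.uri
        return NotImplemented

    def __hash__(self):
        return hash(self.uri)
```

With that in place, a `set` works anywhere a `Collection[Dataset]` is accepted, and duplicates collapse by URI:

```python
datasets = {Dataset("s3://bucket/a"), Dataset("s3://bucket/b"), Dataset("s3://bucket/a")}
assert len(datasets) == 2
```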
Add dataset model for AIP-48