-
Notifications
You must be signed in to change notification settings - Fork 16.4k
Serialize mapped tasks and task groups #20743
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
5461a41 to
30afd2e
Compare
airflow/models/baseoperator.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t get what this implies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delete the task_id from the kwargs (so it isn't stored in the mapped data structure) -- so del kwargs['task_id'] but without the try/except for when it's not there? I'll double check if I can just do this with a del.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess my question is more like why do we need to treat task_id specially (and why we didn’t need to previously, but only when we add operator mapping)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Task ID was getting stored in partial_kwargs which I hadn't noticed until I looked at the serialized representation of a MappedOperator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about storing task_id separately instead and not in partial_kwargs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is dealing with it in partial_kwargs --
# Store the args passed to init -- we need them to support task.map serialzation!
kwargs.pop('task_id', None)
self._BaseOperator__init_kwargs.update(kwargs) # type: ignore
(The only use of __init_kwargs is for building partial_kwargs when mapping a task)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the other option:
diff --git a/airflow/models/baseoperator.py b/airflow/models/baseoperator.py
index 752b8a6cf..71170320d 100644
--- a/airflow/models/baseoperator.py
+++ b/airflow/models/baseoperator.py
@@ -136,6 +136,7 @@ class BaseOperatorMeta(abc.ABCMeta):
non_optional_args = {
name for (name, param) in non_variadic_params.items() if param.default == param.empty
}
+ non_optional_args -= {'task_id'}
class autostacklevel_warn:
def __init__(self):
@@ -158,7 +159,7 @@ class BaseOperatorMeta(abc.ABCMeta):
func.__globals__['warnings'] = autostacklevel_warn()
@functools.wraps(func)
- def apply_defaults(self: "BaseOperator", *args: Any, **kwargs: Any) -> Any:
+ def apply_defaults(self: "BaseOperator", *args: Any, task_id: str, **kwargs: Any) -> Any:
from airflow.models.dag import DagContext
from airflow.utils.task_group import TaskGroupContext
@@ -201,15 +202,15 @@ class BaseOperatorMeta(abc.ABCMeta):
hook = getattr(self, '_hook_apply_defaults', None)
if hook:
- args, kwargs = hook(**kwargs, default_args=default_args)
+ args, kwargs = hook(task_id=task_id, **kwargs, default_args=default_args)
+ task_id = kwargs.pop('task_id')
default_args = kwargs.pop('default_args', {})
if not hasattr(self, '_BaseOperator__init_kwargs'):
self._BaseOperator__init_kwargs = {}
- result = func(self, **kwargs, default_args=default_args)
+ result = func(self, **kwargs, task_id=task_id, default_args=default_args)
# Store the args passed to init -- we need them to support task.map serialzation!
- kwargs.pop('task_id', None)
self._BaseOperator__init_kwargs.update(kwargs) # type: ignore
# Here we set upstream task defined by XComArgs passed to template fields of the operator
Which do you think is better @uranusjr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The “other” one looks better to me, it feels like we are intentionally treating task_id different from all other arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 9439378 -- and a bit of a justification/reason in the commit message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out this has an unfortunate side effect:
>>> FooOperator()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: apply_defaults() missing 1 required keyword-only argument: 'task_id'and error message seems to be hard-coded in the interpreter and cannot be patched. I’m trying yet another approach (see new commits)
5644640 to
3ef6268
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't partial_kwargs={} be a default on the MappedOperator itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, cos in all other situations where this is constructor we need to pass a value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for these three, shouldn't we have a default on MappedOperator itself?
211ae54 to
08c4236
Compare
|
I'm now saying this is "good enough" -- we can make changes later if we want to |
It simplifies a few things. We also deal with (and test) the old name when deserializing
Since `task_id` is handled speically in the serialization of MappedOperators, we don't want it duplicated in to the partial_kwargs.
702407b to
9439378
Compare
Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
| @deps.default | ||
| def _deps_from_class(self): | ||
| return self.operator_class.deps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m assuming many of these defaults (this, _is_dummy, and template_fields, I believe) don’t need to consider when operator_class is a str because in that case these values would’ve been supplied explicitly instead. (It’s kind of bad it’s designed this way but I guess that can be said for many things regarding the current serialisation implementation...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, exactly that. I've already started thinking in my head about how to refactor/rearchitect the serialization and deserialization.
|
The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease. |
Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
|
Seems to be working. If the latest two commits look good to you @ashb this is ready to go in. |
|
Looks good -- thanks for picking up those fiixes TP. |
|
I’ll start working on attribute parity between BaseOperator and MappedOperator. |
|
I'm going to work on adding |
This is needed for the scheduler to be able to expand the task instances at runtime