Add Databricks Deferrable Operators #19736
Conversation
@chinwobble, kindly requesting a review :)
```python
super().__init__(*args, **kwargs)
...
async def __aenter__(self):
    self._session = aiohttp.ClientSession()
```
This is really good stuff.
Currently you are using one Hook per trigger. That means if you had 32 concurrent triggers, then they would all have their own client sessions.
I've had a look at the docs for aiohttp, and they say:

> it is suggested you use a single session for the lifetime of your application to benefit from connection pooling.

https://docs.aiohttp.org/en/stable/client_reference.html
What do you think?
Note that hooks are not necessarily run in the same process, so if you want to share sessions among them, you must move the abstraction to the trigger instead.
I wonder how viable it'd be to refactor the synchronous version (DatabricksHook) into a sans-I/O base class that can be used by DatabricksExecutionTrigger, instead of implementing a new, entirely separate hook for it.
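A rough illustration of the sans-I/O idea (all class and method names here are hypothetical sketches, not the actual provider API): the base class holds only pure request-building logic, and thin sync/async subclasses add the actual I/O on top of it:

```python
class BaseDatabricksHookSketch:
    # pure logic, no network calls: shared by the sync hook
    # and the async trigger alike
    host = "https://example.cloud.databricks.com"

    def _build_request(self, method, endpoint):
        return method, f"{self.host}/api/2.0/{endpoint}"

class SyncHookSketch(BaseDatabricksHookSketch):
    def get_run_url(self, run_id):
        # a real sync hook would now call requests with (method, url)
        method, url = self._build_request("GET", f"jobs/runs/get?run_id={run_id}")
        return url

class AsyncHookSketch(BaseDatabricksHookSketch):
    async def get_run_url(self, run_id):
        # a real async hook would now await an aiohttp request instead
        method, url = self._build_request("GET", f"jobs/runs/get?run_id={run_id}")
        return url

print(SyncHookSketch().get_run_url(7))
```

The payoff is that retry rules, URL construction, and response interpretation live in one place, so the sync operator and the async trigger cannot drift apart.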
Ideally each Triggerer process would share one session. A DAG can trigger multiple DatabricksExecutionTrigger at the same time. Could these triggers all share the same session?
They could, yes, because async triggers are designed to run in the same process; that's why they are async. If they ran in their own processes, there'd be no point in using asyncio in the first place. But the difficulty is who should manage that shared session. I wonder if @andrewgodwin actually thought about this preemptively 😛
Thank you both so much, those are very good points.
Trying to summarise:

- Ideally all tasks executed by each `triggerer` should share the same session - however, I'm not sure if that'd be possible without touching the core Airflow `Triggerer` functionality.
- > Note that hooks are not necessarily run in the same process, so if you want to share sessions among them, you must move the abstraction to the trigger instead.

  Does this mean moving the `ClientSession` initialisation to the trigger, i.e. as a property of `DatabricksExecutionTrigger`? My understanding is that when there are multiple `triggerer` processes, each trigger will be executed as a separate async task with its own `ClientSession` instance. So even if we move it to `DatabricksExecutionTrigger`, it'd still create a session for each trigger run.
- Refactoring the `Databricks` hook - I agree it'd be the perfect solution to keep everything inside a single class. Also, after the comment by @alexott there's even more motivation to do that, to avoid doing double work in the future. I'll try to refactor it to minimize repeated code and isolate the IO operations.
> you must move the abstraction to the trigger instead.

> My understanding is that when there are multiple `triggerer` processes, each trigger will be executed as a separate async task with its own `ClientSession` instance. So even if we move it to `DatabricksExecutionTrigger`, it'd still create a session for each trigger run.
Sorry, I think I meant to say triggerer instead of trigger 🙂
Your general summary is correct though; I don’t think it’s particularly worthwhile to make that optimisation right now.
Yes, there is no easy mechanism for pooling connections that are all running under a single triggerer right now, but you can also auto-pool from an async hook implementation by detecting same-thread different-coroutine if you need to. Just makes it quite a bit more complex.
@andrewgodwin Another issue I encountered with deferrable operators is that task duration during deferral is not counted.
When I run this operator, it says all my DAG's tasks finish in 1 sec.
alexott left a comment:
see comment about auth options
```python
raise AirflowException(error_message)
...
class DatabricksRunNowOperator(BaseOperator):
```
Maybe slightly outside the scope of this particular pull request, but it would be nice to implement the on_kill method too.
When a task is cancelled or marked as failed, it should also cancel the running Databricks job run.
To do this, you will need to push the run_id to XCom and retrieve it in the on_kill method. I'm not sure if this is possible.
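A minimal sketch of the idea (the operator and hook classes here are fakes for illustration; the real `DatabricksHook` does expose a `cancel_run` method, but whether `on_kill` can still see the submitted `run_id` in the deferrable case is exactly the open question above):

```python
class FakeDatabricksHook:
    # stand-in for DatabricksHook, whose real cancel_run calls the
    # Databricks jobs/runs/cancel endpoint
    def cancel_run(self, run_id):
        print(f"cancelled run {run_id}")

class SubmitRunSketch:
    # hypothetical operator: remember the run_id at submit time,
    # cancel the remote run when the task is killed
    def __init__(self):
        self.run_id = None
        self.hook = FakeDatabricksHook()

    def execute(self, context=None):
        # in reality the run_id comes back from the submit API call;
        # it would need to be pushed to XCom to survive into on_kill
        self.run_id = 42

    def on_kill(self):
        if self.run_id is not None:
            self.hook.cancel_run(self.run_id)

op = SubmitRunSketch()
op.execute()
op.on_kill()
```

For the non-deferrable operator this pattern works because `execute` and `on_kill` run in the same worker process; once the task is deferred, the state has to travel through the trigger payload or XCom instead.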
I've just tried to play with the deferrable operator's behaviour when a task is killed, and there might be a bug with it.
If a task is killed while being executed by the Triggerer, there's no log available for that task after the kill. Later on, if the same task is started again, it finishes immediately, as if it had continued after being deferred. I'll raise an issue with a detailed description of how to reproduce it.
Ah yes, I can see how that might leave things in a strange state. We may need to add code to do better cleanup when a task is killed-while-deferred.
@andrewgodwin @alexott
Any thoughts on how this should be handled?
```python
"""
method, endpoint = endpoint_info
...
self.databricks_conn = self.get_connection(self.databricks_conn_id)
```
Watch out - this is a blocking call, as it does a database fetch/secrets backend connection behind the scenes. You need to wrap it in something to make it async-compatible.
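One conventional way to make such a blocking call async-compatible is to push it onto a worker thread; on Python 3.9+ `asyncio.to_thread` does this directly (sketch with a fake fetch function standing in for the real `get_connection`):

```python
import asyncio
import time

def blocking_fetch(conn_id):
    # stand-in for get_connection, which does a blocking
    # metadata-database / secrets-backend lookup
    time.sleep(0.01)
    return {"conn_id": conn_id}

async def get_connection_async(conn_id):
    # hand the blocking call off to a thread so the triggerer's
    # event loop stays free for the other triggers
    return await asyncio.to_thread(blocking_fetch, conn_id)

conn = asyncio.run(get_connection_async("databricks_default"))
print(conn["conn_id"])
```

On older Pythons the equivalent is `loop.run_in_executor(None, blocking_fetch, conn_id)`.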
Hmm. Maybe there is a way to at least flag Airflow's blocking methods when this happens? Or find a way to forbid them when we are in an async context @andrewgodwin ?
If you recall, this is the very case I was afraid would be too easy to "slip through" when people start making more use of the deferred operators and use them in their custom ones, which do not pass through the watchful eyes of you and the other committers of Airflow.
Indeed. We could always lift what we did with Django for this very problem: https://github.com/django/django/blob/37d9ea5d5c010d54a416417399344c39f4e9f93e/django/utils/asyncio.py#L8
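The Django decorator linked above boils down to roughly this (simplified sketch; Django's real `async_unsafe` also takes a custom message and honours a `DJANGO_ALLOW_ASYNC_UNSAFE` environment-variable escape hatch):

```python
import asyncio
import functools

def async_unsafe(func):
    # refuse to run a blocking function while an event loop is active
    @functools.wraps(func)
    def inner(*args, **kwargs):
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            return func(*args, **kwargs)  # no running loop: safe to block
        raise RuntimeError(
            f"You cannot call {func.__name__} from an async context."
        )
    return inner

@async_unsafe
def get_connection():
    # stand-in for a blocking Airflow call (DB / secrets fetch)
    return "conn"

print(get_connection())  # fine outside a loop

async def main():
    try:
        get_connection()
    except RuntimeError:
        print("blocked in async context")

asyncio.run(main())
```

The cost is one `get_running_loop` check per call, which is cheap compared to intercepting every method call generically.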
❤️ it. I think there are a number of "likely to use" methods in Airflow that we could mark as async_unsafe with decorators, anticipating they will cause troubles.
Or maybe we could invert the logic :) and assume that, by default, any Airflow call is unsafe (unless decorated with @async_safe).
The problem with turning async safety onto everything by default is that you need something like the decorator to be in the middle of the call to do the check - not sure we can reasonably override every Airflow API function without some seriously dark magic.
Yeah. Basically you'd have to intercept every method call - while possible, it would be a terrible performance penalty.
Updated and rebased the PR, implementing new auth methods for Azure AD and making improvements based on the comments in the PR.
This operator will not import cleanly on Airflow 2.1 (see the errors in tests). airflow.triggers is not available in Airflow 2.1. There are two approaches we can take:
I am for the 2nd solution, as there are quite a few non-deferrable operators there that should continue to work. But I would love to hear what others think.
Thank you for this point; it's true that the deferrable operators functionality is only available starting with version 2.2. The second approach, with try/catch on import errors, sounds right to me. I'd be glad to work on it if others agree upon this approach.
Thank our CI :) It simply caught this!
Thinking aloud about the second approach with trying and catching exceptions:
I was thinking about using the abstract class as a substitute for the non-existing classes if …
Not really. We can, in this case, set BaseTrigger and TriggerEvent to None and add the warning to "known warnings". As long as those classes are not actually 'run' during parsing of the module (they should not be), it will work just fine. We've done that in other places:
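The guarded-import pattern being described would look roughly like this (a sketch: `airflow.triggers.base` is the Airflow ≥ 2.2 location of these classes, and the warning text is illustrative):

```python
import warnings

try:
    # airflow.triggers only exists from Airflow 2.2 onwards
    from airflow.triggers.base import BaseTrigger, TriggerEvent  # type: ignore
except ImportError:
    warnings.warn("Deferrable Databricks operators require Airflow >= 2.2")
    # placeholders so the module still parses on older Airflow;
    # nothing below may instantiate them at import time
    BaseTrigger = None  # type: ignore
    TriggerEvent = None  # type: ignore

print("module imported fine")
```

The key constraint is the one noted above: the placeholders are only safe as long as nothing references the classes while the DAG file is being parsed.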
I've rebased the PR and resolved the conflicts, adapted the operators/hooks to the recent changes, and added a warning in case deferrable operators are used on Airflow prior to version 2.2.
Sorry for bothering, just wanted to check if I could do anything here to improve the PR. Trying to summarise the points discussed:
I've also changed the initial approach for creating async hook methods - at first I thought about creating a class …
Resolved the conflicts and rebased the PR.
Tests are still failing.
Some changes were merged, so the conflicts need to be resolved.
The PR is rebased and the conflicts are resolved.
Resolved conflicts, rebased the PR.
I do not yet know that much about deferrable operators to comment, but at least from a quick look it looks good. @andrewgodwin ?
There were some conflicts; resolved them and rebased.
Rebased and resolved conflicts.
@alexott - any more comments?
Sorry for the delay, I will only have time to review over the weekend.
alexott left a comment:
LGTM, I'm only a bit worried about the code duplication between do_api_call and its async version. Maybe we need to do some refactoring later...
The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR and push it with --force-with-lease.
LGTM. We can remove the duplication later, I think :)
Thank you everyone for your reviews, ideas and activity! :)
closes: #18999
The PR intends to add deferrable versions of `DatabricksSubmitRunOperator` and `DatabricksRunNowOperator`, which use new Airflow functionality introduced with version 2.2.0. There aren't many examples of deferrable operators at the moment, so I'd appreciate it if we could discuss the following:
If we update the existing operators right away, we'll break backward compatibility for Airflow versions prior to 2.2.0, so I thought it'd be better to introduce new operators for the moment.
`execution_timeout` after being deferred #19382