-
Notifications
You must be signed in to change notification settings - Fork 16.4k
Add dag-processor cli command #22305
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc: @potiuk |
652577d to
d5e21ad
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. But I sense some simplification possibility.
I understand why _standalone_mode is set as field in the manager, but it is in fact unnecessary and requires to make some mental correlation between "standalone_mode" and "signal_conn". I think the code will be a little cleaner if there is only "signal_conn" used.
Maybe name it differently to make it more obvious: direct_scheduler_conn or something ? This way the if's will be more obvious:
if not self._direct_scheduler_conn:
or
if self._direct_scheduler_conn:
49d0b03 to
4bdb0e3
Compare
potiuk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks nice. I think we need to add some documentation update (in the architecture mostly) to explain it but It can/should be done as separate PR.
|
Since this is a core part of scheduler behaviour we definitely need another commiter's review. Anyone? |
|
The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease. |
|
BTW. It's been a bit easier than I thought it will be it seems :) |
|
I will update documentation after we run more tests - of course I ran my own tests, but I hope someone else verify it as well, before we add it to the public docs. |
773a8bc to
b1e0a12
Compare
I think so. This is already an improvement in security and isolation. While it is still far from the the "true" multi-tenancy, it already gives some features of multitenancy. This is still optional and without the DB isolation it does not give a lot of "security". but it already gives users interesting possibilities - for example if they have really a lot of DAGs to parse, but the do not want to run multiple schedulers - they could have a single or two schedulers and for example 10 The additional security you get here is important. By employing DAG processor as separate process running in separate machine (or even separate security zone) you finally reach something that even some of scenarios where DAG writers could exercise too much power on scheduler. By adding DAG processor separation those scenarios are impossible and DAG writer is not able to execute any code on scheduler any more. Additionally that will give the users an opportunity to isolate code executed by different groups of people. It does not secure a database access yet, but if you run separate DAG processors for separate subdirs with different "write" access, that already gives the user an option to run separate DAG processor in separate sub-folder you won't be able to execute a code on the same machine as "other group" of yours. Surely - you will be able to do in tasks, but if you use Kubernetes Executor, those are executed in separate PODs, so you will be able to already achive "complete code execution" isolation for multiple groups of people. I think - Google wants to get it deployed anyway, so they might want to take alpha/beta/RC candidates of 2.3 and hammer it to make it really robust and tested, and the sooner that feature gets into hands of the users, the more prepared we will be for next steps of multi-tenancy introduction. It will give us a chance to iron out all the wrinkles in case we find them before we go deeper into "full" multi-tenancy. |
|
Of course we need a bit more documentation (but we can very quickly add the docs describing the use cases and architecture and benefits of it). I am happy to prepare some of that actually if we need to do it quickly - even for Alpha of 2.3.0. Also looking at the "size" and complexity of this overal change, I think the risks connected with this change are rather small. |
| ) | ||
| with ctx: | ||
| try: | ||
| manager.register_exit_signals() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also call.manager.start?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I prepared a fix #22720
Thanks!
| self.waitables.pop(sentinel) | ||
| self._processors.pop(processor.file_path) | ||
|
|
||
| if conf.getboolean("scheduler", "standalone_dag_processor"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should read configuration option values out of the loop because they are not cached.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also addressed in #22720
+ move reading [scheduler]standalone_dag_processor outside of the loop See apache#22305 (comment)
|
Good eyes @mik-laj ! |
+ move reading [scheduler]standalone_dag_processor outside of the loop See #22305 (comment)
+ move reading [scheduler]standalone_dag_processor outside of the loop See apache/airflow#22305 (comment) GitOrigin-RevId: 215993b75d0b3a568b01a29e063e5dcdb3b963e1
|
I love this feature! |
Not yet. This is something @mhenc is plannig to add to complete AIP-43. |
+ move reading [scheduler]standalone_dag_processor outside of the loop See apache/airflow#22305 (comment) GitOrigin-RevId: 215993b75d0b3a568b01a29e063e5dcdb3b963e1
+ move reading [scheduler]standalone_dag_processor outside of the loop See apache/airflow#22305 (comment) GitOrigin-RevId: 215993b75d0b3a568b01a29e063e5dcdb3b963e1
+ move reading [scheduler]standalone_dag_processor outside of the loop See apache/airflow#22305 (comment) GitOrigin-RevId: 215993b75d0b3a568b01a29e063e5dcdb3b963e1
+ move reading [scheduler]standalone_dag_processor outside of the loop See apache/airflow#22305 (comment) GitOrigin-RevId: 215993b75d0b3a568b01a29e063e5dcdb3b963e1
+ move reading [scheduler]standalone_dag_processor outside of the loop See apache/airflow#22305 (comment) GitOrigin-RevId: 215993b75d0b3a568b01a29e063e5dcdb3b963e1
+ move reading [scheduler]standalone_dag_processor outside of the loop See apache/airflow#22305 (comment) GitOrigin-RevId: 215993b75d0b3a568b01a29e063e5dcdb3b963e1
+ move reading [scheduler]standalone_dag_processor outside of the loop See apache/airflow#22305 (comment) GitOrigin-RevId: 215993b75d0b3a568b01a29e063e5dcdb3b963e1
+ move reading [scheduler]standalone_dag_processor outside of the loop See apache/airflow#22305 (comment) GitOrigin-RevId: 215993b75d0b3a568b01a29e063e5dcdb3b963e1
+ move reading [scheduler]standalone_dag_processor outside of the loop See apache/airflow#22305 (comment) GitOrigin-RevId: 215993b75d0b3a568b01a29e063e5dcdb3b963e1
+ move reading [scheduler]standalone_dag_processor outside of the loop See apache/airflow#22305 (comment) GitOrigin-RevId: 215993b75d0b3a568b01a29e063e5dcdb3b963e1
+ move reading [scheduler]standalone_dag_processor outside of the loop See apache/airflow#22305 (comment) GitOrigin-RevId: 215993b75d0b3a568b01a29e063e5dcdb3b963e1
+ move reading [scheduler]standalone_dag_processor outside of the loop See apache/airflow#22305 (comment) GitOrigin-RevId: 215993b75d0b3a568b01a29e063e5dcdb3b963e1
This change introduces new cli command: 'dag-processor' which runs DagProcessorManager as a standalone process.
Part of AIP-43
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-43+DAG+Processor+separation