@mhenc
Contributor

@mhenc mhenc commented Mar 16, 2022

This change introduces a new CLI command, 'dag-processor', which runs DagProcessorManager as a standalone process.

Part of AIP-43
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-43+DAG+Processor+separation

@boring-cyborg boring-cyborg bot added area:CLI area:Scheduler including HA (high availability) scheduler labels Mar 16, 2022
@mhenc
Contributor Author

mhenc commented Mar 16, 2022

cc: @potiuk

@mhenc mhenc force-pushed the standalone_dp branch 6 times, most recently from 652577d to d5e21ad Compare March 21, 2022 14:09
Member

@potiuk potiuk left a comment

Looks good, but I see room for some simplification.

I understand why _standalone_mode is set as a field in the manager, but it is in fact unnecessary and forces the reader to mentally correlate "standalone_mode" with "signal_conn". I think the code will be a little cleaner if only "signal_conn" is used.

Maybe name it differently to make it more obvious: direct_scheduler_conn or something? This way the ifs will be more obvious:

if not self._direct_scheduler_conn:

or

if self._direct_scheduler_conn:
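
For illustration, here is a minimal sketch of the single-attribute approach suggested above. `ManagerSketch` and `report_progress` are hypothetical names and are not taken from the PR; only the `_direct_scheduler_conn` attribute name comes from the suggestion. The assumption is that the connection is `None` when no scheduler is attached, so its presence alone answers what a separate flag would.

```python
import multiprocessing
from multiprocessing.connection import Connection
from typing import Optional


class ManagerSketch:
    """Hypothetical stand-in for the manager, not the real Airflow class."""

    def __init__(self, direct_scheduler_conn: Optional[Connection] = None):
        # None means the manager runs standalone, with no scheduler attached.
        self._direct_scheduler_conn = direct_scheduler_conn

    def report_progress(self, message: str) -> bool:
        """Send a message to the scheduler if one is attached.

        Returns True when the message was sent, False in standalone mode.
        """
        if not self._direct_scheduler_conn:
            return False
        self._direct_scheduler_conn.send(message)
        return True


if __name__ == "__main__":
    scheduler_end, manager_end = multiprocessing.Pipe()
    print(ManagerSketch().report_progress("parsed"))             # standalone
    print(ManagerSketch(manager_end).report_progress("parsed"))  # attached
    print(scheduler_end.recv())
```

With this shape there is no second field to keep in sync: every branch asks whether a direct connection exists.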

@mhenc mhenc force-pushed the standalone_dp branch 3 times, most recently from 49d0b03 to 4bdb0e3 Compare March 22, 2022 20:41
@mhenc mhenc requested a review from potiuk March 22, 2022 20:56
Member

@potiuk potiuk left a comment

Looks nice. I think we need a documentation update (mostly in the architecture section) to explain it, but it can and should be done as a separate PR.

@potiuk
Member

potiuk commented Mar 24, 2022

Since this is a core part of scheduler behaviour, we definitely need another committer's review. Anyone?

@github-actions

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Mar 24, 2022
@potiuk
Member

potiuk commented Mar 24, 2022

BTW, it seems this has been a bit easier than I thought it would be :)

@mhenc
Contributor Author

mhenc commented Mar 25, 2022

I will update the documentation after we run more tests. Of course I ran my own tests, but I hope someone else verifies it as well before we add it to the public docs.

@mhenc mhenc force-pushed the standalone_dp branch 3 times, most recently from 773a8bc to b1e0a12 Compare March 29, 2022 09:45
@potiuk
Member

potiuk commented Apr 1, 2022

Do we want to have this in 2.3?

I think so. This is already an improvement in security and isolation. While it is still far from "true" multi-tenancy, it already gives some features of multi-tenancy.

This is still optional, and without DB isolation it does not give a lot of "security", but it already gives users interesting possibilities. For example, if they have a really large number of DAGs to parse but do not want to run multiple schedulers, they could have one or two schedulers and, for example, 10 DAG processors.

The additional security you get here is important. By running the DAG processor as a separate process on a separate machine (or even in a separate security zone), you finally close off some of the scenarios where DAG writers could exercise too much power over the scheduler. With DAG processor separation those scenarios become impossible, and a DAG writer is no longer able to execute any code on the scheduler.

Additionally, that will give users an opportunity to isolate code executed by different groups of people. It does not secure database access yet, but if you run separate DAG processors for separate subdirs with different "write" access, you won't be able to execute code on the same machine as the "other group". Sure, you will still be able to do so in tasks, but if you use the Kubernetes Executor, those are executed in separate Pods, so you will already be able to achieve "complete code execution" isolation for multiple groups of people.

I think Google wants to get it deployed anyway, so they might want to take the alpha/beta/RC candidates of 2.3 and hammer them to make the feature really robust and tested. The sooner the feature gets into the hands of users, the more prepared we will be for the next steps of introducing multi-tenancy. It will give us a chance to iron out any wrinkles we find before we go deeper into "full" multi-tenancy.

@potiuk
Member

potiuk commented Apr 1, 2022

Of course we need a bit more documentation (but we can very quickly add the docs describing the use cases and architecture and benefits of it). I am happy to prepare some of that actually if we need to do it quickly - even for Alpha of 2.3.0.

Also, looking at the size and complexity of the overall change, I think the risks connected with it are rather small.

@potiuk potiuk added the AIP-43 DAG processor separation AIP-43 label Apr 1, 2022
@potiuk potiuk merged commit f5f11ae into apache:main Apr 1, 2022
    )
    with ctx:
        try:
            manager.register_exit_signals()
Member

Should we also call manager.start?

Contributor Author

Right, I prepared a fix #22720
Thanks!
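
A rough sketch of the call order discussed in this thread, using a stub in place of the real manager. `StubManager`, `run_manager` and `cleanup` are made-up names for illustration; only `register_exit_signals` and `start` appear in the thread, and the actual fix lives in #22720.

```python
from typing import List


class StubManager:
    """Hypothetical stand-in that records the order of lifecycle calls."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def register_exit_signals(self) -> None:
        self.calls.append("register_exit_signals")

    def start(self) -> None:
        # In a real manager this would enter the parsing loop.
        self.calls.append("start")

    def cleanup(self) -> None:
        # Assumed teardown hook, not a name taken from the PR.
        self.calls.append("cleanup")


def run_manager(manager: StubManager) -> None:
    """Register signal handlers, then start; always clean up afterwards."""
    try:
        manager.register_exit_signals()
        manager.start()  # without this call the process would exit right away
    finally:
        manager.cleanup()


if __name__ == "__main__":
    stub = StubManager()
    run_manager(stub)
    print(stub.calls)
```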

self.waitables.pop(sentinel)
self._processors.pop(processor.file_path)

if conf.getboolean("scheduler", "standalone_dag_processor"):
Member

We should read configuration option values outside of the loop because they are not cached.

Contributor Author

Also addressed in #22720
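
To make the review point concrete, here is a small sketch of hoisting an uncached configuration read out of a loop. `CountingConfig` and `collect_finished` are hypothetical and only mimic the `conf.getboolean("scheduler", "standalone_dag_processor")` call shown in the diff; the read counter exists purely to show how often the option is looked up.

```python
from typing import Dict, Iterable, List, Tuple


class CountingConfig:
    """Hypothetical config object whose reads are not cached."""

    def __init__(self, values: Dict[Tuple[str, str], str]) -> None:
        self._values = values
        self.reads = 0

    def getboolean(self, section: str, key: str) -> bool:
        self.reads += 1  # every call pays the lookup cost again
        return self._values[(section, key)].strip().lower() in ("true", "1", "yes")


def collect_finished(conf: CountingConfig, file_paths: Iterable[str]) -> List[str]:
    """Label each finished file, reading the option once before the loop."""
    standalone = conf.getboolean("scheduler", "standalone_dag_processor")
    handled = []
    for path in file_paths:
        mode = "standalone" if standalone else "embedded"
        handled.append(f"{mode}:{path}")
    return handled


if __name__ == "__main__":
    conf = CountingConfig({("scheduler", "standalone_dag_processor"): "True"})
    print(collect_finished(conf, ["a.py", "b.py", "c.py"]))
    print(conf.reads)
```

Had the `getboolean` call stayed inside the loop, the counter would grow with the number of files instead of staying at one.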

mhenc added a commit to mhenc/airflow that referenced this pull request Apr 4, 2022
+ move reading [scheduler]standalone_dag_processor outside of the loop

See
apache#22305 (comment)
@potiuk
Member

potiuk commented Apr 4, 2022

Good eyes @mik-laj !

mik-laj pushed a commit that referenced this pull request Apr 4, 2022
+ move reading [scheduler]standalone_dag_processor outside of the loop

See
#22305 (comment)
@ephraimbuddy ephraimbuddy added the type:new-feature Changelog: New Features label Apr 8, 2022
@mhenc mhenc deleted the standalone_dp branch July 4, 2022 12:04
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Jul 10, 2022
+ move reading [scheduler]standalone_dag_processor outside of the loop

See
apache/airflow#22305 (comment)

GitOrigin-RevId: 215993b75d0b3a568b01a29e063e5dcdb3b963e1
@marclamberti

I love this feature!
Can we already run different dag-processors in different folders or not yet?
I tried to do that, but my dags keep getting replaced by the ones parsed by the latest dag-processor running

@potiuk
Member

potiuk commented Aug 1, 2022

> I love this feature! Can we already run different dag-processors in different folders or not yet? I tried to do that, but my dags keep getting replaced by the ones parsed by the latest dag-processor running

Not yet. This is something @mhenc is planning to add to complete AIP-43.

@ashb ashb added this to the Airflow 2.3.0 milestone Aug 5, 2022