Fix race condition when starting DagProcessorAgent #19935
e1adfaa to 067cc80
Looks really good with all tests green! :) Cannot wait to merge it and rebase the last 'stability' one :)
fa9587c to b309675
As described in detail in apache#19860, there was a race condition in starting and terminating DagProcessorAgent. It caused us a lot of headaches with flaky test_scheduler_job failures on our CI, and it took a long investigation to track it down. It is not very likely, but it can happen in production.
The race condition involved starting DagProcessorAgent via multiprocessing, where the first action of the agent was changing the process group ID (GID) to be the same as the PID. If the DagProcessorAgent was terminated quickly (on a busy system) before the process could change the GID, the `reap_process_group` call that was supposed to kill the whole group failed and the DagProcessorAgent remained running.
This problem revealed wrong behaviour of Airflow in some edge conditions when 'spawn' mode was used for starting the DAG processor. Details are described in apache#19934, but that problem will have to be solved differently (avoiding ORM reinitialization during DAG processor starting).
This change also moves the tests for the `spawn` method out of test_scheduler_job.py (they were a remnant of old Airflow and did not really test what they were supposed to test). Instead, tests were added for different spawn modes and for killing the processor agent in both spawn and "default" mode.
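To make the race window concrete, here is a minimal POSIX-only sketch. It is not the code from this PR and does not use Airflow's `reap_process_group`; the names `agent`, `reap` and `start_and_reap` are hypothetical, and the `Event` handshake is just one assumed way to wait until the child has changed its process group before killing the group.

```python
import multiprocessing as mp
import os
import signal
import time


def agent(ready):
    """Hypothetical stand-in for the agent process."""
    # First action of the child: become the leader of its own process group,
    # so that its group ID equals its PID.
    os.setpgid(0, 0)
    # Tell the parent that the group change has happened.
    ready.set()
    time.sleep(60)


def reap(proc):
    """Kill the child's whole process group, but never our own group."""
    pgid = os.getpgid(proc.pid)
    if pgid == os.getpgid(0):
        # The child has not changed its group yet (the race window):
        # killing the group here would also kill the parent, so kill
        # only the child process instead.
        proc.kill()
    else:
        os.killpg(pgid, signal.SIGKILL)
    proc.join(timeout=5)


def start_and_reap():
    # "fork" is chosen for the sketch; "spawn" would need the usual
    # importable-module setup for the target function.
    ctx = mp.get_context("fork")
    ready = ctx.Event()
    proc = ctx.Process(target=agent, args=(ready,))
    proc.start()
    # Without this wait, a fast termination could run before setpgid().
    ready.wait(timeout=5)
    group_ready = os.getpgid(proc.pid) == proc.pid
    reap(proc)
    return group_ready, proc.is_alive()


if __name__ == "__main__":
    print(start_and_reap())
```

Dropping the `ready.wait(...)` line reopens the window described above: the parent may then look up the child's group while it is still the parent's own group.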
b309675 to fdbcdd2
The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly, please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.
Woohooo!
The same commit message appears again on the cherry-picked commit (cherry picked from commit 5254843).