Skip to content

Conversation

@potiuk
Copy link
Member

@potiuk potiuk commented Dec 1, 2021

As described in detail in #19860, there was a race condition in
starting and terminating DagProcessorAgent that caused us a lot
of headeaches with flaky test_scheduler_job failures on our CI
and after long investigation, it turned out to be a race
condition. Not very likely, but possible to happen in production.

The race condition involved starting DagProcessorAgent via
multiprocessing, where the first action of the agent was changing
the process GID to be the same as PID. If the DagProcessorAgent
was terminated quickly (on a busy system) before the process
could change the GID, the reap_process_group that was supposed
to kill the whole group, was failing and the DagProcessorAgent
remained running.

This problem revealed a wrong behaviour of Airflow in some edge
conditions when 'spawn' mode was used for starting the DAG processor
Details are described in #19934, but this problem will have to be
solved differently (avoiding ORM reinitialization during DAG
processor starting).

This change also moves the tests for spawn method out from
test_scheduler_job.py (it was a remnant of old Airlfow and it
did not really test what it was supposed to test). Instead tests
were added for different spawn modes and killing the processor
agent in both spawn and "default" mode.


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Dec 1, 2021
@potiuk potiuk force-pushed the fix-race-condition-with-killing-processor-mnager branch from e1adfaa to 067cc80 Compare December 1, 2021 16:36
@potiuk potiuk changed the title Fix race condition with killing processor mnager Fix race condition when starting DagProcessorAgent Dec 1, 2021
@potiuk
Copy link
Member Author

potiuk commented Dec 1, 2021

Looks really good with all tests green !!! :). Cannot wait to merge it and rebase the last 'stability' one :)

@potiuk potiuk force-pushed the fix-race-condition-with-killing-processor-mnager branch 3 times, most recently from fa9587c to b309675 Compare December 1, 2021 23:07
As described in detail in apache#19860, there was a race condition in
starting and terminating DagProcessorAgent that caused us a lot
of headeaches with flaky test_scheduler_job failures on our CI
and after long investigation, it turned out to be a race
condition. Not very likely, but possible to happen in production.

The race condition involved starting DagProcessorAgent via
multiprocessing, where the first action of the agent was changing
the process GID to be the same as PID. If the DagProcessorAgent
was terminated quickly (on a busy system) before the process
could change the GID, the `reap_process_group` that was supposed
to kill the whole group, was failing and the DagProcessorAgent
remained running.

This problem revealed a wrong behaviour of Airflow in some edge
conditions when 'spawn' mode was used for starting the DAG processor
Details are described in apache#19934, but this problem will have to be
solved differently (avoiding ORM reinitialization during DAG
processor starting).

This change also moves the tests for `spawn` method out from
test_scheduler_job.py (it was a remnant of old Airlfow and it
did not really test what it was supposed to test). Instead tests
were added for different spawn modes and killing the processor
agent in both spawn and "default" mode.
@potiuk potiuk force-pushed the fix-race-condition-with-killing-processor-mnager branch from b309675 to fdbcdd2 Compare December 2, 2021 18:57
@github-actions
Copy link

github-actions bot commented Dec 2, 2021

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Dec 2, 2021
@potiuk potiuk merged commit 5254843 into apache:main Dec 2, 2021
@potiuk potiuk deleted the fix-race-condition-with-killing-processor-mnager branch December 2, 2021 23:45
@potiuk
Copy link
Member Author

potiuk commented Dec 2, 2021

Woohooo!

@jedcunningham jedcunningham added this to the Airflow 2.2.3 milestone Dec 11, 2021
@jedcunningham jedcunningham added the type:bug-fix Changelog: Bug Fixes label Dec 11, 2021
potiuk added a commit that referenced this pull request Dec 11, 2021
As described in detail in #19860, there was a race condition in
starting and terminating DagProcessorAgent that caused us a lot
of headeaches with flaky test_scheduler_job failures on our CI
and after long investigation, it turned out to be a race
condition. Not very likely, but possible to happen in production.

The race condition involved starting DagProcessorAgent via
multiprocessing, where the first action of the agent was changing
the process GID to be the same as PID. If the DagProcessorAgent
was terminated quickly (on a busy system) before the process
could change the GID, the `reap_process_group` that was supposed
to kill the whole group, was failing and the DagProcessorAgent
remained running.

This problem revealed a wrong behaviour of Airflow in some edge
conditions when 'spawn' mode was used for starting the DAG processor
Details are described in #19934, but this problem will have to be
solved differently (avoiding ORM reinitialization during DAG
processor starting).

This change also moves the tests for `spawn` method out from
test_scheduler_job.py (it was a remnant of old Airlfow and it
did not really test what it was supposed to test). Instead tests
were added for different spawn modes and killing the processor
agent in both spawn and "default" mode.

(cherry picked from commit 5254843)
@potiuk potiuk restored the fix-race-condition-with-killing-processor-mnager branch April 26, 2022 20:45
@potiuk potiuk deleted the fix-race-condition-with-killing-processor-mnager branch July 29, 2022 20:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Scheduler including HA (high availability) scheduler full tests needed We need to run full set of tests for this PR to merge type:bug-fix Changelog: Bug Fixes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants