[AIRFLOW-4797] Fix zombie detection #5420
Codecov Report
@@ Coverage Diff @@
## master #5420 +/- ##
=========================================
- Coverage 79.1% 79.1% -0.01%
=========================================
Files 483 483
Lines 30317 30312 -5
=========================================
- Hits 23983 23977 -6
- Misses 6334 6335 +1
The jira issue talks about
However, I don't see anything in the change or the original code that changes which DAGs we look for zombies in; it should be all DAGs. So either this change doesn't fix the problem, or it never behaved as described? In addition, rather than running that on every loop, I'd feel more comfortable making the default 10s interval a config value, so that, for example, you could set it to 0 in your install if you wanted.
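A minimal sketch of the configurable interval suggested above, not the code from this PR. The class name `ThrottledCheck` and its parameters are hypothetical; the only point is that an interval of 0 makes the check run on every loop, while the default of 10 seconds throttles it.

```python
import time


class ThrottledCheck:
    """Run a callback at most once per `interval_secs` (hypothetical helper)."""

    def __init__(self, callback, interval_secs=10.0, clock=time.monotonic):
        self._callback = callback
        self._interval_secs = interval_secs  # would come from a config value
        self._clock = clock  # injectable so the behaviour can be tested
        self._last_run = None  # None means the callback has never run

    def maybe_run(self):
        """Call the callback if the interval has elapsed; return True if it ran."""
        now = self._clock()
        due = (
            self._last_run is None
            or now - self._last_run >= self._interval_secs
        )
        if due:
            self._last_run = now
            self._callback()
        return due
```

Calling `maybe_run()` once per scheduler loop keeps the call site unconditional, and an install that wants the check on every loop only has to set the interval to 0.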
Yes, correct, the airflow/airflow/utils/dag_processing.py Line 1217 in 93de2ce
n DAG files which receive the list of zombies, subsequent processors for other DAG files just get an empty list.
The list of (all or none) zombies is passed down via airflow/airflow/models/dagbag.py Line 271 in 93de2ce
This is far too complex for something as simple as detecting zombie task instances and killing them. Last Friday I spent 5 hours debugging to find the reason. I wondered whether it would be better to remove the zombie detection from
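A toy model of the behaviour described in this comment, to make the failure mode concrete. It is not Airflow code: the function names, the `first_n` parameter, and the assumption that a zombie is only handled by the processor of its own DAG file are all illustrative.

```python
def hand_out_zombies(dag_files, zombies, first_n):
    """Give the zombie list to the first `first_n` processors only.

    Later processors get an empty list, as described in the comment above.
    """
    handed = {}
    for index, path in enumerate(dag_files):
        handed[path] = list(zombies) if index < first_n else []
    return handed


def zombies_actually_handled(handed, dag_file_of):
    """Return the zombies that reached the processor of their own DAG file.

    Assumption: a processor can only act on zombies belonging to the DAG
    file it is parsing; `dag_file_of` maps zombie id -> DAG file path.
    """
    handled = set()
    for path, received in handed.items():
        for zombie in received:
            if dag_file_of[zombie] == path:
                handled.add(zombie)
    return handled
```

With three DAG files, `first_n=2`, and one zombie in the first file and one in the third, only the first zombie is handled in that round; the second one is dropped because its processor received an empty list.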
Oh ouch. I'd have to double-check the code, but finding zombies doesn't seem relevant to DAG processing, so moving it does sound sensible.
OK, I'll give it a try.
TI = airflow.models.TaskInstance
limit_dttm = timezone.utcnow() - timedelta(
    seconds=self._zombie_threshold_secs)
self.log.info("Failing jobs without heartbeat after %s", limit_dttm)
This is going to produce quite a lot of logs.
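One way to address the log volume, sketched with the standard library only. This is an assumption about a possible fix, not what the PR does: the function `find_stale`, the logger name, and the `heartbeats` mapping are hypothetical. The routine per-pass message goes to DEBUG, and a higher-level message is emitted only when stale heartbeats are actually found.

```python
import logging
from datetime import datetime, timedelta, timezone

log = logging.getLogger("zombie_sketch")


def find_stale(heartbeats, threshold_secs, now=None):
    """Return the ids whose latest heartbeat is older than the threshold.

    `heartbeats` maps a (hypothetical) task instance id to its last
    heartbeat as a timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    limit_dttm = now - timedelta(seconds=threshold_secs)
    # Emitted on every pass, so keep it at DEBUG to avoid flooding the log.
    log.debug("Looking for jobs without heartbeat after %s", limit_dttm)
    stale = sorted(ti for ti, beat in heartbeats.items() if beat < limit_dttm)
    if stale:
        # Only the interesting case is reported at a higher level.
        log.warning("Found %d job(s) without recent heartbeat: %s",
                    len(stale), stale)
    return stale
```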
Here is the first draft of the alternative approach to move zombie detection down to
Missing: more tests, verify that the SQL query is fast (uses an index), find out whether throttling makes sense, ...