[AIRFLOW-224] Collect orphaned tasks and reschedule them #1581
Conversation
Current coverage is 68.13%

@@            master    #1581    diff @@
========================================
  Files          116      116
  Lines         8311     8330     +19
  Methods          0        0
  Messages         0        0
  Branches         0        0
========================================
+ Hits          5653     5676     +23
+ Misses        2658     2654      -4
  Partials         0        0
@plypaul I'm working on adding some tests. Please let me know if you think this approach covers our discussion.
airflow/jobs.py
Outdated
Since garbage collection is usually referred to in the memory context, can we rename this? Maybe _reset_state_for_orphaned_tasks?
Nice work! This looks like it will handle the cases we talked about in the other PR.
Force-pushed from 91bf13a to 6e92230
@plypaul should be ready for review. Build seems ok (postgres is not failing here on 3.4: https://travis-ci.org/bolkedebruin/airflow)
I haven't been following all the development on the scheduler, but what I see here looks good to me. I'm really excited to see flaws I baked into the scheduler a long time ago getting addressed! I think @plypaul is probably the best person to review this part of the code at this point though.
Force-pushed from 87016c0 to e160ce0
@plypaul ?
airflow/jobs.py
Outdated
I don't think we support multiple schedulers officially? Without locks, there are a lot of race conditions.
There are multiple people who do run multiple schedulers, and we don't officially discourage it either. The updates I made to the scheduler make it almost safe to do so (only UP_FOR_RETRY tasks still have that issue).
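(A minimal sketch of the row-level locking alluded to above, assuming SQLAlchemy's `with_for_update()` and the Airflow `TaskInstance`/`State` models; the function name and the state transition shown are hypothetical, and this is not code from this PR.)

```python
from airflow.models import TaskInstance as TI
from airflow.utils.state import State


def pick_up_retry_tasks(session):
    # Illustrative only: lock the rows being examined so a second scheduler
    # blocks on the same rows instead of double-scheduling the same tasks.
    tis = (
        session.query(TI)
        .filter(TI.state == State.UP_FOR_RETRY)
        .with_for_update()
        .all()
    )
    for ti in tis:
        ti.state = State.SCHEDULED  # hypothetical transition for the example
    session.commit()  # commit releases the row locks
```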
Sorry, I was a little backlogged there. The PR looks good to me. One thought I had was that finding orphaned tasks should only be needed at startup to bootstrap the executor state.
Np. Good point actually. I'll move it to the start of the scheduler/executor.
Force-pushed from e160ce0 to 9503dc9
@plypaul ok moved the logic to only run after start of the executor.
Tasks can get orphaned if the scheduler is killed in the middle of processing them, or if the MQ queue is cleared without a worker having picked them up. Tasks are now no longer set to the scheduled state if they have not yet been sent to the executor. In addition, a garbage collector scans the executor for tasks that are no longer present and reschedules them if needed.
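(A minimal sketch of the reset step described above, assuming the `_reset_state_for_orphaned_tasks` name suggested earlier in the review and the `DagRun.get_task_instances` / `executor.has_task` APIs; the state choices are illustrative, and this is not the literal diff from the PR.)

```python
from airflow.utils.state import State


def _reset_state_for_orphaned_tasks(self, dag_run, session=None):
    """Sketch: clear task instances that the scheduler believes are in
    flight for this DAG run but that the executor no longer knows about."""
    tis = dag_run.get_task_instances(
        state=[State.QUEUED, State.SCHEDULED], session=session)
    for ti in tis:
        # No record in the executor means the task was orphaned, e.g. the
        # scheduler died after committing the state or the MQ was purged.
        if not self.executor.has_task(ti):
            ti.state = State.NONE  # makes it eligible for scheduling again
    session.commit()
```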
Force-pushed from 9503dc9 to fb89276
👍
@plypaul I wonder if we still cover your use case of clearing the MQ now that this runs at the start of the executor, in case you do not use "num_runs". But we can fix that easily by moving it again.
    session=session
)
for dr in active_runs:
    self._reset_state_for_orphaned_tasks(dr, session=session)
This turns out to be quite slow - we have several thousand DAG runs, so when it starts, several minutes are spent going through this loop.
@plypaul curious--did you guys work around this? Is there a patch to fix?
It shouldn't break a sweat on a couple of thousand. Where exactly is it slow? In the DB or in Python?
I'll check if it is properly indexed in the DB. If it isn't, it might need an index. Otherwise I can rework it to run fully in the DB in one go; I think that can happen in one statement. @plypaul, if you can let me know some metrics, that would make it easier.
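(A rough sketch of the "one statement" idea, assuming a SQLAlchemy bulk UPDATE; `active_dag_ids` is a hypothetical precomputed list, and the executor-membership check from the per-run version would still have to happen separately, since it cannot run inside the database.)

```python
from airflow.models import TaskInstance as TI
from airflow.utils.state import State

# Hypothetical single-statement variant: push the state reset into the
# database instead of looping over DAG runs in Python.
(
    session.query(TI)
    .filter(TI.state.in_([State.QUEUED, State.SCHEDULED]))
    .filter(TI.dag_id.in_(active_dag_ids))  # assumed precomputed from active runs
    .update({TI.state: State.NONE}, synchronize_session=False)
)
session.commit()
```

This trades the per-DAG-run loop for one UPDATE, at the cost of first collecting the DAG IDs of the active runs.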
@bolkedebruin pretty sure this is related to the 'add index on task state' PR that's in flight right now :)
@criccomini @plypaul I think so too. I double-checked, and the query that is being run will indeed not use an index (the index is on dag_id, task_id, execution_date, state; task_id is missing from this query, so it won't hit).
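(For illustration only: an Alembic migration adding an index that would cover this query's filter columns. The index name and exact column list are assumptions here, not necessarily what the in-flight "add index on task state" PR does.)

```python
from alembic import op

# revision identifiers omitted for brevity

# Hypothetical migration: index the columns this query actually filters on,
# so the orphan scan does not fall back to a full table scan.
def upgrade():
    op.create_index(
        'ti_dag_date_state',  # assumed index name
        'task_instance',
        ['dag_id', 'execution_date', 'state'],
    )


def downgrade():
    op.drop_index('ti_dag_date_state', 'task_instance')
```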
Dear Airflow Maintainers,
Please accept this PR that addresses the following issues:
Tasks can get orphaned if the scheduler is killed in the middle
of processing them, or if the MQ queue is cleared without
a worker having picked them up. Tasks are now no longer set
to the scheduled state if they have not yet been sent to the
executor. In addition, a garbage collector scans the executor
for tasks that are no longer present and reschedules them if needed.