Conversation

@bolkedebruin (Contributor) commented Jun 9, 2016

Dear Airflow Maintainers,

Please accept this PR that addresses the following issues:

Tasks can get orphaned if the scheduler is killed in the middle of
processing them, or if the MQ queue is cleared before a worker has
picked them up. Tasks are now no longer set to the SCHEDULED state
if they have not yet been sent to the executor. In addition, a
garbage collector scans the executor for tasks that are no longer
present and reschedules them if needed.
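
A minimal sketch of the idea, assuming a SQLAlchemy session and roughly the modern module layout; the function name and the executor-bookkeeping checks are illustrative, not the exact patch:

```python
# Illustrative sketch only; the PR's actual method lives in airflow/jobs.py
# and differs in detail. Assumes BaseExecutor's queued_tasks/running dicts,
# which are keyed by the task instance key.
from airflow.models import TaskInstance
from airflow.utils.state import State

def reset_orphaned_tasks(dag_run, executor, session):
    """Reset SCHEDULED/QUEUED task instances that the executor never saw."""
    tis = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == dag_run.dag_id,
            TaskInstance.execution_date == dag_run.execution_date,
            TaskInstance.state.in_([State.SCHEDULED, State.QUEUED]),
        )
        .all()
    )
    for ti in tis:
        # If the executor has no record of the task, it was orphaned, e.g.
        # the scheduler died mid-flight or the MQ queue was purged.
        if ti.key not in executor.queued_tasks and ti.key not in executor.running:
            ti.state = State.NONE  # eligible for scheduling again
    session.commit()
```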

@bolkedebruin (Contributor, Author)

This addresses the issues discussed in #1514, @plypaul.

@bolkedebruin changed the title from "Collect orphaned tasks and reschedule them" to "[AIRFLOW-224] Collect orphaned tasks and reschedule them" on Jun 9, 2016
@codecov-io commented Jun 9, 2016

Current coverage is 68.13%

Merging #1581 into master will increase coverage by 0.12%

@@             master      #1581   diff @@
==========================================
  Files           116        116          
  Lines          8311       8330    +19   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           5653       5676    +23   
+ Misses         2658       2654     -4   
  Partials          0          0          

Powered by Codecov. Last updated by 6270dcf...fb89276

@bolkedebruin (Contributor, Author)

@plypaul I'm working on adding some tests. Please let me know if you think this approach covers our discussion.

Review comment on airflow/jobs.py (outdated)
Contributor:

Since garbage collection usually refers to memory management, can we rename this? Maybe _reset_state_for_orphaned_tasks?

@plypaul (Contributor) commented Jun 10, 2016

Nice work! This looks like it will handle the cases we talked about in the other PR.

@bolkedebruin (Contributor, Author) commented Jun 10, 2016

@plypaul this should be ready for review. The build seems OK (Postgres on Python 3.4 is not failing here: https://travis-ci.org/bolkedebruin/airflow)

@mistercrunch (Member)

I haven't been following all the development on the scheduler but what I see here looks good to me. I'm really excited to see flaws I baked into the scheduler a long time ago getting addressed!

I think @plypaul is probably the best person to review this part of the code at this point though.

@bolkedebruin (Contributor, Author)

@plypaul ?

Review comment on airflow/jobs.py (outdated)
Contributor:

I don't think we support multiple schedulers officially? Without locks, there are a lot of race conditions.

Contributor (Author):

There are multiple people who do run multiple schedulers, and we don't officially discourage it either. The updates I made to the scheduler make it almost safe to do so (only UP_FOR_RETRY tasks still have that issue).

@plypaul (Contributor) commented Jun 15, 2016

Sorry, I was a little backlogged there. The PR looks good to me. One thought I had was that finding orphaned tasks should only be needed at startup to bootstrap the executor state.
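
In scheduler terms, the suggestion amounts to something like the following; everything here except the _reset_state_for_orphaned_tasks name proposed above is a hypothetical placeholder, not Airflow's actual scheduler internals:

```python
# Hypothetical placement per the suggestion: run the orphan reset once,
# right after the executor starts, instead of on every scheduler loop.
from airflow.jobs import BaseJob

class SchedulerJob(BaseJob):
    def _execute(self):
        self.executor.start()
        # Bootstrap the executor state a single time at startup.
        self._reset_state_for_orphaned_tasks()
        while not self._should_stop():       # placeholder loop condition
            self._run_one_scheduling_pass()  # placeholder method
```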

@bolkedebruin (Contributor, Author)

Np. Good point actually. I'll move it to the start of the scheduler/executor.

@bolkedebruin (Contributor, Author)

@plypaul OK, I moved the logic so it only runs after the executor starts.

@plypaul (Contributor) commented Jun 15, 2016

👍

@asfgit merged commit fb89276 into apache:master on Jun 16, 2016
@bolkedebruin (Contributor, Author)

@plypaul I wonder if we still cover your use case of clearing the MQ now that this runs at the start of the executor, in case you do not use "num_runs". But we can fix that easily by moving it again.

```python
    session=session
)
for dr in active_runs:
    self._reset_state_for_orphaned_tasks(dr, session=session)
```
@plypaul (Contributor) commented Jun 29, 2016

This turns out to be quite slow - we have several thousand DAG runs, so when it starts, several minutes are spent going through this loop.

Contributor:

@plypaul curious, did you guys work around this? Is there a patch to fix it?

Contributor (Author):

It shouldn't break a sweat on a couple of thousand. Where exactly is it slow? In the DB or in Python?

Contributor (Author):

I'll check whether it is properly indexed in the DB; if it isn't, it might need an index. Otherwise I can rework it to run fully in the DB in one go; that can happen in one statement, I think. @plypaul, if you can let me know some metrics, that would make it easier.
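
A hedged sketch of the "one statement" variant, using a correlated EXISTS so the reset happens in a single bulk UPDATE; the model names are Airflow's, but this function is illustrative, not the actual follow-up patch:

```python
# Hypothetical single-statement rework: reset every orphaned
# SCHEDULED/QUEUED task instance belonging to a running DAG run in one
# UPDATE, instead of looping over DagRun objects in Python.
from sqlalchemy import and_, exists
from airflow.models import DagRun, TaskInstance
from airflow.utils.state import State

def reset_orphaned_tasks_in_one_statement(session):
    # Correlated EXISTS: the task instance's (dag_id, execution_date)
    # must match a DAG run that is currently RUNNING.
    belongs_to_running_run = exists().where(
        and_(
            DagRun.dag_id == TaskInstance.dag_id,
            DagRun.execution_date == TaskInstance.execution_date,
            DagRun.state == State.RUNNING,
        )
    )
    (
        session.query(TaskInstance)
        .filter(TaskInstance.state.in_([State.SCHEDULED, State.QUEUED]))
        .filter(belongs_to_running_run)
        # synchronize_session=False keeps this a plain bulk UPDATE.
        .update({TaskInstance.state: State.NONE}, synchronize_session=False)
    )
    session.commit()
```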

Contributor:

@bolkedebruin pretty sure this is related to the 'add index on task state' PR that's in flight right now :)

@bolkedebruin (Contributor, Author) commented Jun 30, 2016

@criccomini @plypaul I think so too. I double-checked, and the query being run indeed will not use the index (the index is on dag_id, task_id, execution_date, state; task_id is missing from this query, so the composite index won't be hit beyond its leading column).
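
To illustrate the leftmost-prefix point; the index and query shapes below are representative, not copied from the migration:

```python
# Representative shapes only, not the actual migration or query.
from sqlalchemy import Index
from airflow.models import TaskInstance

ti = TaskInstance.__table__.c  # underlying table columns

# The existing composite index: a query can only seek through it in
# column order, i.e. dag_id, then task_id, then execution_date, then state.
ti_index = Index(
    "ti_dag_task_date_state",
    ti.dag_id, ti.task_id, ti.execution_date, ti.state,
)

# The per-run query filters on dag_id, execution_date and state but NOT
# task_id, roughly:
#
#   SELECT ... FROM task_instance
#   WHERE dag_id = :dag_id
#     AND execution_date = :execution_date
#     AND state IN ('scheduled', 'queued')
#
# With task_id absent, the index is usable only up to dag_id, so every
# row for that DAG is scanned. An index on (dag_id, execution_date,
# state), or on state alone as in the in-flight "add index on task
# state" PR, would let the planner do an index seek instead.
```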
