Conversation

@bolkedebruin (Contributor) commented Jun 9, 2016

Dear Airflow Maintainers,

Please accept this PR that addresses the following issues:

Tasks can get orphaned if the scheduler is killed in the middle of
processing them, or if the MQ queue is cleared before a worker has
picked them up. Tasks are now no longer set to the SCHEDULED state
if they have not yet been sent to the executor. In addition, a
garbage collector scans the executor for tasks that are no longer
present and reschedules them if needed.
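
A minimal sketch of the idea, assuming a SQLAlchemy session and roughly the modern module layout; the function name and the executor-bookkeeping checks are illustrative, not the exact patch:

```python
# Illustrative sketch only; the PR's actual method lives in airflow/jobs.py
# and differs in detail. Assumes BaseExecutor's queued_tasks/running dicts,
# which are keyed by the task instance key.
from airflow.models import TaskInstance
from airflow.utils.state import State

def reset_orphaned_tasks(dag_run, executor, session):
    """Reset SCHEDULED/QUEUED task instances that the executor never saw."""
    tis = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == dag_run.dag_id,
            TaskInstance.execution_date == dag_run.execution_date,
            TaskInstance.state.in_([State.SCHEDULED, State.QUEUED]),
        )
        .all()
    )
    for ti in tis:
        # If the executor has no record of the task, it was orphaned, e.g.
        # the scheduler died mid-flight or the MQ queue was purged.
        if ti.key not in executor.queued_tasks and ti.key not in executor.running:
            ti.state = State.NONE  # eligible for scheduling again
    session.commit()
```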

@bolkedebruin (Contributor, Author)

This addresses the issues discussed in #1514, @plypaul.

@bolkedebruin changed the title from "Collect orphaned tasks and reschedule them" to "[AIRFLOW-224] Collect orphaned tasks and reschedule them" on Jun 9, 2016
@codecov-io commented Jun 9, 2016

Current coverage is 68.13%

Merging #1581 into master will increase coverage by 0.12%

@@             master      #1581   diff @@
==========================================
  Files           116        116          
  Lines          8311       8330    +19   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           5653       5676    +23   
+ Misses         2658       2654     -4   
  Partials          0          0          

Powered by Codecov. Last updated by 6270dcf...fb89276

@bolkedebruin (Contributor, Author)

@plypaul I'm working on adding some tests. Please let me know if you think this approach covers our discussion.

Review comment on airflow/jobs.py (outdated)
Contributor:

Since garbage collection usually refers to memory management, can we rename this? Maybe _reset_state_for_orphaned_tasks?

@plypaul (Contributor) commented Jun 10, 2016

Nice work! This looks like it will handle the cases we talked about in the other PR.

@bolkedebruin (Contributor, Author) commented Jun 10, 2016

@plypaul this should be ready for review. The build seems OK (Postgres on Python 3.4 is not failing here: https://travis-ci.org/bolkedebruin/airflow)

@mistercrunch (Member)

I haven't been following all the development on the scheduler but what I see here looks good to me. I'm really excited to see flaws I baked into the scheduler a long time ago getting addressed!

I think @plypaul is probably the best person to review this part of the code at this point though.

@bolkedebruin (Contributor, Author)

@plypaul ?

Review comment on airflow/jobs.py (outdated)
Contributor:

I don't think we support multiple schedulers officially? Without locks, there are a lot of race conditions.

Contributor (Author):

There are multiple people who do run multiple schedulers, and we don't officially discourage it either. The updates I made to the scheduler make it almost safe to do so (only UP_FOR_RETRY tasks still have that issue).

@plypaul (Contributor) commented Jun 15, 2016

Sorry, I was a little backlogged there. The PR looks good to me. One thought I had was that finding orphaned tasks should only be needed at startup to bootstrap the executor state.
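
In scheduler terms, the suggestion amounts to something like the following; everything here except the _reset_state_for_orphaned_tasks name proposed above is a hypothetical placeholder, not Airflow's actual scheduler internals:

```python
# Hypothetical placement per the suggestion: run the orphan reset once,
# right after the executor starts, instead of on every scheduler loop.
from airflow.jobs import BaseJob

class SchedulerJob(BaseJob):
    def _execute(self):
        self.executor.start()
        # Bootstrap the executor state a single time at startup.
        self._reset_state_for_orphaned_tasks()
        while not self._should_stop():       # placeholder loop condition
            self._run_one_scheduling_pass()  # placeholder method
```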

@bolkedebruin (Contributor, Author)

Np. Good point actually. I'll move it to the start of the scheduler/executor.

@bolkedebruin (Contributor, Author)

@plypaul OK, I moved the logic so it only runs after the executor starts.

@plypaul (Contributor) commented Jun 15, 2016

👍

@asfgit merged commit fb89276 into apache:master on Jun 16, 2016
@bolkedebruin (Contributor, Author)

@plypaul I wonder if we still cover your use case of clearing the MQ now that this runs at the start of the executor, in case you do not use "num_runs". But we can fix that easily by moving it again.

```python
    session=session
)
for dr in active_runs:
    self._reset_state_for_orphaned_tasks(dr, session=session)
```
@plypaul (Contributor) commented Jun 29, 2016

This turns out to be quite slow - we have several thousand DAG runs, so when it starts, several minutes are spent going through this loop.

Contributor:

@plypaul curious, did you guys work around this? Is there a patch to fix it?

Contributor (Author):

It shouldn't break a sweat on a couple of thousand. Where exactly is it slow? In the DB or in Python?

Contributor (Author):

I'll check whether it is properly indexed in the DB; if it isn't, it might need an index. Otherwise I can rework it to run fully in the DB in one go; that can happen in one statement, I think. @plypaul, if you can let me know some metrics, that would make it easier.
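
A hedged sketch of the "one statement" variant, using a correlated EXISTS so the reset happens in a single bulk UPDATE; the model names are Airflow's, but this function is illustrative, not the actual follow-up patch:

```python
# Hypothetical single-statement rework: reset every orphaned
# SCHEDULED/QUEUED task instance belonging to a running DAG run in one
# UPDATE, instead of looping over DagRun objects in Python.
from sqlalchemy import and_, exists
from airflow.models import DagRun, TaskInstance
from airflow.utils.state import State

def reset_orphaned_tasks_in_one_statement(session):
    # Correlated EXISTS: the task instance's (dag_id, execution_date)
    # must match a DAG run that is currently RUNNING.
    belongs_to_running_run = exists().where(
        and_(
            DagRun.dag_id == TaskInstance.dag_id,
            DagRun.execution_date == TaskInstance.execution_date,
            DagRun.state == State.RUNNING,
        )
    )
    (
        session.query(TaskInstance)
        .filter(TaskInstance.state.in_([State.SCHEDULED, State.QUEUED]))
        .filter(belongs_to_running_run)
        # synchronize_session=False keeps this a plain bulk UPDATE.
        .update({TaskInstance.state: State.NONE}, synchronize_session=False)
    )
    session.commit()
```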

Contributor:

@bolkedebruin pretty sure this is related to the 'add index on task state' PR that's in flight right now :)

@bolkedebruin (Contributor, Author) commented Jun 30, 2016

@criccomini @plypaul I think so too. I double-checked, and the query being run indeed will not use the index (the index is on dag_id, task_id, execution_date, state; task_id is missing from this query, so the composite index won't be hit beyond its leading column).
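
To illustrate the leftmost-prefix point; the index and query shapes below are representative, not copied from the migration:

```python
# Representative shapes only, not the actual migration or query.
from sqlalchemy import Index
from airflow.models import TaskInstance

ti = TaskInstance.__table__.c  # underlying table columns

# The existing composite index: a query can only seek through it in
# column order, i.e. dag_id, then task_id, then execution_date, then state.
ti_index = Index(
    "ti_dag_task_date_state",
    ti.dag_id, ti.task_id, ti.execution_date, ti.state,
)

# The per-run query filters on dag_id, execution_date and state but NOT
# task_id, roughly:
#
#   SELECT ... FROM task_instance
#   WHERE dag_id = :dag_id
#     AND execution_date = :execution_date
#     AND state IN ('scheduled', 'queued')
#
# With task_id absent, the index is usable only up to dag_id, so every
# row for that DAG is scanned. An index on (dag_id, execution_date,
# state), or on state alone as in the in-flight "add index on task
# state" PR, would let the planner do an index seek instead.
```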
