
First draft of blocked task dependency explainer #1435

Closed
aoen wants to merge 8 commits into master from ddavydov/why_isnt_my_task_running_view

Conversation

@aoen
Contributor

@aoen aoen commented Apr 26, 2016

THIS IS A DRAFT WIP PULL REQUEST
This is a draft for a solution to #1383 among several other things.
There are tons of things broken/missing here; I'm just putting this out there to get some initial eyes on the high-level changes.

Goals

  • Simplify, consolidate, and make consistent the logic of whether or not a task should be run
  • Provide a view that gives insight into why a task instance is not currently running (so that, in the majority of cases, users no longer have to dig through the scheduler logs to find out why a task instance isn't running)
    e.g. (this will not be the final product):
    [example screenshot of the proposed blocked-dependency view omitted]

Things to review in this PR

  • The high-level idea of refactoring all of the dependency checking for regular task execution, backfills, etc. into one single function
  • Any of the functional changes (not the code itself) listed in "Some big changes made in this PR"
  • The design of the BaseTIDep class (a rough sketch of the idea follows this list)
  • Any changes in design that would require a complete rewrite of the current changes (where it would be non-trivial to change the code in the PRs diff to conform with the new design)
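
To make the BaseTIDep idea concrete, here is a rough sketch of the interface (the class names, method signatures, and the TIDepStatus tuple are illustrative assumptions, not the exact code in this diff):

```python
from collections import namedtuple

# One result per check: which dependency it was, whether it passed, and a
# human-readable reason that the "why isn't my task running?" view can surface.
TIDepStatus = namedtuple('TIDepStatus', ['dep_name', 'passed', 'reason'])


class BaseTIDep(object):
    """Base class for a single "should this task instance run?" dependency."""

    NAME = "Base Task Instance Dependency"

    @classmethod
    def get_dep_statuses(cls, ti, session):
        """Yield TIDepStatus objects for the given task instance."""
        raise NotImplementedError

    @classmethod
    def is_met(cls, ti, session):
        # The dependency is met only if every status it reports has passed.
        return all(status.passed for status in cls.get_dep_statuses(ti, session))


class NotRunningDep(BaseTIDep):
    """Example concrete dependency: the task instance must not already be running."""

    NAME = "Task Instance Not Already Running"

    @classmethod
    def get_dep_statuses(cls, ti, session):
        passed = ti.state != 'running'
        reason = "" if passed else "The task instance is already running."
        yield TIDepStatus(cls.NAME, passed, reason)
```

Each concrete dependency reports its own pass/fail statuses, and the view layer only needs to collect the failing ones to explain why a task instance is blocked.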

Things NOT to review in this PR

  • The view layer for surfacing the failing dependencies to users
  • Bugs, debugging statements that should be removed, lack of tests, lack of comments, etc.
  • Failing CI / merge conflicts: due to the somewhat large scope of this change I will iterate on a fixed SHA on master and then either rebase all at once or break the PR up into smaller parts and release them individually
  • Performance/DB hits (Max had some valid concerns and I will test for this and optimize/cache as appropriate)

Some big changes made in this PR

  • Paused DAG checking now occurs at the task instance level in models.py instead of in jobs.py to consolidate the "should this task run" logic into one place
  • Pool behavior in backfills is now consistent with how regular tasks are pooled (pools are always respected in backfills).
    This will break one use case:
    Using pools to restrict some resource on the airflow executors themselves (rather than an external resource like a DB), e.g. a task uses 60% of the CPU on a worker, so we restrict that task's pool size to 1 to prevent two of those tasks from running on the same host. When backfilling a task of this type, the backfill will now wait for the pool to have open slots before running the task, even though this isn't necessary if the backfill runs on a different host outside of the pool. I think breaking this use case is OK since it is a hack that stems from not having a proper resource isolation solution (e.g. Mesos should be used in this case instead).
  • To make things less confusing for users, force running a task will now override/ignore the following (a rough sketch of this behavior follows below):
    • the task instance's pool being full
    • the task instance's execution date being in the future
    • the task instance being in the retry waiting period
    • the task instance's task ending prior to the task instance's execution date
    • the task instance already being queued
    • the task instance having already succeeded
    • the task instance being in the shutdown state
    • WILL NOT OVERRIDE: the task instance already running
  • SLA miss emails will now include all tasks that did not finish for a particular DAG run, even if a task didn't run because its depends_on_past dependency was not met
  • Failed tasks will no longer be considered "runnable"; they must be force-run or cleared first from the UI to be run, just like successful tasks

I can revert these changes in behavior, but it will make the code messier, and I think it's a huge win to be as consistent as possible (both for code simplicity and for intuitiveness for users) across the board whenever the question "should this task be run right now?" is asked.
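
As a rough sketch of the force-run behavior described above (the dependency names here are placeholders, not the PR's actual identifiers):

```python
# Checks that a force-run is allowed to waive.
IGNORABLE_WHEN_FORCED = {
    'pool_full',
    'execution_date_in_future',
    'in_retry_period',
    'task_ended_before_execution_date',
    'already_queued',
    'already_succeeded',
    'in_shutdown_state',
}


def blocking_deps(failed_deps, force=False):
    """Given names of failed dependency checks, return those that still block the run."""
    if not force:
        return set(failed_deps)
    # A forced run waives the checks above but never ignores 'already_running'.
    return {dep for dep in failed_deps if dep not in IGNORABLE_WHEN_FORCED}
```

For example, blocking_deps({'pool_full', 'already_running'}, force=True) would still return {'already_running'}.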

Future Work (will not be released in the first version)

  • Show one more dagrun in the tree view (even when that dagrun's execution date hasn't occurred yet) so that users can see why task instances in the next dagrun are blocked. I wanted to include this in the first release but it's non-trivial to do (one way to solve this is to generate DAG runs before their execution date occurs).
  • Break down failed dependency explanations better (e.g. if the trigger rule requiring all upstream tasks to succeed fails, indicate the specific upstream tasks that failed)
  • Make failing dependencies for task instances more visible to users (e.g. asynchronously display failing dependencies when mousing over a task instance in the tree view).
  • Parallelize task instance dependency checking.
  • Indicate that a task isn't running because it was manually cleared: if a task is manually cleared the scheduler won't automatically rerun it, but there is no way of knowing this for sure without some kind of flag that marks the task state as cleared.
  • Can potentially add tips for fixing a failing dependency in the new view that shows failed task instance dependencies (e.g. for "depends_on_past", when the previous task instance failed we can suggest getting that task to succeed, or even provide a button to clear it right from that view)
  • Additional task instance dependencies, e.g. there are no schedulers running (a task can't get scheduled until a scheduler is running), or the airflow scheduler queue is backed up
  • Forcing tasks in the UI should give users feedback on why their task didn't get forced (e.g. task was already running)
  • Keep history of changes to blocked dependencies that are accessible via the UI (this is useful to know why a task's execution was delayed)

@mistercrunch @jlowin @plypaul

@landscape-bot

Code Health
Repository health decreased by 1% when pulling 3a8a05a on ddavydov/why_isnt_my_task_running_view into 1af41d9 on master.

@jlowin
Member

jlowin commented Apr 26, 2016

I'm really excited about this 👍

@aoen aoen force-pushed the ddavydov/why_isnt_my_task_running_view branch from 3a8a05a to 64ec80b on April 27, 2016 at 00:31
@landscape-bot

Code Health
Repository health decreased by 0.70% when pulling 64ec80b on ddavydov/why_isnt_my_task_running_view into f1ff65c on master.

@landscape-bot

Code Health
Repository health decreased by 0.67% when pulling 57fd9e4 on ddavydov/why_isnt_my_task_running_view into f1ff65c on master.

@landscape-bot

Code Health
Repository health decreased by 0.84% when pulling 1557a01 on ddavydov/why_isnt_my_task_running_view into f1ff65c on master.

from sqlalchemy.ext.declarative import declarative_base, declared_attr
from sqlalchemy.dialects.mysql import LONGTEXT
from sqlalchemy.orm import relationship, synonym
from sqlalchemy.orm import reconstructor, relationship, synonym
Member

first time I see @reconstructor; SQLAlchemy is one of those bottomless libs that keep on giving

Contributor Author

@aoen aoen Apr 28, 2016

Agreed, it's pretty cool! I think at some point it would be nice to always have a "full" state of TaskInstance if possible (rather than having half of the properties filled in from the DB but missing the rest, like task, and vice versa when the task instance is constructed using the constructor), maybe by lazy loading properties (e.g. once a DB column is referenced, call refresh_from_db automatically).
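
For reference, a minimal toy example of how SQLAlchemy's @reconstructor hook behaves (this is not the actual TaskInstance model):

```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import reconstructor

Base = declarative_base()


class ToyTaskInstance(Base):
    __tablename__ = 'toy_task_instance'

    id = Column(Integer, primary_key=True)
    task_id = Column(String(250))

    def __init__(self, task_id):
        self.task_id = task_id
        self.init_on_load()

    @reconstructor
    def init_on_load(self):
        # Called automatically after an instance is loaded from the DB (where
        # __init__ is skipped), and explicitly from __init__ above, so in-memory
        # attributes are populated either way.
        self.test_mode = False
```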

@landscape-bot

Code Health
Repository health decreased by 0.54% when pulling 6a22e75 on ddavydov/why_isnt_my_task_running_view into f1ff65c on master.

return all(status.passed for status in
           cls.get_dep_statuses(
               ti,
               session,
Member

You can look into the @utils.provide_session decorator: it basically allows you to pass an active session, but if you don't, it instantiates one and takes care of closing it at the end.
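
Roughly, a provide_session-style decorator works like this (simplified sketch, not the exact Airflow implementation):

```python
from functools import wraps

from sqlalchemy.orm import sessionmaker

Session = sessionmaker()  # stand-in for Airflow's configured session factory


def provide_session(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        if kwargs.get('session'):
            # The caller supplied a session and owns its lifecycle.
            return func(*args, **kwargs)
        session = Session()
        try:
            kwargs['session'] = session
            result = func(*args, **kwargs)
            session.commit()
            return result
        finally:
            session.close()
    return wrapper
```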

Contributor Author

@aoen aoen Apr 27, 2016

Noted. I want to save you time by asking you not to go too deep with this review though (I had this issue on my TODO list already, along with the improper use of session.commit in are_dependencies_met). I think just reviewing the things in the "Things to review in this PR" section of the PR description would be a better use of your time, although I'm certainly not complaining about a more granular review since you could catch something I didn't already / wouldn't have noticed.

@mistercrunch
Member

amazement

Notes:

  • I'd rather force not override "already running" unless it kills and cleans up first, and that's tricky; we'd probably need a new state "shutdown_and_restart", where the task picks up the shutdown, handles it, and marks itself as runnable somehow
  • SLA is really just about whether you effectively got your task to succeed by a certain time. I.e. if core_data isn't delivered by 9AM, we have missed our SLA and should get an email
  • DAG pause evaluation is more efficient at a higher level, but I totally see how having it all in one place makes the dependency engine nicer. People have asked for a task-level pause state before and that would belong here nicely. Maybe DAG pause/unpause could effectively toggle all task-level pauses when that's implemented, but that's for another PR. I wouldn't worry about scheduler performance as we'll soon run multiple multiprocessing schedulers.

@mistercrunch
Member

ignore_depends_on_past is kind of a nasty thing; it's mostly there to allow the start_date of remote backfills to run

@mistercrunch
Member

mistercrunch commented Apr 28, 2016

I haven't gotten to the bottom of this thought yet, but I feel like get_dep_statuses should only receive a task instance and maybe some run_context object (TBD). I'd hate to add a new run_context item and have to go copy paste it into all these methods that are not using it...

Trying to sort out the other args of get_dep_statuses:

  • session: needed, let's use @utils.provide_session
  • include_queued: haven't dug in this yet
  • ignore_depends_on_past: maybe part of some run context, it's related to the backfill's start_date
  • flag_upstream_failed: this should be insulated in some routine that's not dependency related, maybe something like "infer_states", that would just take care of all the cases where the state can just be inferred and assigned in place, without actually running the task
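
Something like the following shape, maybe (the names are just illustrative, not code from this PR):

```python
class DepContext(object):
    """Bundle of per-run flags so adding a flag doesn't touch every dependency's signature."""

    def __init__(self,
                 include_queued=False,
                 ignore_depends_on_past=False,
                 flag_upstream_failed=False):
        self.include_queued = include_queued
        self.ignore_depends_on_past = ignore_depends_on_past
        self.flag_upstream_failed = flag_upstream_failed


# get_dep_statuses would then only need (ti, session, dep_context), e.g.:
#   def get_dep_statuses(cls, ti, session, dep_context):
#       ...
```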

@aoen
Contributor Author

aoen commented Apr 28, 2016

Much thanks for the in-depth review!

I'd rather force not override "already running" unless it kills and cleans up first, and that's tricky; we'd probably need a new state "shutdown_and_restart", where the task picks up the shutdown, handles it, and marks itself as runnable somehow

Is the scenario you are worrying about (two workers running the same task instance) already possible? For example if a worker's communication with the DB gets interrupted, then the scheduler assigns the task instance to a new worker, and then the communication between the initial worker and the DB resumes.

SLA is really just about whether you effectively got your task to succeed by a certain time. I.e. if core_data isn't delivered by 9AM, we have missed our SLA and should get an email

This makes sense. I misspoke in the PR description though: SLAs should still be sent; the difference would be that the SLA email would now omit task instances in the dagrun that didn't succeed for reasons other than depends_on_past not being met (e.g. a task that couldn't run because its pool was full won't get reported in the email). I think I'm going to just include all tasks that don't have a successful status in the SLA miss email, even those stuck on depends_on_past, to align with your criteria (if a task caused core_data to not be delivered by 9AM, the task caused the DAG to miss its SLA regardless of its depends_on_past dependency), plus it stops treating depends_on_past differently from the other dependencies like the pool being full. LMK what you think.
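
To make that criterion concrete, something like this (assumed names, not the PR's code):

```python
def sla_miss_task_instances(dag_run_task_instances):
    # Report every task instance in the DAG run that is not successful,
    # regardless of why it didn't finish (depends_on_past, full pool, etc.).
    return [ti for ti in dag_run_task_instances if ti.state != 'success']
```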

DAG pause evaluation is more efficient at a higher level, but I totally see how having it all in one place makes the dependency engine nicer. People have asked for a task-level pause state before and that would belong here nicely. Maybe DAG pause/unpause could effectively toggle all task-level pauses when that's implemented, but that's for another PR. I wouldn't worry about scheduler performance as we'll soon run multiple multiprocessing schedulers.

Agreed about the efficiency, was going to look into caching if this causes perf issues.

ignore_depends_on_past is kind of a nasty thing; it's mostly there to allow the start_date of remote backfills to run

The newfound power of the force flag could be used instead of ignore_depends_on_past, but making "force" the default for every backfill could potentially be a bit dangerous as users could e.g. unintentionally force run over a large range of already successful tasks in a backfill or violate a pool constraint. If you have any ideas let me know.

I haven't gotten to the bottom of this thought yet, but I feel like get_dep_statuses should only receive a task instance and maybe some run_context object (TBD). I'd hate to add a new run_context item and have to go copy paste it into all these methods that are not using it...
Trying to sort out the other args of get_dep_statuses:
session: needed, let's use @utils.provide_session
include_queued: haven't dug in this yet
ignore_depends_on_past: maybe part of some run context, it's related to the backfill's start_date
flag_upstream_failed: this should be insulated in some routine that's not dependency related, maybe something like "infer_states", that would just take care of all the cases where the state can just be inferred and assigned in place, without actually running the task

Agreed about not passing in a bunch of different flags. There is actually a TODO above that part of the code in the PR to use a context parameter instead (it will be addressed in this PR).

For flag_upstream_failed I would prefer to leave the fix for another PR, since it was an existing hack and the scope of this PR is already a bit dangerously large.

@landscape-bot

Code Health
Repository health decreased by 1% when pulling c26c798 on ddavydov/why_isnt_my_task_running_view into f1ff65c on master.

@aoen aoen closed this May 19, 2016
@asfgit asfgit deleted the ddavydov/why_isnt_my_task_running_view branch January 4, 2017 15:48