
First draft of blocked task dependency explainer #1435

Closed
aoen wants to merge 8 commits into master from ddavydov/why_isnt_my_task_running_view

Conversation

@aoen
Contributor

@aoen aoen commented Apr 26, 2016

THIS IS A DRAFT WIP PULL REQUEST
This is a draft for a solution to #1383 among several other things.
There are tons of things broken/missing here; I'm just putting this out there to get some initial eyes on the high-level changes.

Goals

  • Simplify, consolidate, and make consistent the logic of whether or not a task should be run
  • Provide a view that gives insight into why a task instance is not currently running (so that, in the majority of cases, users no longer have to dig through the scheduler logs to find out why a task instance isn't running)
    e.g. (this will not be the final product):
    [example screenshot of the proposed blocked-dependency view omitted]

Things to review in this PR

  • The high-level idea of refactoring all of the dependency checking for regular task execution, backfills, etc. into one single function
  • Any of the functional changes (not the code itself) listed in "Some big changes made in this PR"
  • The design of the BaseTIDep class (a rough sketch of the idea follows this list)
  • Any changes in design that would require a complete rewrite of the current changes (where it would be non-trivial to change the code in the PRs diff to conform with the new design)
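
To make the BaseTIDep idea concrete, here is a rough sketch of the interface (the class names, method signatures, and the TIDepStatus tuple are illustrative assumptions, not the exact code in this diff):

```python
from collections import namedtuple

# One result per check: which dependency it was, whether it passed, and a
# human-readable reason that the "why isn't my task running?" view can surface.
TIDepStatus = namedtuple('TIDepStatus', ['dep_name', 'passed', 'reason'])


class BaseTIDep(object):
    """Base class for a single "should this task instance run?" dependency."""

    NAME = "Base Task Instance Dependency"

    @classmethod
    def get_dep_statuses(cls, ti, session):
        """Yield TIDepStatus objects for the given task instance."""
        raise NotImplementedError

    @classmethod
    def is_met(cls, ti, session):
        # The dependency is met only if every status it reports has passed.
        return all(status.passed for status in cls.get_dep_statuses(ti, session))


class NotRunningDep(BaseTIDep):
    """Example concrete dependency: the task instance must not already be running."""

    NAME = "Task Instance Not Already Running"

    @classmethod
    def get_dep_statuses(cls, ti, session):
        passed = ti.state != 'running'
        reason = "" if passed else "The task instance is already running."
        yield TIDepStatus(cls.NAME, passed, reason)
```

Each concrete dependency reports its own pass/fail statuses, and the view layer only needs to collect the failing ones to explain why a task instance is blocked.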

Things NOT to review in this PR

  • The view layer for surfacing the failing dependencies to users
  • Bugs, debugging statements that should be removed, lack of tests, lack of comments, etc.
  • Failing CI / merge conflicts: due to the somewhat large scope of this change I will iterate on a fixed SHA on master and then either rebase all at once or break the PR up into smaller parts and release them individually
  • Performance/DB hits (Max had some valid concerns and I will test for this and optimize/cache as appropriate)

Some big changes made in this PR

  • Paused DAG checking now occurs at the task instance level in models.py instead of in jobs.py to consolidate the "should this task run" logic into one place
  • Pool behavior in backfills is now consistent with how regular tasks are pooled (pools are always respected in backfills).
    This will break one use case:
    Using pools to restrict some resource on the airflow executors themselves (rather than an external resource like a DB), e.g. a task uses 60% of the CPU on a worker, so we restrict that task's pool size to 1 to prevent two of those tasks from running on the same host. When backfilling a task of this type, the backfill will now wait for the pool to have open slots before running the task, even though this isn't necessary if the backfill runs on a different host outside of the pool. I think breaking this use case is OK since it is a hack that stems from not having a proper resource isolation solution (e.g. Mesos should be used in this case instead).
  • To make things less confusing for users, force running a task will now override/ignore the following (a rough sketch of this behavior follows below):
    • the task instance's pool being full
    • the task instance's execution date being in the future
    • the task instance being in the retry waiting period
    • the task instance's task ending prior to the task instance's execution date
    • the task instance already being queued
    • the task instance having already succeeded
    • the task instance being in the shutdown state
    • WILL NOT OVERRIDE: the task instance already running
  • SLA miss emails will now include all tasks that did not finish for a particular DAG run, even if a task didn't run because its depends_on_past dependency was not met
  • Failed tasks will no longer be considered "runnable"; they must be force-run or cleared first from the UI to be run, just like successful tasks

I can revert these changes in behavior, but it will make the code messier, and I think it's a huge win to be as consistent as possible (both for code simplicity and for intuitiveness for users) across the board whenever the question "should this task be run right now?" is asked.
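
As a rough sketch of the force-run behavior described above (the dependency names here are placeholders, not the PR's actual identifiers):

```python
# Checks that a force-run is allowed to waive.
IGNORABLE_WHEN_FORCED = {
    'pool_full',
    'execution_date_in_future',
    'in_retry_period',
    'task_ended_before_execution_date',
    'already_queued',
    'already_succeeded',
    'in_shutdown_state',
}


def blocking_deps(failed_deps, force=False):
    """Given names of failed dependency checks, return those that still block the run."""
    if not force:
        return set(failed_deps)
    # A forced run waives the checks above but never ignores 'already_running'.
    return {dep for dep in failed_deps if dep not in IGNORABLE_WHEN_FORCED}
```

For example, blocking_deps({'pool_full', 'already_running'}, force=True) would still return {'already_running'}.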

Future Work (will not be released in the first version)

  • Show one more dagrun in the tree view (even when that dagrun's execution date hasn't occurred yet) so that users can see why task instances in the next dagrun are blocked. I wanted to include this in the first release but it's non-trivial to do (one way to solve this is to generate DAG runs before their execution date occurs).
  • Break down failed dependency explanations better (e.g. if the trigger rule requiring all upstream tasks to succeed fails, indicate the specific upstream tasks that failed)
  • Make failing dependencies for task instances more visible to users (e.g. asynchronously display failing dependencies when mousing over a task instance in the tree view).
  • Parallelize task instance dependency checking.
  • Indicate that a task isn't running because it was manually cleared: if a task is manually cleared the scheduler won't automatically rerun it, but there is no way of knowing this for sure without some kind of flag that marks the task state as cleared.
  • Can potentially add tips for fixing a failing dependency in the new view that shows failed task instance dependencies (e.g. for "depends_on_past", when the previous task instance failed we can suggest getting that task to succeed, or even provide a button to clear it right from that view)
  • Additional task instance dependencies, e.g. there are no schedulers running (a task can't get scheduled until a scheduler is running), or the airflow scheduler queue is backed up
  • Forcing tasks in the UI should give users feedback on why their task didn't get forced (e.g. task was already running)
  • Keep history of changes to blocked dependencies that are accessible via the UI (this is useful to know why a task's execution was delayed)

@mistercrunch @jlowin @plypaul

@landscape-bot

Code Health
Repository health decreased by 1% when pulling 3a8a05a on ddavydov/why_isnt_my_task_running_view into 1af41d9 on master.

@jlowin
Member

jlowin commented Apr 26, 2016

I'm really excited about this 👍

@aoen aoen force-pushed the ddavydov/why_isnt_my_task_running_view branch from 3a8a05a to 64ec80b on April 27, 2016 at 00:31
@landscape-bot

Code Health
Repository health decreased by 0.70% when pulling 64ec80b on ddavydov/why_isnt_my_task_running_view into f1ff65c on master.

@landscape-bot

Code Health
Repository health decreased by 0.67% when pulling 57fd9e4 on ddavydov/why_isnt_my_task_running_view into f1ff65c on master.

@landscape-bot

Code Health
Repository health decreased by 0.84% when pulling 1557a01 on ddavydov/why_isnt_my_task_running_view into f1ff65c on master.

from sqlalchemy.ext.declarative import declarative_base, declared_attr
from sqlalchemy.dialects.mysql import LONGTEXT
from sqlalchemy.orm import relationship, synonym
from sqlalchemy.orm import reconstructor, relationship, synonym
Member

first time I see @reconstructor; SQLAlchemy is one of those bottomless libs that keep on giving

Contributor Author

@aoen aoen Apr 28, 2016

Agreed, it's pretty cool! I think at some point it would be nice to always have a "full" state of TaskInstance if possible (rather than having half of the properties filled in from the DB but missing the rest, like task, and vice versa when the task instance is constructed using the constructor), maybe by lazy loading properties (e.g. once a DB column is referenced, call refresh_from_db automatically).
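
For reference, a minimal toy example of how SQLAlchemy's @reconstructor hook behaves (this is not the actual TaskInstance model):

```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import reconstructor

Base = declarative_base()


class ToyTaskInstance(Base):
    __tablename__ = 'toy_task_instance'

    id = Column(Integer, primary_key=True)
    task_id = Column(String(250))

    def __init__(self, task_id):
        self.task_id = task_id
        self.init_on_load()

    @reconstructor
    def init_on_load(self):
        # Called automatically after an instance is loaded from the DB (where
        # __init__ is skipped), and explicitly from __init__ above, so in-memory
        # attributes are populated either way.
        self.test_mode = False
```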

@landscape-bot

Code Health
Repository health decreased by 0.54% when pulling 6a22e75 on ddavydov/why_isnt_my_task_running_view into f1ff65c on master.

return all(status.passed for status in
           cls.get_dep_statuses(
               ti,
               session,
Member

You can look into the @utils.provide_session decorator: it basically allows you to pass an active session, but if you don't, it instantiates one and takes care of closing it at the end.
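
Roughly, a provide_session-style decorator works like this (simplified sketch, not the exact Airflow implementation):

```python
from functools import wraps

from sqlalchemy.orm import sessionmaker

Session = sessionmaker()  # stand-in for Airflow's configured session factory


def provide_session(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        if kwargs.get('session'):
            # The caller supplied a session and owns its lifecycle.
            return func(*args, **kwargs)
        session = Session()
        try:
            kwargs['session'] = session
            result = func(*args, **kwargs)
            session.commit()
            return result
        finally:
            session.close()
    return wrapper
```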

Contributor Author

@aoen aoen Apr 27, 2016

Noted. I want to save you time by asking you not to go too deep with this review though (I had this issue on my TODO list already, along with the improper use of session.commit in are_dependencies_met). I think just reviewing the things in the "Things to review in this PR" section of the PR description would be a better use of your time, although I'm certainly not complaining about a more granular review since you could catch something I didn't already / wouldn't have noticed.

@mistercrunch
Member

amazement

Notes:

  • I'd rather force not override "already running" unless it kills and cleans up first, and that's tricky; we'd probably need a new state "shutdown_and_restart", where the task picks up the shutdown, handles it, and marks itself as runnable somehow
  • SLA is really just about whether you effectively got your task to succeed by a certain time. I.e. if core_data isn't delivered by 9AM, we have missed our SLA and should get an email
  • DAG pause evaluation is more efficient at a higher level, but I totally see how having it all in one place makes the dependency engine nicer. People have asked for a task-level pause state before and that would belong here nicely. Maybe DAG pause/unpause could effectively toggle all task-level pauses when that's implemented, but that's for another PR. I wouldn't worry about scheduler performance as we'll soon run multiple multiprocessing schedulers.

@mistercrunch
Member

ignore_depends_on_past is kind of a nasty thing; it's mostly there to allow the start_date of remote backfills to run

@mistercrunch
Member

mistercrunch commented Apr 28, 2016

I haven't gotten to the bottom of this thought yet, but I feel like get_dep_statuses should only receive a task instance and maybe some run_context object (TBD). I'd hate to add a new run_context item and have to go copy paste it into all these methods that are not using it...

Trying to sort out the other args of get_dep_statuses:

  • session: needed, let's use @utils.provide_session
  • include_queued: haven't dug in this yet
  • ignore_depends_on_past: maybe part of some run context, it's related to the backfill's start_date
  • flag_upstream_failed: this should be insulated in some routine that's not dependency related, maybe something like "infer_states", that would just take care of all the cases where the state can just be inferred and assigned in place, without actually running the task
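
Something like the following shape, maybe (the names are just illustrative, not code from this PR):

```python
class DepContext(object):
    """Bundle of per-run flags so adding a flag doesn't touch every dependency's signature."""

    def __init__(self,
                 include_queued=False,
                 ignore_depends_on_past=False,
                 flag_upstream_failed=False):
        self.include_queued = include_queued
        self.ignore_depends_on_past = ignore_depends_on_past
        self.flag_upstream_failed = flag_upstream_failed


# get_dep_statuses would then only need (ti, session, dep_context), e.g.:
#   def get_dep_statuses(cls, ti, session, dep_context):
#       ...
```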

@aoen
Contributor Author

aoen commented Apr 28, 2016

Much thanks for the in-depth review!

I'd rather force not override "already running" unless it kills and cleans up first, and that's tricky; we'd probably need a new state "shutdown_and_restart", where the task picks up the shutdown, handles it, and marks itself as runnable somehow

Is the scenario you are worrying about (two workers running the same task instance) already possible? For example if a worker's communication with the DB gets interrupted, then the scheduler assigns the task instance to a new worker, and then the communication between the initial worker and the DB resumes.

SLA is really just about whether you effectively got your task to succeed by a certain time. I.e. if core_data isn't delivered by 9AM, we have missed our SLA and should get an email

This makes sense. I misspoke in the PR description though: SLAs should still be sent; the difference would be that the SLA email would now omit task instances in the dagrun that didn't succeed for reasons other than depends_on_past not being met (e.g. a task that couldn't run because its pool was full won't get reported in the email). I think I'm going to just include all tasks that don't have a successful status in the SLA miss email, even those stuck on depends_on_past, to align with your criteria (if a task caused core_data to not be delivered by 9AM, the task caused the DAG to miss its SLA regardless of its depends_on_past dependency), plus it stops treating depends_on_past differently from the other dependencies like the pool being full. LMK what you think.
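
To make that criterion concrete, something like this (assumed names, not the PR's code):

```python
def sla_miss_task_instances(dag_run_task_instances):
    # Report every task instance in the DAG run that is not successful,
    # regardless of why it didn't finish (depends_on_past, full pool, etc.).
    return [ti for ti in dag_run_task_instances if ti.state != 'success']
```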

DAG pause evaluation is more efficient at a higher level, but I totally see how having it all in one place makes the dependency engine nicer. People have asked for a task-level pause state before and that would belong here nicely. Maybe DAG pause/unpause could effectively toggle all task-level pauses when that's implemented, but that's for another PR. I wouldn't worry about scheduler performance as we'll soon run multiple multiprocessing schedulers.

Agreed about the efficiency, was going to look into caching if this causes perf issues.

ignore_depends_on_past is kind of a nasty thing; it's mostly there to allow the start_date of remote backfills to run

The newfound power of the force flag could be used instead of ignore_depends_on_past, but making "force" the default for every backfill could potentially be a bit dangerous as users could e.g. unintentionally force run over a large range of already successful tasks in a backfill or violate a pool constraint. If you have any ideas let me know.

I haven't gotten to the bottom of this thought yet, but I feel like get_dep_statuses should only receive a task instance and maybe some run_context object (TBD). I'd hate to add a new run_context item and have to go copy paste it into all these methods that are not using it...
Trying to sort out the other args of get_dep_statuses:
session: needed, let's use @utils.provide_session
include_queued: haven't dug in this yet
ignore_depends_on_past: maybe part of some run context, it's related to the backfill's start_date
flag_upstream_failed: this should be insulated in some routine that's not dependency related, maybe something like "infer_states", that would just take care of all the cases where the state can just be inferred and assigned in place, without actually running the task

Agreed about not passing in a bunch of different flags. There is actually a TODO above that part of the code in the PR to use a context parameter instead (it will be addressed in this PR).

For flag_upstream_failed I would prefer to leave the fix for another PR, since it was an existing hack and the scope of this PR is already a bit dangerously large.

@landscape-bot

Code Health
Repository health decreased by 1% when pulling c26c798 on ddavydov/why_isnt_my_task_running_view into f1ff65c on master.

@aoen aoen closed this May 19, 2016
@asfgit asfgit deleted the ddavydov/why_isnt_my_task_running_view branch January 4, 2017 15:48