First draft of blocked task dependency explainer#1435
Conversation
I'm really excited about this 👍
Force-pushed from 3a8a05a to 64ec80b.
from sqlalchemy.ext.declarative import declarative_base, declared_attr
from sqlalchemy.dialects.mysql import LONGTEXT
- from sqlalchemy.orm import relationship, synonym
+ from sqlalchemy.orm import reconstructor, relationship, synonym
First time I see @reconstructor; SQLAlchemy is one of those bottomless libs that keeps on giving.
Agreed, it's pretty cool! I think at some point it would be nice to always have a "full" state of TaskInstance if possible, rather than having half of the properties filled in from the DB but missing the rest (like task), and vice versa when the task instance is constructed using the constructor. Maybe this could be done by lazy-loading properties, e.g. once a DB column is referenced, call refresh_from_db automatically.
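For readers who haven't seen it either, here is a minimal sketch of how SQLAlchemy's @reconstructor hook works (the model below is a toy example, not the actual Airflow code): SQLAlchemy skips __init__ when it loads a row from the DB and calls the decorated method instead, so shared setup can run on both construction paths.

```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base, reconstructor

Base = declarative_base()

class TaskInstance(Base):
    __tablename__ = "task_instance"
    id = Column(Integer, primary_key=True)
    task_id = Column(String(250))

    def __init__(self, task_id):
        self.task_id = task_id
        self.init_on_load()  # share the setup with the DB-load path

    @reconstructor
    def init_on_load(self):
        # SQLAlchemy calls this automatically after loading the instance
        # from the database (where __init__ is skipped), so derived,
        # non-column state gets set on both paths.
        self.test_mode = False
```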
return all(status.passed for status in
    cls.get_dep_statuses(
        ti,
        session,
You can look into the @utils.provide_session decorator, which allows you to pass in an active session; if you don't, it instantiates one and takes care of closing it at the end.
Noted. I want to save you time, though, so please don't go too deep with this review (I had this issue on my TODO list already, along with the improper use of session.commit in are_dependencies_met). Just reviewing the items in the "Things to review in this PR" section of the PR description would be more respectful of your time, although I'm certainly not complaining about a more granular review, since you could catch something I didn't already notice or wouldn't have.
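For context, the snippet being reviewed uses a dependency-status pattern that can be sketched as follows (class and field names here are illustrative, not the PR's exact API): each dependency class yields one status per check, and the task instance is runnable only if every status passed.

```python
from collections import namedtuple

# One record per dependency check; `reason` explains a failure.
TIDepStatus = namedtuple("TIDepStatus", ["dep_name", "passed", "reason"])

class BaseTIDep:
    def _get_dep_statuses(self, ti, session):
        raise NotImplementedError

    def get_dep_statuses(self, ti, session):
        yield from self._get_dep_statuses(ti, session)

    def is_met(self, ti, session):
        # The dependency is met only if every emitted status passed,
        # mirroring the all(status.passed for ...) call in the snippet.
        return all(s.passed for s in self.get_dep_statuses(ti, session))

class NotAlreadyRunningDep(BaseTIDep):
    """Toy dependency: the task instance must not already be running."""
    def _get_dep_statuses(self, ti, session):
        running = ti.get("state") == "running"
        yield TIDepStatus(
            "not_running",
            not running,
            "task instance is already running" if running else "",
        )
```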
Notes:
I haven't gotten to the bottom of this thought yet, but I feel like trying to sort out the other args of
Many thanks for the in-depth review!
Is the scenario you are worried about (two workers running the same task instance) already possible today? For example: a worker's communication with the DB gets interrupted, the scheduler assigns the task instance to a new worker, and then the communication between the initial worker and the DB resumes.
This makes sense. I misspoke in the PR description, though: SLA emails should still be sent. The difference is that the SLA email would now omit task instances in the dagrun that didn't succeed for reasons other than depends_on_past not being met (e.g. a task that couldn't run because its pool was full wouldn't get reported in the email). I think I'm going to just include all tasks that don't have a successful status in the SLA miss email, even those stuck on depends_on_past, to align with your criteria (if the task caused core_data to not be delivered by 9 AM, the task caused the DAG to miss its SLA regardless of its depends_on_past dependency). Plus, it stops treating depends_on_past differently from the other dependencies, like the pool being full. LMK what you think.
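The inclusion rule being proposed reduces to a one-line filter; the sketch below uses a hypothetical dict-based task-instance representation just to make the rule concrete: anything in the run without a successful state appears in the SLA miss email, no matter which dependency blocked it.

```python
def tis_for_sla_email(task_instances):
    # Include every task instance that is not successful, whether it was
    # blocked by depends_on_past, a full pool, or anything else.
    return [ti for ti in task_instances if ti["state"] != "success"]
```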
Agreed about the efficiency, was going to look into caching if this causes perf issues.
The newfound power of the force flag could be used instead of ignore_depends_on_past, but making force the default for every backfill could be a bit dangerous: users could, e.g., unintentionally force-run over a large range of already-successful tasks in a backfill, or violate a pool constraint. If you have any ideas, let me know.
Agreed about not passing in a bunch of different flags. There is actually a TODO above that part of the code in the PR to use a context parameter instead (it will be addressed in this PR). For the upstream_failed flag, I would prefer to leave the fix for another PR, since it was an existing hack and the scope of this PR is already a bit dangerously large.
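One possible shape for that context parameter is a small dataclass, sketched below; the flag names are assumptions for illustration, not the PR's final API. A single context object replaces a growing list of positional booleans at every call site.

```python
from dataclasses import dataclass

@dataclass
class DepContext:
    # Hypothetical flags bundled into one object instead of being passed
    # individually to every dependency-checking function.
    ignore_depends_on_past: bool = False
    ignore_task_deps: bool = False
    flag_upstream_failed: bool = False

def should_ignore_depends_on_past(dep_context=None):
    # Illustrative helper: callers tweak behavior through the context
    # object; omitting it falls back to the defaults.
    ctx = dep_context or DepContext()
    return ctx.ignore_depends_on_past
```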

THIS IS A DRAFT WIP PULL REQUEST
This is a draft for a solution to #1383 among several other things.
There are tons of things broken/missing here, I'm just putting this out there to get some initial eyes on the high-level changes.
Goals
e.g. (this will not be the final product):
Things to review in this PR
Things NOT to review in this PR
Some big changes made in this PR
This will break one use case:
Using pools to restrict some resource on the Airflow executors themselves (rather than an external resource like a DB). E.g. some task uses 60% of the CPU on a worker, so we restrict that task's pool size to 1 to prevent two of the tasks from running on the same host. With this change, a backfill of such a task will wait for the pool to have slots open up before running it, even though that isn't necessary when backfilling on a different host outside the pool. I think breaking this use case is OK, since it is a hack that works around the lack of a proper resource-isolation solution (e.g. Mesos should be used in this case instead).
I can revert these changes in behavior, but that would make the code messier, and I think it's a huge win to be as consistent as possible across the board, both for code simplicity and for user intuitiveness, whenever the question "should this task be run right now?" is asked.
Future Work (will not be released in the first version)
@mistercrunch @jlowin @plypaul