Save scheduler execution time by checking if DAG has interval or timetable #30706
vandonr-amz
left a comment
the idea and code look good to me, but it's really a part of the code I don't know well, so idk if there are deeper implications.
Why do you have DAGs with neither an interval nor a timetable, @AutomationDev85?
The DAGs are triggered externally, maybe via the API or the trigger operator. At a high level this makes sense, though as a nitpick it feels like the short circuit should happen in Can you also add a test for this too? Thanks.
Ah yeah. That makes sense ... I re-read "Our DAG which trigger million of tasks" again. That was hyperbole (I hope), but I was under the impression there were many "runs" of the same DAG, not a single DAG that produces a huge number of tasks. That makes perfect sense now.
Yep, that's better, and a test would be nice indeed.
So yes, actually to be precise the DAG has
I did not get this :-) but will talk with @AutomationDev85 tomorrow and hope he gets the hint :-D
Sounds good. If you need more (or better) hints, don't be shy. And as another free hint, feel free to change the signature of
AutomationDev85
left a comment
I updated the code. Thanks for your feedback @jedcunningham . Hope I got your idea right!
airflow/jobs/scheduler_job_runner.py
Outdated
Multiple issues and questions I’m going to be lazy and merge
- The schedule_interval is simply a string describing the timetable, checking against it is not meaningful.
- The argument is very weird, especially with the -1 default (and why is that needed really?)
- How would this work if a user subclasses NullTimetable?
- A flag on the timetable class would be preferred over isinstance. See how things like periodic are interfaced.
Thanks for your input!
- Any idea what the best way is to check whether a DAG is only triggered and not scheduled? I could not find a nicer way to check for this.
- I wanted to catch the use case of the function _create_dag_runs where the number of active DAGs is increased outside of the function.
- Why should a user subclass NullTimetable? I did not get why NullTimetable exists instead of setting this to None. Can you help me understand this? I thought NullTimetable is used if no timetable exists.
- Will check this out.
- Do you mean if a DAG can only be manually triggered? That would be dag.timetable.can_run.
- Not sure I’m getting what you mean on this one.
- Using None would make other parts of the code a lot more complicated because everything everywhere needs to check for None. This is called the null object pattern.
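To illustrate the null object pattern mentioned above, here is a minimal sketch. The classes are hypothetical stand-ins, not Airflow's real timetable classes, and the method name next_run_after is an assumption made up for this example.

```python
from __future__ import annotations

from datetime import datetime, timedelta


class Timetable:
    """Hypothetical minimal interface; not Airflow's real Timetable class."""

    def next_run_after(self, last: datetime | None) -> datetime | None:
        raise NotImplementedError


class NullTimetable(Timetable):
    """Null object: a valid timetable that never schedules anything."""

    def next_run_after(self, last: datetime | None) -> datetime | None:
        return None


class DailyTimetable(Timetable):
    """Schedules one run per day after the previous one."""

    def next_run_after(self, last: datetime | None) -> datetime | None:
        start = last or datetime(2023, 1, 1)
        return start + timedelta(days=1)


def describe_next_run(timetable: Timetable, last: datetime | None) -> str:
    # No "if timetable is None" branch: every timetable answers the call.
    next_run = timetable.next_run_after(last)
    return "never" if next_run is None else next_run.isoformat()
```

Callers such as describe_next_run can treat a DAG without a schedule like any other DAG, which is the simplification the comment refers to.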
- Nice, that is what I was searching for! I adapted the code accordingly. But while preparing the next PR I found that OnceTimetable also has the flag set to False. Only NullTimetable should be skipped here, so I will switch back to checking for NullTimetable.
Add a new attribute that works for this. Checking for class identity should generally be avoided for polymorphism, which flags like can_run provide.
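A small sketch of why a class-level flag tends to hold up better than a class-identity check. All class names and the manual_only flag are hypothetical and chosen only for this illustration; they are not Airflow's API.

```python
class Timetable:
    """Hypothetical base class, not Airflow's real one."""

    # Hypothetical flag: True means the scheduler never creates runs itself.
    manual_only: bool = False


class NullTimetable(Timetable):
    manual_only = True


class PartnerTriggeredTimetable(Timetable):
    """Custom manual-only timetable that shares no base with NullTimetable."""

    manual_only = True


class HourlyTimetable(Timetable):
    """Inherits the default: the scheduler should plan runs for it."""


def skip_by_type(timetable: Timetable) -> bool:
    # Class-identity check: only knows NullTimetable and its subclasses.
    return isinstance(timetable, NullTimetable)


def skip_by_flag(timetable: Timetable) -> bool:
    # Flag check: every timetable class answers for itself.
    return timetable.manual_only


for tt in (NullTimetable(), PartnerTriggeredTimetable(), HourlyTimetable()):
    print(type(tt).__name__, skip_by_type(tt), skip_by_flag(tt))
```

In this sketch the isinstance check misses PartnerTriggeredTimetable, while the flag check handles it without the scheduler knowing the class.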
Okay, thanks for the feedback. I am thinking about how to apply the flag without adding more complexity in attributes. Do you think (1) we should add a new attribute to the Timetable class just for this use case? I thought about it but cannot come up with a good self-descriptive name other than can_run. Or (2) would it be meaningful to change the default of can_run only on NullTimetable and OnceTimetable? Looking at the code, it seems rather misplaced in ContinuousTimetable, but I am not sure what other side effects there might be.
Hmm, especially when looking at airflow/models/dag.py:3125, I feel like can_run is not correct in the validation for ContinuousTimetable.
Hmm, the more I think about it, the more I assume can_run is not the right name. Looking at the code, it rather matches the meaning of is_scheduled? Because otherwise it would mean something like only_manual_triggered?
Maybe schedulable or can_be_scheduled might make sense? Or (if schedule is ambiguous) only_manually_triggered can work as well. The flag is internal to Airflow (not documented for end users to use) so anything that’s sufficiently descriptive should be OK.
Hi @uranusjr, thanks (also here) for the feedback. I tried to apply the change: renamed can_run to can_be_scheduled, checked the logic where it is used, and refactored code pieces.
For the flag to have an effect, DatasetTriggeredTimetable no longer inherits from NullTimetable.
The PR is now way more complex than before; I hope it is not heading in the wrong direction (but at least it is not using isinstance() anymore :-D).
LGTM. @uranusjr ?
airflow/timetables/base.py
Outdated
We’ll probably need to add some kind of compatibility layer for this, since can_run is technically public interface and may be implemented by an existing timetable. Something that emits a deprecation warning if can_run is implemented on a timetable but can_be_scheduled is not (and forward can_run to can_be_scheduled).
Added deprecation handling for the field. Is this how you want it?
I think the other way around is needed—since it’s Airflow that accesses the value, and the user implements the timetable, we need to detect when can_be_scheduled is accessed, and emit a warning when can_run is defined to a different value.
It’d also probably be best to not use __getattribute__ since the function would be called on every attribute access and slow down timetable access.
I reworked the part with the deprecation warning and now use a property that emits a warning if a timetable defines can_run instead of can_be_scheduled.
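One way the property-based compatibility layer could look, as a sketch under assumptions: the private _can_be_scheduled default and the class names below are illustrative and not confirmed by this thread. Unlike a __getattribute__ hook, the property only runs when can_be_scheduled itself is read.

```python
import warnings


class Timetable:
    """Hypothetical base class; only the two flags discussed here are modelled."""

    # Assumed private default, used when a subclass defines neither flag.
    _can_be_scheduled: bool = True

    @property
    def can_be_scheduled(self) -> bool:
        # A legacy timetable may still define can_run as a class attribute.
        if hasattr(self, "can_run"):
            warnings.warn(
                "can_run is deprecated; define can_be_scheduled instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            return self.can_run
        return self._can_be_scheduled


class LegacyTimetable(Timetable):
    """Written against the old interface."""

    can_run = False


class ModernTimetable(Timetable):
    """A plain class attribute in a subclass overrides the base property."""

    can_be_scheduled = False
```

The caller always reads can_be_scheduled; only timetables that still define can_run trigger the warning, and their value is forwarded.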
…d, change inheritance of DatasetTriggeredTimetable
c1a73cd to
22d0b1a
Compare
Co-authored-by: Jens Scheffler <jens.scheffler@de.bosch.com> (cherry picked from commit ec18db1)
Hi airflow community,
this is my third PR, and I am happy to work on the scheduler runtime again. We faced slow scheduler execution caused by millions of queued dag_runs for one DAG.
Our DAG, which runs millions of tasks, is triggered externally and has no interval or timetable. We think that DAGs which run on an interval or timetable will not create a huge number of dag_runs in the queued state. The idea is to improve performance for the dag_runs of DAGs which have no interval or timetable by skipping the execution of calc_num_active_runs and _should_update_dag_next_dagruns if there is no timetable or interval. This also helped us improve the scheduler execution time.
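The short circuit described here can be sketched roughly as follows. FakeDag, FakeScheduler, and all their members are hypothetical stand-ins for illustration only; they do not reproduce the real scheduler code or the signatures of calc_num_active_runs and _should_update_dag_next_dagruns.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDag:
    """Hypothetical DAG; has_schedule mimics 'has an interval or timetable'."""

    dag_id: str
    has_schedule: bool


@dataclass
class FakeScheduler:
    """Toy scheduler recording which DAGs paid for the expensive bookkeeping."""

    counted: list = field(default_factory=list)

    def count_active_runs(self, dag: FakeDag) -> int:
        # Stand-in for an expensive query over all of the DAG's runs.
        self.counted.append(dag.dag_id)
        return 0

    def update_next_runs(self, dags: list) -> list:
        updated = []
        for dag in dags:
            if not dag.has_schedule:
                # Externally triggered only: there is no next run to compute.
                continue
            self.count_active_runs(dag)
            updated.append(dag.dag_id)
        return updated
```

With one externally triggered DAG and one scheduled DAG, only the scheduled one reaches count_active_runs, which is where the time saving would come from.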
I'm still new to Airflow coding and trying to get things right. I hope the idea behind the improvement is understandable. I'm open to a nicer way to check for this; maybe you have a nicer code solution to check whether a DAG has a timetable or interval.
@vandonr-amz fyi, as discussed with @jens-scheffler-bosch