Skip to content

Conversation

@AutomationDev85
Copy link
Contributor

@AutomationDev85 AutomationDev85 commented Apr 18, 2023

Hi airflow community,
this is my third PR and be happy to work on the scheduler runtime again. We faced an issue with slow scheduler execution time by having millions of queued dag_runs for one DAG.

Our DAG which runs millions of task is triggered external and has no interval or timetable. We think that DAGs which are running with interval or timetable will not create huge amount of dag_runs in queued state. But the idea is to improve the performance for the dag_runs for DAGs which have no interval or timetable by skipping the execution of calc_num_active_runs and _should_update_dag_next_dagruns if there is not timetable or interval. This also helped us to improve the scheduler execution time.

So I´m still new to Airflow coding and try to get thinks right. I hope it is possible to understand the idea behind the improvement. I´m open and looking for a nicer code to check for this, maybe you have nicer code solution to check for a DAG which has timetable or interval.

@vandonr-amz fyi, as discussed with @jens-scheffler-bosch

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Apr 18, 2023
Copy link
Contributor

@vandonr-amz vandonr-amz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the idea and code look good to me, but it's really a part of the code I don't know well, so idk if there are deeper implications.

@potiuk
Copy link
Member

potiuk commented Apr 22, 2023

Why do you have dags with no interval nor timetable @AutomationDev85 ?

@jedcunningham
Copy link
Member

Why do you have dags with no interval nor timetable @AutomationDev85 ?

The DAGs are triggered externally. Maybe via api or the trigger operator.

High level this makes sense, though a nit pick that it feels like the short circuit should happen in _should_update_dag_next_dagruns. You'd need to do a little refactoring, move getting the active dagrun count, to allow skipping that, but I think that'd be cleaner.

Can you also add a test for this too? Thanks.

@potiuk
Copy link
Member

potiuk commented Apr 23, 2023

Why do you have dags with no interval nor timetable @AutomationDev85 ?

The DAGs are triggered externally. Maybe via api or the trigger operator.

Ah yeah. That makes sense ... I re read "Our DAG which trigger million of tasks" again. That was hyperbole (I hope) but I was under the impression there are many "runs" of the same DAG not a single DAG that produces huge number of tasks. That makes perfect sense now..

High level this makes sense, though a nit pick that it feels like the short circuit should happen in _should_update_dag_next_dagruns. You'd need to do a little refactoring, move getting the active dagrun count, to allow skipping that, but I think that'd be cleaner.
Can you also add a test for this too? Thanks.

Yep. that's better and test would be nice indeed.

@jscheffl
Copy link
Contributor

Why do you have dags with no interval nor timetable @AutomationDev85 ?

The DAGs are triggered externally. Maybe via api or the trigger operator.

Ah yeah. That makes sense ... I re read "Our DAG which trigger million of tasks" again. That was hyperbole (I hope) but I was under the impression there are many "runs" of the same DAG not a single DAG that produces huge number of tasks. That makes perfect sense now..

So yes, actually to be precise the DAG has schedule=None in the definition. But in this case still some kind of timetable with no information is added to the DAG instance.
Reason for the optimization was especially even if no schedule is defined, still the scheduler tries to look for every DAG completion if a follow-up timetable needs to be executed behind. So with this optimization just the follow-up scheduling attempt is skipped.

High level this makes sense, though a nit pick that it feels like the short circuit should happen in _should_update_dag_next_dagruns. You'd need to do a little refactoring, move getting the active dagrun count, to allow skipping that, but I think that'd be cleaner.
Can you also add a test for this too? Thanks.

Yep. that's better and test would be nice indeed.

I did not get this :-) but will talk with @AutomationDev85 tomorrow and hope he get's the hint :-D
What I remember from coding was that due to the introduction of timetables still a lot of complexity was in scheduling code from new timetable as well as "legacy standard" scheduling and different objects exists in parallel. But we wanted to prevent a major rework on this.

@jedcunningham
Copy link
Member

Sounds good. If you need more (or better) hints, don't be shy. And as another free hint, feel free to change the signature of _should_update_dag_next_dagruns, it doesn't need to be backwards compatible.

Copy link
Contributor Author

@AutomationDev85 AutomationDev85 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the code. Thanks for your feedback @jedcunningham . Hope I got your idea right!

Comment on lines 1311 to 1294
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple issues and questions I’m going to be lazy and merge

  1. The schedule_interval is simply a string describing the timetable, checking against it is not meaningful.
  2. The argument is very weird, especially with the -1 default (and why is that needed really?)
  3. How would this work if a user subclass NullTimetable?
  4. A flag on the timetable class would be preferred over isinstance. See how things like periodic are interfaced.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your input!

  1. Any idea what is the best case to check if a dag is only triggered and not scheduled? I was not finding a nicer way to check for this.
  2. I wanted to catch the use case of the function _create_dag_runs were the number of active dags is increase outside of the function.
  3. Why should a user subclass NullTimetable. I did not get why NullTimetable exists and not setting this to None. Can you help me to understand this? i though NullTimeTable is used if no timetable exists.
  4. Will check this out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Do you mean if a DAG can only be manually triggered? That would be dag.timetable.can_run.
  2. Not sure I’m getting what you mean on this one.
  3. Using None would make other parts of the code a lot more complicated because everything everywhere needs to check for None. This is called the null object pattern.

Copy link
Contributor Author

@AutomationDev85 AutomationDev85 Apr 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Nice that is what I was searching for !! Adapted the code accordingly. But during preparation of next PR if found that also the OnceTimetable has also the Flag set to False. So only the NullTimetable should be skipped here so I will switch back to check for NullTimetable check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a new attribute that works for this. Checking for class identity should generally be avoided for polymorphism, which flags like can_run provide.

Copy link
Contributor

@jscheffl jscheffl May 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, thanks for the feedback. I am thinking about how to apply the flag w/o adding more complexity in attributes. Do you think rather (1) we should add a new attribute just for this use case to the Timetable class? I was thinking but can not imagine a good self-descriptive name other than can run. So (2) would it be probably meaningful to change the default of can_runonly on NullTimetableand OnceTimetable? Because actually when taking a look to the code it is rather mis-placed in ContinuousTimetable - but I am not sure what other side effects might be?

Mhm, especially when looking at airflow/models/dag.py:3125 I feel like can_run is not correct in validation for ContinuousTimetable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mhm, the more I think of it I assume can_run is not the right word, If I take a look to the code it is rather matching to the meaning of is_scheduled? Because otherwise would mean like only_manual_triggered?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe schedulable or can_be_scheduled might make sense? Or (if schedule is ambiguous) only_manually_triggered can work as well. The flag is internal to Airflow (not documented for end users to use) so anything that’s sufficiently descriptive should be OK.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @uranusjr Thanks (also here) for the feedback - tried to apply the change and renamed can_run to can_be_scheduled and checked logic where it is used and re-factored code pieced.
To have an effect of the flag, DatasetTriggeredTimetable does not inherit from NullTimetable.

PR is now way more complex than before, I hope it is not shooting in the wrong direction (but at least is not using isinstance() anymore :-D

@potiuk
Copy link
Member

potiuk commented Apr 29, 2023

LGTM. @uranusjr ?

Comment on lines 125 to 144
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We’ll probably need to add some kind of compatibility layer for this, since can_run is technically public interface and may be implemented by an existing timetable. Something that emits a deprecation warning if can_run is implemented on a timetable but can_be_scheduled is not (and forward can_run to can_be_scheduled).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a deprecation handling for the field - is this like you want to have it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the other way around is needed—since it’s Airflow that accesses the value, and the user implements the timetable, we need to detect when can_be_scheduled is accessed, and emit a warning when can_run is defined to a different value.

It’d also probably be best to not use __getattribute__ since the function would be called on every attribute access and slow down timetable access.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reworked the part with the deprecation warning and use now a property to return a warning if can_run is used in a timetable instead of can_be_scheduled.

@AutomationDev85 AutomationDev85 force-pushed the feature/optimize-scheduler-run-time-3 branch from c1a73cd to 22d0b1a Compare May 25, 2023 12:24
@uranusjr uranusjr merged commit ec18db1 into apache:main May 30, 2023
@eladkal eladkal added this to the Airflow 2.6.2 milestone Jun 8, 2023
@eladkal eladkal added the type:bug-fix Changelog: Bug Fixes label Jun 8, 2023
eladkal pushed a commit that referenced this pull request Jun 8, 2023
Co-authored-by: Jens Scheffler <jens.scheffler@de.bosch.com>
(cherry picked from commit ec18db1)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Scheduler including HA (high availability) scheduler type:bug-fix Changelog: Bug Fixes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants