Suppress false warning when TI state is QUEUED and TI doesn't have a start_date #34771
Conversation
|
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
|
|
About this advice in the slack conversation:
As described in the previous PR, there are two points where a TI transitions to QUEUED. However, the TI is not directly triggered here, and emit_state_change_metric is not called at this point either, so changing the behavior here seemed unnecessary. |
|
In the other place where |
|
@uranusjr Thanks for pointing that out. So would there be no problem with setting start_date in this code path? |
|
Note that, from the user's point of view, emitting a warning for each task execution may generate a heavy volume of system logs, especially in environments with a large number of DAGs or frequently scheduled DAGs. |
|
As I understand this change (errors may be included, so please point them out):
Therefore there are several considerations regarding the change in behavior for ti.start_date:
I'm also concerned about whether we should treat the |
airflow/jobs/scheduler_job_runner.py
Outdated
I noticed that this change is not persisted because make_transient is applied afterward,
but I want to know whether we should modify and persist the start_date value here before implementing a fix.
After re-investigation, I think this approach is not good, because ti.start_date currently seems to be used as the date the task started running, and this change (storing the queued time in ti.start_date) may cause breaking changes in many functions of this project.
If we want to make this logging correct, we have to add a new column to the TaskInstance class, such as ti.scheduled_dttm, but this will be a long journey.
This is why I have consistently wanted to stop the incorrect warning first.
yeah i don't think it makes sense to set the start_date when the task hasn't actually started. the problem seems to be elsewhere.
1f3b85b to
6e7379f
Compare
|
Is this PR ready for review @kzosabe ? |
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions. |
|
I would like that too and am ready to do necessary actions in merging this PR. However, I have yet to receive a response. |
I will review it tonight and help to merge it. |
6e7379f to
b18c0b4
Compare
|
Any chance of a remedy for these logs? We are unable to assist @kzosabe with the points he brought up, and this issue has been preventing us from upgrading Airflow for months now :( It seems that the sheer volume of these logs (we have almost 100k tasks daily), constantly printed en masse, kills the schedulers and causes a significant memory overhead that we cannot account for in the new versions. Perhaps there is a way to just suppress these specific logs without fixing the underlying issue/implementation, even as a temporary flag or override? @eladkal @hussein-awala @uranusjr Thanks a lot! |
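As an illustration of the "temporary override" idea raised above, here is a minimal sketch using only Python's standard logging module. It is not an Airflow feature or setting: the logger name and the message fragment below are placeholders, and the real logger and warning text would have to be looked up in the installed Airflow version.

```python
import logging


class DropMessageFilter(logging.Filter):
    """Drop any log record whose rendered message contains a given fragment."""

    def __init__(self, fragment: str) -> None:
        super().__init__()
        self.fragment = fragment

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False tells the logging machinery to discard the record.
        return self.fragment not in record.getMessage()


# Placeholder names: neither the logger name nor the fragment is taken from Airflow.
example_logger = logging.getLogger("example.scheduler")
example_logger.addFilter(DropMessageFilter("placeholder warning text"))
```

A filter attached to a logger only applies to records logged directly on that logger, not on its descendants, so it would need to be attached to the exact logger that emits the warning (or to a handler).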
I also have the same impression reading the changes again directly. I wonder why
(I added the emphasis)
The warning was added in #30612 by @vandonr-amz so it may be best to get some context from the source. |
|
@kzosabe @hussein-awala @uranusjr |
|
@gil-tober I'm ready to take the necessary actions but have not yet received a response. To simplify the problem, I submitted a new PR that only removes this incorrect warning: "Remove incorrect warning about scheduled_duration metric" (#38180). A long time has passed, and several users are having problems with this issue. I hope someone will review that PR and approve it. |
|
As mentioned above, it would be best to try to reach out to the original author of the code for clarification. |
# ti.start_date could be None when the scheduler queue a TI
# or when the backfill CLI send a TI to the executor
# in this case set it at this line because emit_state_change_metric doesn't expect it
When can start_date not be None? A TI being queued by the scheduler is arguably the most canonical way to run things, so arguably emit_state_change_metric should adapt to that possibility instead.
When can start_date not be None?
This call only occurs when the TI.state transitions from scheduled to queued, so normally start_date is None.
One exception, which hussein-awala and I mentioned, is backfill.
Currently, whenever set_state is called, ti.start_date is populated regardless of the target state.
airflow/airflow/models/taskinstance.py
Line 1885 in 5c7b3e9
| ti.start_date = ti.start_date or current_time |
In backfill, set_state is used to rewind the state to scheduled, resulting in a TI in the scheduled state whose start_date is already populated.
airflow/airflow/jobs/backfill_job_runner.py
Line 428 in 77341ef
| ti.set_state(TaskInstanceState.SCHEDULED) |
This is an implementation error in set_state, as it fails to take into account the need to revert to the pre-running state.
Except for the bug mentioned above, the log was implemented incorrectly from the beginning, since in practice it is not inherently possible for start_date not to be None at this location.
As mentioned, it should not be possible to implement an equivalent log correctly unless a field like scheduled_dttm is implemented.
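The set_state behavior described in this comment can be sketched with a toy model. FakeTaskInstance is a hypothetical stand-in, not Airflow's TaskInstance; only the `start_date or current_time` assignment mirrors the line quoted above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class FakeTaskInstance:
    """Hypothetical stand-in for TaskInstance; not Airflow's real class."""

    state: Optional[str] = None
    start_date: Optional[datetime] = None

    def set_state(self, state: str) -> None:
        current_time = datetime.now(timezone.utc)
        self.state = state
        # Mirrors the quoted line: start_date is filled in for any target state.
        self.start_date = self.start_date or current_time


ti = FakeTaskInstance()
ti.set_state("scheduled")  # analogous to the backfill runner rewinding a TI
print(ti.state, ti.start_date is not None)
```

In this toy model the instance ends up in the scheduled state with a start_date already set, which is the backfill exception to the "start_date is normally None" rule.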
|
After a long time we were able to successfully merge #38180 and resolve this issue. @uranusjr @hussein-awala |
closes: #34493
conversations in slack: https://apache-airflow.slack.com/archives/CCPRP7943/p1696235301762429
draft PR for the reviewer's second suggestion in the original PR (#34589)
As mentioned in the issue, there may be a risk in changing the behavior of start_date itself, so I took the approach of not calling this function when the start date is None.
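The guard described here could look roughly like the sketch below. The helper name, its signature, and the duration it computes are illustrative assumptions, not the actual change in this PR or the real emit_state_change_metric.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


def emit_if_started(
    start_date: Optional[datetime],
    emit: Callable[[float], None],
) -> bool:
    """Hypothetical helper: call emit only when start_date is known."""
    if start_date is None:
        # Nothing to measure against, so skip silently instead of warning.
        return False
    elapsed = (datetime.now(timezone.utc) - start_date).total_seconds()
    emit(elapsed)
    return True


recorded: list = []
emit_if_started(None, recorded.append)  # skipped, nothing recorded
emit_if_started(datetime.now(timezone.utc) - timedelta(seconds=5), recorded.append)
```

The point of the design is that a missing start_date is treated as an expected condition for a queued TI rather than as something worth a warning per task.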