Skip to content

Conversation

@kzosabe
Copy link

@kzosabe kzosabe commented Oct 4, 2023

closes: #34493
conversations in slack: https://apache-airflow.slack.com/archives/CCPRP7943/p1696235301762429

draft PR for the reviewer's second suggestion in the original PR (#34589)

As mentioned in the issue, there may be a risk in changing the behavior of start_date itself, so I took the approach of to not call this function if the start date is None.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Oct 4, 2023
@boring-cyborg
Copy link

boring-cyborg bot commented Oct 4, 2023

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@kzosabe
Copy link
Author

kzosabe commented Oct 4, 2023

About this advice in the slack conversation:

we used it also in the backfill job, you need to do the same thing.

As the description of the previous PR, there are two points where TI transitions to QUEUED.
As you pointed out, ti is indeed transitioning to QUEUED at backfill_job_runner.py: 533 too

However, ti is not directly triggered here and emit_state_change_metric is not called too at this time, it seemed unnecessary to change behavior at this point.
I think the problem is only in the behavior of this function. And by grep the entire repository with emit_state_change_metric, there was only one place that could be called with the argument queued. So I think currently there should be no other point to change, is this correct?

@uranusjr
Copy link
Member

uranusjr commented Oct 5, 2023

In the other place where emit_state_change_metric is called, start_date is simply set prior in the code path. The path in SchedulerJobRunner should do the same—set start_date to the current time if it is None.

@kzosabe
Copy link
Author

kzosabe commented Oct 5, 2023

@uranusjr Thanks for pointing out, so there may be no problem to set start_date at this code path?
When I checked this problem, almost all scheduled task execution in Airflow 2.7.0+ had this false warning.
It means this change will affect start-date-related behaviour all of them.
But if from commiter's point of view it will be ok, of cource I'll do so and it will be great for this metric too.
I'll try to add a commit in this approach in the first.

@kzosabe
Copy link
Author

kzosabe commented Oct 5, 2023

Note that from the user's point of view, emitting warning for each task execution may generate heavy amount of system logs especially in environment has large amout of dags or frequently scheduled dags.
So if I can say my hope, it will be good to remove quickly wrong warning in the first e.g. merge @hussein-awala 's PR, next to apply this approach. Can we do this...? If this does not seem to fit the style of this repository, that is of course fine.

@kzosabe
Copy link
Author

kzosabe commented Oct 6, 2023

As I understand this change (errors may be included, so please point them out):

  • prior to this PR, ti.start_date was typically set during the RUNNING phase in most cases
    • the backfill is one of the exceptions, and there could be other cases as well
  • after this PR, ti.start_date will be set during the QUEUED phase in most cases
  • the time duration between QUEUED and RUNNING may be long in some cases, such as all workers are busy

Therefore there are several considerations regarding the change in behavior for ti.start_date:

  • some sql queries currently use ti.start_date
    • the query results may change, for example result ordering, calucurated task duration, and so on
  • some users who have been considering start_date as the started running time of task may be confused by this change

I'm also concerned about whether we should treat the ti.start_date as queued_date or started_running_date .
What are your thoughts on this within the community?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that this change is not persisted due to make_transient applied afterward,
but I want to know whether we should modify and persist the start_date value here or not before implementing fix

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After re-investigation, I think this approach is not good because ti.start_date currently seems to be used as started_running_date, and this change (set queued time into ti.start_date) may cause breaking changes in many functions of this OSS.
If we want to make this logging correct, we have to create a new column in the TaskInstance class such as ti.scheduled_dttm, but this will be a long journey.
This is why I'm consistently want to stop the wrong warning first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i don't think it makes sense to set the start_date when the task hasn't actually started. the problem seems to be elsewhere.

@potiuk potiuk force-pushed the skip_metrics_if_no_start_date branch from 1f3b85b to 6e7379f Compare October 15, 2023 21:00
@eladkal
Copy link
Contributor

eladkal commented Oct 27, 2023

Is this PR ready for review @kzosabe ?

@kzosabe
Copy link
Author

kzosabe commented Oct 29, 2023

Yes, but I would like some advice regarding the points I mentioned above. @eladkal @uranusjr

@kzosabe kzosabe marked this pull request as ready for review October 29, 2023 04:27
@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Dec 14, 2023
@github-actions github-actions bot closed this Dec 19, 2023
@HadarSha
Copy link
Contributor

@kzosabe @eladkal @uranusjr @kaxil @ashb @XD-DENG We are still seeing this issue in Airflow 2.8.0 - is there any chance of merging this fix or otherwise alleviating this issue?

@kzosabe
Copy link
Author

kzosabe commented Dec 19, 2023

I would like that too and am ready to do necessary actions in merging this PR. However, I have yet to receive a response.
Can we merge #34589 or make this pull request equivalent? If so, I will make the change asap. This should solve the problem in a simple and harmless way.
If that is not allowed, I would like to know the community's policy for resolving the issues raised above, and I'll do best to achieve that.

@hussein-awala hussein-awala reopened this Dec 19, 2023
@hussein-awala hussein-awala removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Dec 19, 2023
@hussein-awala hussein-awala self-requested a review December 19, 2023 10:27
@hussein-awala
Copy link
Member

I would like that too and am ready to do necessary actions in merging this PR. However, I have yet to receive a response. Can we merge #34589 or make this pull request equivalent? If so, I will make the change asap. This should solve the problem in a simple and harmless way. If that is not allowed, I would like to know the community's policy for resolving the issues raised above, and I'll do best to achieve that.

I will review it tonight and help to merge it.

@kzosabe kzosabe force-pushed the skip_metrics_if_no_start_date branch from 6e7379f to b18c0b4 Compare December 20, 2023 11:36
@HadarSha
Copy link
Contributor

Any chance of remedy regarding these logs? we are unable to assist @kzosabe with the points he had brought up, and this issue has been preventing us from upgrading Airflow for months now :(

It seems that the sheer volume of the logs itself (we have almost 100k tasks daily) being constantly printed in mass, kills the schedulers and causes a significant memory overhead we cannot account for in the new versions.

Perhaps there is a way to just suppress these specific logs somehow without fixing the underlying issue/implementation, even as a temporary flag or override? @eladkal @hussein-awala @uranusjr

Thanks a lot!

@uranusjr
Copy link
Member

uranusjr commented Jan 22, 2024

  • after this PR, ti.start_date will be set during the QUEUED phase in most cases
  • the time duration between QUEUED and RUNNING may be long in some cases, such as all workers are busy

I also have the same impression reading the changes again directly. I wonder why emit_state_change_metric would check for start_date when the task instance is changed to QUEUED. That seems unintuitive for the same reason. The warning message read:

cannot record %s for task %s because previous state change time has not been saved

(I added the emphasis)

start_date is set when the task starts running, but when you change the state to QUEUED, the task has not started running. Why does start_date being unset indicates a previous state change? The entire situation seems more awkward than I initially anticipated.

The warning was added in #30612 by @vandonr-amz so it may be best to get some context from the source.

@gilt-cl
Copy link

gilt-cl commented Mar 14, 2024

@kzosabe @hussein-awala @uranusjr
Is there any progress regarding this issue?

@kzosabe
Copy link
Author

kzosabe commented Mar 15, 2024

@gil-tober I'm ready to take the necessary actions but have not yet received a response.

To simplify the problem, I re-submitted the PR to remove this incorrect warning.
This is almost the same as #34589 but this approach is at least harmless and certainly solves the problem.

Remove incorrect warning about scheduled_duration metric #38180

A long time has passed. There are several users who are having problems with this issue. I hope someone will review this PR and approve it.
@hussein-awala @uranusjr @eladkal

@uranusjr
Copy link
Member

As mentioned above, it would be best to try to reach to original author to the code for clarification.

@kaxil kaxil added this to the Airflow 2.9.0 milestone Mar 27, 2024
Comment on lines +621 to +623
# ti.start_date could be None when the scheduler queue a TI
# or when the backfill CLI send a TI to the executor
# in this case set it at this line because emit_state_change_metric doesn't expect it
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When can start_date not be None? A TI being queued by the scheduler is arguably the most canonical way to run things, so arguably emit_state_change_metric should adapt to that possibility instead.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When can start_date not be None?

This call only occurs when the TI.state transitions from scheduled to queued, so normally start_date is None.
One exception that I and hussein-awala mentioned was backfill.

Currently, when set_state occurs, ti.start_date stores the value regardless of state.

ti.start_date = ti.start_date or current_time

In backfill, set_state is used to rewind the state to scheduled, resulting in a ti of scheduled where start_date is stored.

ti.set_state(TaskInstanceState.SCHEDULED)

This is an implementation error in set_state, as it fails to take into account the need to revert to the pre-running state.

Except for the bug mentioned above, the log was implemented incorrectly from the beginning, since in practice it is not inherently possible for start_date not to be None at this location.
As mentioned, it should not be possible to implement an equivalent log correctly unless a field like scheduled_dttm is implemented.

@kzosabe
Copy link
Author

kzosabe commented Mar 28, 2024

After a long time we were able to successfully merge #38180 and resolve this issue.
Thanks especially to @HadarSha and @gil-tober for pushing this to reviewers.

@uranusjr @hussein-awala
As mentioned in #34771 (comment) at least this PR's approach is wrong so we close this.
If you wish to address the root of the problem including adding ti.scheduled_dttm , please reopen #34493 or create a new issue. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Scheduler including HA (high availability) scheduler

Projects

None yet

Development

Successfully merging this pull request may close these issues.

emit_state_change_metric produces many warnings even in standard use cases

9 participants