Skip to content

Conversation

@Bowrna
Copy link
Contributor

@Bowrna Bowrna commented Jun 19, 2023

closes: #31810
relates: #31810


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Jun 19, 2023
@Bowrna Bowrna changed the title heartbeat log fix heartbeat logmessage fix Jun 19, 2023
@Bowrna
Copy link
Contributor Author

Bowrna commented Jun 19, 2023

Some of the suggestions provided in the issue by @hterik


A: Don't log the whole stack trace, instead just log this as an error message.
B: Include a descriptive message of the consequence of this error in the log.
C: When heartbeat recovers, log it.
D: If heartbeat was recovered after scheduler_health_check_threshold, log another more severe error message including descriptive message of the consequences.

the current changes cover A, B and D. I will figure out how I can find if the heartbeat is recovering and finish C part.

@Bowrna Bowrna force-pushed the heartbeat-log-message-fix branch from 0be788d to 33c0427 Compare July 5, 2023 13:05
@Bowrna Bowrna force-pushed the heartbeat-log-message-fix branch from 33c0427 to 98ae62a Compare July 19, 2023 04:13
@Bowrna Bowrna marked this pull request as ready for review July 19, 2023 04:45
@Bowrna Bowrna requested review from XD-DENG, ashb and kaxil as code owners July 19, 2023 04:45
@Bowrna
Copy link
Contributor Author

Bowrna commented Jul 19, 2023

@kaxil @ashb @XD-DENG Please review this PR and let me know about your views.

Copy link
Contributor

@hterik hterik Jun 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a user of airflow reading the logs of a dag i would not understand what this means to me. Is this something i have to react to? Do i need to contact my admins? Is the dag results corrupted? Should i restart the scheduler?

This error isn't necessarily a problem with the scheduler. More often it is a problem of the executor not being able to reach the database, due to transient network problems. As long as this error is transient and recovers shortly, the consequence of this is usually none. The log message should reflect this. If this is too much to fit into a log-message, linking to the architecture documentation at airflow.apache.org as suggested by potiuk above sounds like a good proposal.

@Bowrna Bowrna force-pushed the heartbeat-log-message-fix branch from 98ae62a to 53e0f12 Compare July 31, 2023 04:35
@Bowrna
Copy link
Contributor Author

Bowrna commented Jul 31, 2023

@potiuk In this PR, I am specifically working on the heartbeat failure when updating the table about the heartbeat timestamp. So its more related to db problems when executor wants to update the table. But with above comments, I think the scope for this issue is expanding. Is my understanding right @potiuk ?

@potiuk
Copy link
Member

potiuk commented Aug 6, 2023

@potiuk In this PR, I am specifically working on the heartbeat failure when updating the table about the heartbeat timestamp. So its more related to db problems when executor wants to update the table. But with above comments, I think the scope for this issue is expanding. Is my understanding right @potiuk ?

That's the proposal - but it's just suggestion, if you want to stay with small change - up to you.

@Bowrna
Copy link
Contributor Author

Bowrna commented Aug 7, 2023

@potiuk thanks for the update. Let me break the proposal into a few more tasks and work on it.

@Bowrna Bowrna force-pushed the heartbeat-log-message-fix branch from 53e0f12 to 283a9f2 Compare September 1, 2023 09:01
@potiuk potiuk force-pushed the heartbeat-log-message-fix branch from 283a9f2 to 827c8a2 Compare September 3, 2023 20:40
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# self.log.exception("%s heartbeat got an exception", self.__class__.__name__)

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM: with the small NIT of removing the commented out line

@Bowrna
Copy link
Contributor Author

Bowrna commented Sep 5, 2023

thanks @potiuk let me fix it

@Bowrna Bowrna closed this Sep 7, 2023
@Bowrna Bowrna force-pushed the heartbeat-log-message-fix branch from 827c8a2 to 22de77e Compare September 7, 2023 07:52
@Bowrna
Copy link
Contributor Author

Bowrna commented Sep 7, 2023

I wrongly did a git force push with main branch changes. Please excuse. I will clean up and commit the changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Scheduler including HA (high availability) scheduler

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve log messages of heartbeat connection errors and recovery

4 participants