@perry2of5
Contributor

@perry2of5 perry2of5 commented Mar 30, 2025

This handles overflow when calculating the next execution time for a task instance by falling back to the configured maximum delay. The solution uses the same strategy that tenacity uses:
https://github.com/jd/tenacity/blob/main/tenacity/wait.py#L167
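A minimal sketch of that fallback strategy, for illustration only. The helper name `capped_backoff_seconds` and the constant `DEFAULT_MAX_DELAY_SECONDS` are hypothetical and are not taken from this PR or from tenacity; the idea is simply that an `OverflowError` raised while computing the exponential term is treated as "larger than any cap".

```python
import math
from datetime import timedelta

# Hypothetical default cap in seconds (an illustrative value, not read from any config).
DEFAULT_MAX_DELAY_SECONDS = 24 * 60 * 60


def capped_backoff_seconds(
    retry_delay: timedelta,
    try_number: int,
    max_delay_seconds: int = DEFAULT_MAX_DELAY_SECONDS,
) -> int:
    """Return the exponential backoff in seconds, falling back to the cap on overflow."""
    try:
        # float * very large int, or ceil() of an infinite float, raises OverflowError.
        backoff = math.ceil(retry_delay.total_seconds() * (2 ** (try_number - 1)))
    except OverflowError:
        return max_delay_seconds
    return min(backoff, max_delay_seconds)
```

With a 5 minute base delay, `capped_backoff_seconds(timedelta(minutes=5), 3)` gives 1200 seconds, while a very large try number returns the cap instead of raising.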

An alternate solution would be to determine the maximum number of tries that wouldn't exceed the maximum delay and then skip the timeout calculation for try numbers larger than that.

Something like

max_delay = self.task.max_retry_delay if self.task.max_retry_delay is not None else MAX_RETRY_DELAY
tries_before_max_delay = math.floor(math.log2(max_delay))
if self.try_number <= tries_before_max_delay:
    ...  # existing logic
else:
    delay = max_delay

closes: #47971
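The alternate approach could be sketched in standalone form as follows. This is an assumption-laden illustration, not code from the PR: the function names are hypothetical, both delays are taken as `timedelta` values, and the threshold also accounts for the base retry delay (the backoff is `retry_delay * 2 ** (try_number - 1)`, so it stays under the cap while `try_number <= floor(log2(max_delay / retry_delay)) + 1`).

```python
import math
from datetime import timedelta


def tries_before_max_delay(retry_delay: timedelta, max_delay: timedelta) -> int:
    """Largest try number whose uncapped backoff does not exceed max_delay."""
    base = retry_delay.total_seconds()
    cap = max_delay.total_seconds()
    if base <= 0 or cap <= 0:
        raise ValueError("retry_delay and max_delay must be positive")
    return math.floor(math.log2(cap / base)) + 1


def backoff_without_overflow(
    retry_delay: timedelta, try_number: int, max_delay: timedelta
) -> timedelta:
    """Compute the exponential term only when it is known to be small."""
    if try_number <= tries_before_max_delay(retry_delay, max_delay):
        # min() guards against floating-point rounding in the threshold.
        return min(retry_delay * (2 ** (try_number - 1)), max_delay)
    return max_delay
```

The trade-off is that no exception handling is needed, at the cost of a logarithm and a precondition that both delays are positive.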

@perry2of5 perry2of5 requested review from XD-DENG and ashb as code owners March 30, 2025 22:25
@perry2of5 perry2of5 changed the title from "Use max delay to handle overflow in TaskInstance next_retry_datetime 47971" to "Use max delay to handle overflow in TaskInstance next_retry_datetime fixes #47971" Mar 30, 2025
@perry2of5 perry2of5 changed the title from "Use max delay to handle overflow in TaskInstance next_retry_datetime fixes #47971" to "Use max delay to handle overflow in TaskInstance next_retry_datetime fixes 47971" Mar 30, 2025
@perry2of5
Contributor Author

If you turn off whitespace changes, only 6 lines are added in the new version and none are removed. :)

@perry2of5 perry2of5 changed the title from "Use max delay to handle overflow in TaskInstance next_retry_datetime fixes 47971" to "handle overflow in TaskInstance next_retry_datetime fixes 47971" Apr 8, 2025
@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label May 24, 2025
@uranusjr
Member

Is it possible to localise the try-except block a bit more? Currently the block is a bit too large IMO.

@perry2of5
Contributor Author

The overflow error is likely to occur in two places:

On line 1120:
min_backoff = math.ceil(delay.total_seconds() * (2 ** (self.try_number - 1)))

And on line 1145 (see https://docs.python.org/3/library/datetime.html#timedelta-objects):
delay = timedelta(seconds=delay_backoff_in_seconds)

Keeping the try block as it currently is avoids having two try blocks, but possibly two try blocks would be more readable. I'll fiddle with it and try to improve. Thanks for the feedback.
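For reference, both failure modes can be reproduced outside Airflow with a small sketch. The helper names below are hypothetical; the expressions mirror the two lines quoted above, using plain arguments instead of task instance attributes.

```python
import math
from datetime import timedelta


def backoff_overflows(retry_delay: timedelta, try_number: int) -> bool:
    """True if the exponential backoff expression raises OverflowError."""
    try:
        math.ceil(retry_delay.total_seconds() * (2 ** (try_number - 1)))
    except OverflowError:
        return True
    return False


def timedelta_overflows(seconds: int) -> bool:
    """True if timedelta(seconds=...) is outside the representable range."""
    try:
        timedelta(seconds=seconds)
    except OverflowError:
        return True
    return False
```

A try number in the thousands triggers the first case, and a seconds value corresponding to more than 999999999 days triggers the second.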

@perry2of5
Contributor Author

perry2of5 commented May 26, 2025

Looking a little closer, we cap MAX_RETRY_DELAY at 24 * 60 * 60, so it probably can't throw at line 1145. I'll need to be sure that isn't configurable at run time, but as long as it is always less than 999999999 days, I can shorten the section in the try block. I'll try to get to it tonight.
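The reasoning can be checked with a short sketch, assuming the cap is a fixed number of seconds. The function `delay_from_seconds` is a hypothetical stand-in for the capping step, not the code in the PR: once the value is clamped to one day, constructing the `timedelta` cannot overflow, because `timedelta` accepts magnitudes up to 999999999 days.

```python
from datetime import timedelta

# Cap in seconds, using the value quoted in the comment above.
MAX_RETRY_DELAY = 24 * 60 * 60


def delay_from_seconds(
    delay_backoff_in_seconds: int, max_delay_seconds: int = MAX_RETRY_DELAY
) -> timedelta:
    """Clamp the backoff to the cap before building the timedelta."""
    capped = min(delay_backoff_in_seconds, max_delay_seconds)
    return timedelta(seconds=capped)
```

Under that assumption only the exponential calculation needs to sit inside the try block.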

@github-actions github-actions bot removed the stale Stale PRs per the .github/workflows/stale.yml policy file label May 27, 2025
@perry2of5
Contributor Author

> Is it possible to localise the try-except block a bit more? Currently the block is a bit too large IMO.

Done. Thanks for the feedback, it is much cleaner now.

@eladkal eladkal requested review from kaxil and uranusjr July 22, 2025 17:54
@eladkal eladkal added this to the Airflow 3.0.4 milestone Jul 22, 2025
@eladkal eladkal added the type:bug-fix Changelog: Bug Fixes label Jul 22, 2025
@perry2of5
Contributor Author

FWIW, I think we should drop the log when overflow is caught since we don't log other reasons for capping the delay.

@amoghrajesh
Contributor

@uranusjr could you take a look at this PR again when you have some time?

@eladkal eladkal modified the milestones: Airflow 3.0.4, Airflow 3.0.5 Aug 8, 2025
@kaxil kaxil added the backport-to-v3-1-test Mark PR with this label to backport to v3-1-test branch label Aug 13, 2025
@kaxil kaxil merged commit 33658f0 into apache:main Aug 13, 2025
58 checks passed
github-actions bot pushed a commit that referenced this pull request Aug 13, 2025
…8557)

(cherry picked from commit 33658f0)

Co-authored-by: perry2of5 <perry2of5@yahoo.com>
closes: #47971
@github-actions

Backport successfully created: v3-0-test

Status Branch Result
v3-0-test PR Link

github-actions bot pushed a commit to aws-mwaa/upstream-to-airflow that referenced this pull request Aug 13, 2025
…ache#48557)

(cherry picked from commit 33658f0)

Co-authored-by: perry2of5 <perry2of5@yahoo.com>
closes: apache#47971
kaxil pushed a commit that referenced this pull request Aug 13, 2025
…8557) (#54460)

(cherry picked from commit 33658f0)


closes: #47971

Co-authored-by: perry2of5 <perry2of5@yahoo.com>
@perry2of5 perry2of5 deleted the task-next-retry-datetime-fix-overflow-47971 branch August 18, 2025 17:22

Labels

backport-to-v3-1-test Mark PR with this label to backport to v3-1-test branch type:bug-fix Changelog: Bug Fixes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Retry exponential backoff max float overflow

5 participants