
Conversation

@hussein-awala
Member

@hussein-awala hussein-awala commented Apr 16, 2023

closes: #30552
closes: #30572
related: #26993
related: #18080 (it may close it too, I'll check before merging)
closes: #15645


This PR aims to address two interrelated issues:

  • The first issue involves the TI start_date of rescheduled sensors, which is erroneously updated on each new poke.
  • The second issue is the confusion between the TI try number and the sensor poke number within the same TI try. Currently we use the try_number to compute the next_poke_interval, which in turn affects the new reschedule_date (see the sketch at the end of this description).

Although the two issues are closely linked, we may choose to divide them into separate pull requests based on the final implementation.

I will fix the unit tests, add new tests, and add the migration scripts for the new column(s) once everything works as expected in my test DAG.
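
For the second bullet above, here is a minimal sketch of the kind of calculation involved. The function and its parameters are illustrative, not Airflow's actual sensor API; it only shows why the choice of backoff counter matters:

import datetime

def next_poke_interval(base_interval_s: float, backoff_counter: int, max_wait_s: float = 3600.0) -> float:
    # Illustrative exponential backoff, not Airflow's real implementation.
    # If backoff_counter is the TI try_number (which, per the task_reschedule
    # rows quoted further down in this thread, stays the same across
    # reschedules of one try), every poke in that try gets the same interval;
    # if it is a per-try poke number, the interval grows with each poke.
    return min(base_interval_s * (2 ** backoff_counter), max_wait_s)

# Hypothetical usage: derive the next reschedule_date from the chosen counter.
now = datetime.datetime.now(datetime.timezone.utc)
reschedule_date = now + datetime.timedelta(seconds=next_poke_interval(60, backoff_counter=3))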

Member

@ashb ashb left a comment


Eeek, let's think very, very carefully before we repeat, with a new column, the same mistake we made with try number.

The fact that try number is mutated in place has been causing us problems for over 5 years. We really shouldn't ever mutate a row in place like this.

@hussein-awala
Member Author

I agree, but I will continue working on the PR, at least to identify all the problems, and I'm open to all suggestions; that's why I opened this as a draft PR.

The new column is added to solve the issue with the poke number, whose state is not stored between the different reschedules:

airflow=# select * from task_reschedule;
 id |      task_id      |          dag_id           |                   run_id                    | map_index | try_number | poke_number |          start_date           |           end_date            | duration |        reschedule_date        
----+-------------------+---------------------------+---------------------------------------------+-----------+------------+-------------+-------------------------------+-------------------------------+----------+-------------------------------
 81 | wait_for_external | example_sensor_reschedule | scheduled__2023-04-16T14:22:26.249199+00:00 |        -1 |          1 |           1 | 2023-04-16 14:24:32.003232+00 | 2023-04-16 14:24:32.061204+00 |        0 | 2023-04-16 14:24:42.056541+00
 83 | wait_for_external | example_sensor_reschedule | scheduled__2023-04-16T14:22:26.249199+00:00 |        -1 |          1 |           2 | 2023-04-16 14:27:58.576074+00 | 2023-04-16 14:27:58.641273+00 |        0 | 2023-04-16 14:28:08.637106+00
 85 | wait_for_external | example_sensor_reschedule | scheduled__2023-04-16T14:22:26.249199+00:00 |        -1 |          1 |           3 | 2023-04-16 14:28:09.70126+00  | 2023-04-16 14:28:09.847813+00 |        0 | 2023-04-16 14:28:20.112106+00

As far as I know, there are two ways to save state between the different executions: adding a new column to the metadata DB, or storing it in an XCom (not a good solution in our case).
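
For reference, a rough sketch of what the task_reschedule rows above imply as a schema. This is an illustrative SQLAlchemy model, not the actual Airflow model or migration; the column names are taken from the output above, and poke_number is the proposed addition:

from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class TaskRescheduleSketch(Base):
    # Illustrative stand-in for the task_reschedule table shown above;
    # not the real Airflow model.
    __tablename__ = "task_reschedule"

    id = Column(Integer, primary_key=True)
    task_id = Column(String(250), nullable=False)
    dag_id = Column(String(250), nullable=False)
    run_id = Column(String(250), nullable=False)
    map_index = Column(Integer, nullable=False, default=-1)
    try_number = Column(Integer, nullable=False)
    poke_number = Column(Integer, nullable=False, default=1)  # proposed: survives reschedules within a try
    start_date = Column(DateTime(timezone=True))
    end_date = Column(DateTime(timezone=True))
    duration = Column(Integer)
    reschedule_date = Column(DateTime(timezone=True))

The sensor could then read the latest poke_number for its (dag_id, task_id, run_id, map_index, try_number) at the start of each poke instead of deriving the poke count from try_number.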

@potiuk
Member

potiuk commented Apr 16, 2023

Yes. I agree with @ashb that the try_number mutation has been a bummer and has some non-obvious historical connotations, so it should be very, very carefully checked, especially in all the more exotic scenarios: retries on failure, backfills, manual runs, etc.

In particular, I think it might be worth looking at past PRs and issues where "try_num" has been mentioned and seeing all the times a fix has been attempted.

It might be solvable, sure, but it should be carefully tested: not only via unit tests, but also likely by manually going through a set of test cases worked out from that historical context, and maybe even by working out some new test scenarios.

For me, this is the kind of issue that comes close to one of the best comments described here:

https://stackoverflow.com/questions/184618/what-is-the-best-comment-in-source-code-you-have-ever-encountered

//
// Dear maintainer:
//
// Once you are done trying to 'optimize' this routine,
// and have realized what a terrible mistake that was,
// please increment the following counter as a warning
// to the next guy:
//
// total_hours_wasted_here = 42
//

Maybe not as difficult, but likely with a similar level of non-obviousness.

@uranusjr
Member

uranusjr commented Apr 17, 2023

Now that we have map_index, which breaks the previously one-to-one task-TI relation anyway, I wonder if it's best if every try simply inserts a new task instance instead of mutating an existing one.

Another possibility (if doing max(try_number) is problematic in some SQL queries) would be to add a table that stores each try of a task instance.
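
A rough sketch of that second idea. The table and names here are hypothetical (this is not an existing Airflow table); it just shows one immutable row per attempt plus the max(try_number) lookup mentioned above:

from sqlalchemy import Column, DateTime, Integer, String, UniqueConstraint, func, select
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class TaskInstanceTry(Base):
    # Hypothetical per-try table: one row per attempt, instead of mutating
    # try_number in place on the task_instance row.
    __tablename__ = "task_instance_try"

    id = Column(Integer, primary_key=True)
    dag_id = Column(String(250), nullable=False)
    task_id = Column(String(250), nullable=False)
    run_id = Column(String(250), nullable=False)
    map_index = Column(Integer, nullable=False, default=-1)
    try_number = Column(Integer, nullable=False)
    start_date = Column(DateTime(timezone=True))
    end_date = Column(DateTime(timezone=True))
    state = Column(String(20))

    __table_args__ = (
        UniqueConstraint("dag_id", "task_id", "run_id", "map_index", "try_number"),
    )

# The "current" try then comes from a query over the per-try rows rather
# than from a mutable column:
latest_try_stmt = select(func.max(TaskInstanceTry.try_number)).where(
    TaskInstanceTry.dag_id == "example_sensor_reschedule",
    TaskInstanceTry.task_id == "wait_for_external",
    TaskInstanceTry.run_id == "scheduled__2023-04-16T14:22:26.249199+00:00",
    TaskInstanceTry.map_index == -1,
)

The unique constraint is what guarantees one row per attempt; whether the max() aggregate is cheap enough in hot scheduler queries is exactly the caveat raised above.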

@potiuk
Member

potiuk commented Apr 17, 2023

> Another possibility (if doing max(try_number) is problematic in some SQL queries) would be to add a table that stores each try of a task instance.

I would be for doing that. It would solve some of the problems (for example, retrieving logs from a different Celery worker when a task is run multiple times).

@awbush
Contributor

awbush commented Apr 27, 2023

Hi all, I'm the author of #30653 and just wanted to share the results of monkey patching our large Airflow instance, containing thousands of DAGs, with that PR: we've had no more issues, and our Sentry alerts for this issue are silent. 🎉

Remember that scheduler_job_runner.py works around this issue. All I did was make the backfill job runner behave the same, as I understood from looking at TaskInstanceKey and traversing years of history that a wider change would be difficult.

Totally happy to see official Airflow get some fix, even if it isn't mine, but I thought I'd share the results above since my PR got closed.

Cheers!

@potiuk
Member

potiuk commented Apr 29, 2023

> Hi all, I'm the author of #30653 and just wanted to share the results of monkey patching our large Airflow instance, containing thousands of DAGs, with that PR: we've had no more issues, and our Sentry alerts for this issue are silent. 🎉
>
> Remember that scheduler_job_runner.py works around this issue. All I did was make the backfill job runner behave the same, as I understood from looking at TaskInstanceKey and traversing years of history that a wider change would be difficult.
>
> Totally happy to see official Airflow get some fix, even if it isn't mine, but I thought I'd share the results above since my PR got closed.

Ah, thanks for the context. I re-opened it then. Maybe it is indeed worth implementing (I will take a closer look), because regardless of the try_num calculation fix, which will only be possible to implement in 2.7, your quick fix to the backfill job might be applicable as a patch in 2.6.*.

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label (Stale PRs per the .github/workflows/stale.yml policy file) on Jun 14, 2023
@hussein-awala hussein-awala added the pinned label (Protect from Stalebot auto closing) and removed the stale label (Stale PRs per the .github/workflows/stale.yml policy file) on Jun 15, 2023
Member

@pankajkoti pankajkoti left a comment


With the recent fixes around try_number from Daniel Standish, would we still need this?

@uranusjr
Member

cc @dstandish

@dstandish
Contributor

> With the recent fixes around try_number from Daniel Standish, would we still need this?

Looks like we don't need the try number stuff; if we need the other stuff, it's probably best to make a new PR.

@dstandish dstandish closed this Jun 20, 2024