
Conversation

@hussein-awala
Member

@hussein-awala hussein-awala commented Apr 16, 2023

closes: #30552
closes: #30572
related: #26993
related: #18080 (it may close it too, I'll check before merging)
closes: #15645


This PR aims to address two interrelated issues:

  • The first issue involves the TI start_date of rescheduled sensors, which is erroneously updated on each new poke.
  • The second issue is the confusion between the TI try number and the sensor poke number within the same TI try. Currently we use the try_number to compute the next_poke_interval, which in turn affects the new reschedule_date (see the sketch at the end of this description).

Although the two issues are closely linked, we may choose to divide them into separate pull requests based on the final implementation.

I will fix the unit tests, add new tests, and add the migration scripts for the new column(s) once everything works as expected in my test DAG.
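
For the second bullet above, here is a minimal sketch of the kind of calculation involved. The function and its parameters are illustrative, not Airflow's actual sensor API; it only shows why the choice of backoff counter matters:

import datetime

def next_poke_interval(base_interval_s: float, backoff_counter: int, max_wait_s: float = 3600.0) -> float:
    # Illustrative exponential backoff, not Airflow's real implementation.
    # If backoff_counter is the TI try_number (which, per the task_reschedule
    # rows quoted further down in this thread, stays the same across
    # reschedules of one try), every poke in that try gets the same interval;
    # if it is a per-try poke number, the interval grows with each poke.
    return min(base_interval_s * (2 ** backoff_counter), max_wait_s)

# Hypothetical usage: derive the next reschedule_date from the chosen counter.
now = datetime.datetime.now(datetime.timezone.utc)
reschedule_date = now + datetime.timedelta(seconds=next_poke_interval(60, backoff_counter=3))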

Member

@ashb ashb left a comment


Eeek, let's think very, very carefully before we repeat, with a new column, the same mistake we made with try number.

The fact that try number is mutated in place has been causing us problems for over 5 years. We really shouldn't ever mutate a row in place like this.

@hussein-awala
Member Author

I agree, but I will continue working on the PR, at least to identify all the problems, and I'm open to all suggestions; that's why I opened this as a draft PR.

The new column is added to solve the issue with the poke number, whose state is not stored between the different reschedules:

airflow=# select * from task_reschedule;
 id |      task_id      |          dag_id           |                   run_id                    | map_index | try_number | poke_number |          start_date           |           end_date            | duration |        reschedule_date        
----+-------------------+---------------------------+---------------------------------------------+-----------+------------+-------------+-------------------------------+-------------------------------+----------+-------------------------------
 81 | wait_for_external | example_sensor_reschedule | scheduled__2023-04-16T14:22:26.249199+00:00 |        -1 |          1 |           1 | 2023-04-16 14:24:32.003232+00 | 2023-04-16 14:24:32.061204+00 |        0 | 2023-04-16 14:24:42.056541+00
 83 | wait_for_external | example_sensor_reschedule | scheduled__2023-04-16T14:22:26.249199+00:00 |        -1 |          1 |           2 | 2023-04-16 14:27:58.576074+00 | 2023-04-16 14:27:58.641273+00 |        0 | 2023-04-16 14:28:08.637106+00
 85 | wait_for_external | example_sensor_reschedule | scheduled__2023-04-16T14:22:26.249199+00:00 |        -1 |          1 |           3 | 2023-04-16 14:28:09.70126+00  | 2023-04-16 14:28:09.847813+00 |        0 | 2023-04-16 14:28:20.112106+00

As far as I know, there are two ways to save state between the different executions: adding a new column to the metadata DB, or storing it in an XCom (not a good solution in our case).
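
For reference, a rough sketch of what the task_reschedule rows above imply as a schema. This is an illustrative SQLAlchemy model, not the actual Airflow model or migration; the column names are taken from the output above, and poke_number is the proposed addition:

from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class TaskRescheduleSketch(Base):
    # Illustrative stand-in for the task_reschedule table shown above;
    # not the real Airflow model.
    __tablename__ = "task_reschedule"

    id = Column(Integer, primary_key=True)
    task_id = Column(String(250), nullable=False)
    dag_id = Column(String(250), nullable=False)
    run_id = Column(String(250), nullable=False)
    map_index = Column(Integer, nullable=False, default=-1)
    try_number = Column(Integer, nullable=False)
    poke_number = Column(Integer, nullable=False, default=1)  # proposed: survives reschedules within a try
    start_date = Column(DateTime(timezone=True))
    end_date = Column(DateTime(timezone=True))
    duration = Column(Integer)
    reschedule_date = Column(DateTime(timezone=True))

The sensor could then read the latest poke_number for its (dag_id, task_id, run_id, map_index, try_number) at the start of each poke instead of deriving the poke count from try_number.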

@potiuk
Member

potiuk commented Apr 16, 2023

Yes. I agree with @ashb that the try_number mutation has been a bummer and has some non-obvious historical connotations, so it should be very, very carefully checked, especially in all the more exotic scenarios: retries on failure, backfills, manual runs, etc.

In particular, I think it might be worth looking at past PRs and issues where "try_num" has been mentioned and seeing all the times a fix has been attempted.

It might be solvable, sure, but it should be carefully tested: not only via unit tests, but also likely by manually going through a set of test cases worked out from that historical context, and maybe even by working out some new test scenarios.

For me, this is the kind of issue that comes close to one of the best comments described here:

https://stackoverflow.com/questions/184618/what-is-the-best-comment-in-source-code-you-have-ever-encountered

//
// Dear maintainer:
//
// Once you are done trying to 'optimize' this routine,
// and have realized what a terrible mistake that was,
// please increment the following counter as a warning
// to the next guy:
//
// total_hours_wasted_here = 42
//

Maybe not as difficult, but likely with a similar level of non-obviousness.

@uranusjr
Member

uranusjr commented Apr 17, 2023

Now that we have map_index, which breaks the previously one-to-one task-TI relation anyway, I wonder if it's best if every try simply inserts a new task instance instead of mutating an existing one.

Another possibility (if doing max(try_number) is problematic in some SQL queries) would be to add a table that stores each try of a task instance.
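
A rough sketch of that second idea. The table and names here are hypothetical (this is not an existing Airflow table); it just shows one immutable row per attempt plus the max(try_number) lookup mentioned above:

from sqlalchemy import Column, DateTime, Integer, String, UniqueConstraint, func, select
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class TaskInstanceTry(Base):
    # Hypothetical per-try table: one row per attempt, instead of mutating
    # try_number in place on the task_instance row.
    __tablename__ = "task_instance_try"

    id = Column(Integer, primary_key=True)
    dag_id = Column(String(250), nullable=False)
    task_id = Column(String(250), nullable=False)
    run_id = Column(String(250), nullable=False)
    map_index = Column(Integer, nullable=False, default=-1)
    try_number = Column(Integer, nullable=False)
    start_date = Column(DateTime(timezone=True))
    end_date = Column(DateTime(timezone=True))
    state = Column(String(20))

    __table_args__ = (
        UniqueConstraint("dag_id", "task_id", "run_id", "map_index", "try_number"),
    )

# The "current" try then comes from a query over the per-try rows rather
# than from a mutable column:
latest_try_stmt = select(func.max(TaskInstanceTry.try_number)).where(
    TaskInstanceTry.dag_id == "example_sensor_reschedule",
    TaskInstanceTry.task_id == "wait_for_external",
    TaskInstanceTry.run_id == "scheduled__2023-04-16T14:22:26.249199+00:00",
    TaskInstanceTry.map_index == -1,
)

The unique constraint is what guarantees one row per attempt; whether the max() aggregate is cheap enough in hot scheduler queries is exactly the caveat raised above.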

@potiuk
Member

potiuk commented Apr 17, 2023

> Another possibility (if doing max(try_number) is problematic in some SQL queries) would be to add a table that stores each try of a task instance.

I would be for doing that. It would solve some of the problems (for example, retrieving logs from a different Celery worker when a task is run multiple times).

@awbush
Contributor

awbush commented Apr 27, 2023

Hi all, I'm the author of #30653 and just wanted to share the results of monkey patching our large Airflow instance, containing thousands of DAGs, with that PR: we've had no more issues, and our Sentry alerts for this issue are silent. 🎉

Remember that scheduler_job_runner.py works around this issue. All I did was make the backfill job runner behave the same, as I understood from looking at TaskInstanceKey and traversing years of history that a wider change would be difficult.

Totally happy to see official Airflow get some fix, even if it isn't mine, but I thought I'd share the results above since my PR got closed.

Cheers!

@potiuk
Member

potiuk commented Apr 29, 2023

> Hi all, I'm the author of #30653 and just wanted to share the results of monkey patching our large Airflow instance, containing thousands of DAGs, with that PR: we've had no more issues, and our Sentry alerts for this issue are silent. 🎉
>
> Remember that scheduler_job_runner.py works around this issue. All I did was make the backfill job runner behave the same, as I understood from looking at TaskInstanceKey and traversing years of history that a wider change would be difficult.
>
> Totally happy to see official Airflow get some fix, even if it isn't mine, but I thought I'd share the results above since my PR got closed.

Ah, thanks for the context. I re-opened it then. Maybe it is indeed worth implementing (I will take a closer look), because regardless of the try_num calculation fix, which will only be possible to implement in 2.7, your quick fix to the backfill job might be applicable as a patch in 2.6.*.

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label (Stale PRs per the .github/workflows/stale.yml policy file) on Jun 14, 2023
@hussein-awala hussein-awala added the pinned label (Protect from Stalebot auto closing) and removed the stale label (Stale PRs per the .github/workflows/stale.yml policy file) on Jun 15, 2023
Member

@pankajkoti pankajkoti left a comment


With the recent fixes around try_number from Daniel Standish, would we still need this?

@uranusjr
Member

cc @dstandish

@dstandish
Contributor

> With the recent fixes around try_number from Daniel Standish, would we still need this?

Looks like we don't need the try number stuff; if we need the other stuff, it's probably best to make a new PR.

@dstandish dstandish closed this Jun 20, 2024