[WIP] Fix try_number calculation and add a new column poke_number #30669
ashb
left a comment
Eeek, let's think very very carefully before we repeat the same mistake we had with Try Number with a new column.
The fact that try number is mutated in place has been causing us problems for over 5 years. We shouldn't ever mutate a row in place like this really.
|
I agree, but I will continue working on the PR, at least to identify all the problems, and I'm open to all suggestions; that's why I opened this draft PR. The new column is added to solve the issue with the poke number, whose state is not stored between the different reschedules. As far as I know, there are two ways to save state between the different executions: adding a new column in the metadata DB, or storing it in an XCom (not a good solution in our case). |
|
Yes. I agree with @ashb that the try_number mutation has been a bummer and has some historical connotations that are non-obvious, and it should be very, very carefully checked, especially all the more exotic scenarios: retries on failure, backfills, manual runs, etc. I think it might be worth looking at past PRs and issues where "try_num" has been mentioned to see all the times a fix has been attempted. It might be solved, sure, but it should be carefully tested, not only via unit tests but also likely by manually going through a set of test cases worked out from that historical context, and maybe even working out some new test scenarios. For me this is the kind of issue that is close to one of the best comments described here:
Maybe not as difficult, but likely with a similar level of non-obviousness. |
|
Now that we have Another possibility (if doing |
I would be for doing that. This would solve some of the problems (for example retrieving logs from a different celery worker when it is run multiple times) |
|
Hi all, I'm the author of #30653 and just wanted to share results of monkey patching our large airflow instance containing thousands of DAGs with that PR: we've had no more issues and our Sentry alerts for this issue are silent. 🎉 Remember that scheduler_job_runner.py works around this issue. All I did was make backfill job runner behave the same, as I understood from looking at TaskInstanceKey and traversing years of history that a wider change would be difficult. Totally happy to see official airflow get some fix, even if it isn't mine, but thought I'd share results ^ since my PR got closed. Cheers! |
Ah, thanks for the context. I re-opened it then. Maybe it is indeed worth implementing. I will take a closer look, because regardless of the try_num calculation fix (which will only be possible to implement in 2.7), your quick fix to the backfill job might be applicable as a patch in 2.6.*
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions. |
pankajkoti
left a comment
With the recent fixes around try_number from Daniel Standish, would we still need this?
|
cc @dstandish |
looks like we don't need the try number stuff; if we need the other stuff, probably best to make a new pr |
closes: #30552
closes: #30572
related: #26993
related: #18080 (it may close it too, I'll check before merging)
closes: #15645
This PR aims to address two interrelated issues:
try_number to compute the next_poke_interval, which has an impact on the new reschedule_date.
Although the two issues are closely linked, we may choose to divide them into separate pull requests based on the final implementation.
I will fix the unit tests, add new tests and add the migration scripts for the new column(s) once everything works as expected in my test dag.
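To illustrate why the counter that feeds next_poke_interval matters, here is a minimal sketch. It assumes a simple doubling backoff capped at a maximum; the function names, the cap, and the 60-second base are hypothetical and not Airflow's actual implementation. It compares a counter that never advances between reschedules with a dedicated poke counter that increments on every poke.

```python
from datetime import datetime, timedelta, timezone


def next_poke_interval(base_seconds: float, attempt: int, max_seconds: float = 3600.0) -> float:
    """Hypothetical backoff: base * 2**attempt, capped at max_seconds."""
    return min(base_seconds * (2 ** attempt), max_seconds)


def reschedule_date(now: datetime, base_seconds: float, attempt: int) -> datetime:
    """The next reschedule time is simply now + the computed interval."""
    return now + timedelta(seconds=next_poke_interval(base_seconds, attempt))


# Assumption: the counter passed in stays at 1 across reschedules,
# so the interval never grows.
stuck_intervals = [next_poke_interval(60, 1) for _ in range(4)]

# With a dedicated poke counter that increments on every poke,
# the interval doubles each time until it reaches the cap.
growing_intervals = [next_poke_interval(60, poke) for poke in range(1, 5)]

now = datetime(2023, 4, 17, 12, 0, tzinfo=timezone.utc)
third_poke_date = reschedule_date(now, 60, 3)  # 480 seconds after `now`
```

Under these assumptions, a counter that does not change yields the same reschedule delay every time, while a per-poke counter produces the expected growing delays.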