retry on very specific eni provision failures #22002
Merged
An earlier PR, #14263, tried to add a "retry" mechanism for triggering a new ECS task, but it did not work as expected, as I explained in #16150.
The TL;DR version is:
A run_task request to AWS is not reliable. To make sure our task is really running in good health, we have to wait and send another describe_tasks request.
This explanation still holds today. As more people adopt Fargate, the issue is becoming more acute. On a good day, 0.15% of Fargate tasks can fail to start because of ENI (AWS Elastic Network Interface) provision failures. On a bad day, the rate can rise to 1 to 2%.
Per recommendations from the AWS support team, we should treat the steps "triggering an ECS task, waiting for it to be provisioned, checking its status" as a single routine that can be retried as a whole.
This PR is largely based on the framework established in #14263.
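The retryable routine described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `start_and_verify`, `run_with_eni_retry`, `EniProvisionError`, and the `ENI_FAILURE_HINT` substring are hypothetical names chosen for the example, and `client` is assumed to be anything exposing `run_task` and `describe_tasks` in the shape of the boto3 ECS client.

```python
import time

# Assumption: a substring used to recognise an ENI provision failure in the
# task's stoppedReason. The exact wording is not given in this description.
ENI_FAILURE_HINT = "network interface provisioning"


class EniProvisionError(Exception):
    """Hypothetical error: the task stopped because its ENI was never provisioned."""


def start_and_verify(client, run_kwargs, poll_interval=5.0, max_polls=60):
    """Trigger one task, wait for provisioning, and check its status."""
    response = client.run_task(**run_kwargs)
    if response.get("failures"):
        raise RuntimeError(f"run_task reported failures: {response['failures']}")
    task_arn = response["tasks"][0]["taskArn"]

    for _ in range(max_polls):
        described = client.describe_tasks(
            cluster=run_kwargs["cluster"], tasks=[task_arn]
        )
        task = described["tasks"][0]
        status = task["lastStatus"]
        if status == "STOPPED" and ENI_FAILURE_HINT in task.get("stoppedReason", ""):
            # Only this very specific failure is treated as retryable.
            raise EniProvisionError(task["stoppedReason"])
        if status in ("RUNNING", "STOPPED"):
            # The task started (or stopped for some other reason); hand it
            # back to the caller for normal monitoring.
            return task_arn
        time.sleep(poll_interval)
    raise TimeoutError(f"task {task_arn} did not start in time")


def run_with_eni_retry(client, run_kwargs, attempts=3, poll_interval=5.0, max_polls=60):
    """Retry the whole trigger/wait/check routine on ENI provision failures."""
    for attempt in range(1, attempts + 1):
        try:
            return start_and_verify(client, run_kwargs, poll_interval, max_polls)
        except EniProvisionError:
            if attempt == attempts:
                raise  # out of attempts, surface the failure
```

The point of the design is that a retry wraps all three steps together, so a task whose `run_task` call succeeded but which later stopped on an ENI failure is relaunched from scratch rather than merely re-described.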