@zachliu
Contributor

@zachliu zachliu commented Mar 4, 2022

In an earlier PR, #14263, we tried to add a "retry" mechanism for launching a new ECS task, but it did not work as expected, for the reasons I explained in #16150.

The TL;DR version is:

  • The immediate API response to a run_task request sent to AWS is not reliable. To confirm that the task is actually running and healthy, we have to wait and then send a separate describe_tasks request.
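As a rough illustration of the point above, the two calls can be sketched as follows. This is a minimal sketch rather than the code in this PR: `start_and_confirm`, `poll_interval` and `max_polls` are hypothetical names, and `ecs` is assumed to be a boto3 ECS client (or any object exposing `run_task` and `describe_tasks` with the same response shapes).

```python
import time


def start_and_confirm(ecs, run_task_kwargs, poll_interval=5.0, max_polls=60):
    """Start one ECS task, then poll describe_tasks until it settles.

    Hypothetical helper: `ecs` is assumed to behave like a boto3 ECS client.
    """
    response = ecs.run_task(**run_task_kwargs)
    # run_task can already report failures in its own response.
    if response.get("failures") or not response.get("tasks"):
        raise RuntimeError(f"run_task failed: {response.get('failures')}")

    task_arn = response["tasks"][0]["taskArn"]
    cluster = run_task_kwargs.get("cluster", "default")

    # An accepted request is not proof of a healthy task, so ask again.
    for _ in range(max_polls):
        described = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])
        tasks = described.get("tasks", [])
        if tasks and tasks[0].get("lastStatus") in ("RUNNING", "STOPPED"):
            return tasks[0]
        time.sleep(poll_interval)
    raise TimeoutError(f"{task_arn} did not reach RUNNING or STOPPED")
```

The caller still has to inspect the returned task: a `STOPPED` status may mean the task never started, which is the case the retry is meant to cover.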

This explanation still holds today. As more people move to Fargate, the issue is becoming more and more acute. On a good day, about 0.15% of Fargate tasks could fail to start due to ENI (AWS Elastic Network Interface) provisioning failures. On a bad day, however, that could rise to 1-2%.

Per the recommendation of the AWS support team, we should treat the three steps (triggering an ECS task, waiting for it to be provisioned, and checking its status) as a single routine that can be retried as a whole.
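To make "retried as a whole" concrete, here is a minimal sketch of the idea, not the implementation in this PR. All names (`TaskFailedToStart`, `check_started`, `retry_whole_routine`) are hypothetical. It assumes a failed start shows up in the `describe_tasks` result as a `STOPPED` task with `stopCode` equal to `TaskFailedToStart`, and that `routine` is a callable that performs trigger, wait and describe in one go and returns the described task dict.

```python
import time


class TaskFailedToStart(Exception):
    """Hypothetical exception: the task stopped before it ever ran."""


def check_started(task):
    """Raise if the described task never started (assumed response shape)."""
    if task.get("lastStatus") == "STOPPED" and task.get("stopCode") == "TaskFailedToStart":
        raise TaskFailedToStart(task.get("stoppedReason", "unknown reason"))
    return task


def retry_whole_routine(routine, max_attempts=3, delay=10.0, sleep=time.sleep):
    """Run trigger + wait + check as one unit and retry that unit.

    `routine` is assumed to launch a fresh task on every call.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return check_started(routine())
        except TaskFailedToStart:
            if attempt == max_attempts:
                raise  # out of attempts, surface the last failure
            sleep(delay)
```

Only the failed-to-start case is retried here. A task that started and then exited with an error is returned to the caller unchanged, since rerunning it automatically could repeat work that is not safe to repeat.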

This PR is largely based on the framework established in #14263.



@boring-cyborg boring-cyborg bot added area:providers provider:amazon AWS/Amazon - related issues labels Mar 4, 2022
@zachliu zachliu force-pushed the ecsoperator-retry-on-eni-provisioning-failures branch from b5f1e36 to a6b8be3 Compare March 4, 2022 21:42
@potiuk potiuk merged commit 01a1a26 into apache:main Mar 7, 2022
@github-actions github-actions bot added the okay to merge It's ok to merge this PR as it does not require more tests label Mar 7, 2022
github-actions bot commented Mar 7, 2022

The PR is likely OK to be merged with just a subset of tests for the default Python and Database versions, without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full test matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.

@zachliu zachliu deleted the ecsoperator-retry-on-eni-provisioning-failures branch March 7, 2022 15:24
@zachliu zachliu mentioned this pull request Mar 10, 2022