Issue:
We ran into a problem caused mainly by Airflow flooding Vault with requests while Vault was experiencing downtime.
Context:
A few days ago, while Vault was down for 2 to 3 hours, all the worker pods in our Airflow cluster on K8s kept retrying to pull credentials (URIs) from Vault. This had two consequences. First, we found that Airflow made more than 500 calls per hour to Vault. Second, as a result, our DevOps team couldn't connect to Vault to figure out the root cause that had brought it down. Airflow therefore became the target of blame...
Question:
Is there a built-in retry strategy that would prevent this kind of issue from happening again? More generally, the same problem can occur with any secrets backend, such as AWS Secrets Manager, Google Secret Manager, or even RDS.
If not, and if you think it is worth implementing, I would like to take this ticket and implement such a strategy.
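To make the idea concrete, here is a rough sketch of the kind of strategy I have in mind: exponential backoff with jitter, plus a simple circuit breaker that stops calling the backend for a cooldown period after repeated failures. This is not an existing Airflow API. The class `BackoffCircuitBreaker`, the exception `CircuitOpenError`, and all parameter names and defaults are hypothetical and only for illustration; `fetch` stands in for whatever call actually reads the secret.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the backend looks down."""


class BackoffCircuitBreaker:
    """Hypothetical wrapper: retries with jittered exponential backoff and
    stops calling the backend entirely after repeated failures."""

    def __init__(self, max_attempts=4, base_delay=1.0, max_delay=30.0,
                 failure_threshold=3, cooldown=60.0,
                 sleep=time.sleep, clock=time.monotonic):
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.sleep = sleep    # injectable so the wait can be faked in tests
        self.clock = clock    # injectable monotonic time source
        self._consecutive_failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fetch):
        """Run fetch() with retries, or fail fast while the circuit is open."""
        if self._opened_at is not None:
            if self.clock() - self._opened_at < self.cooldown:
                raise CircuitOpenError("secrets backend marked unavailable")
            # Cooldown elapsed: allow a trial call. The failure count is kept,
            # so a single new failure re-opens the circuit immediately.
            self._opened_at = None

        for attempt in range(self.max_attempts):
            try:
                result = fetch()
            except Exception:
                self._consecutive_failures += 1
                if self._consecutive_failures >= self.failure_threshold:
                    self._opened_at = self.clock()  # open the circuit
                    raise
                if attempt == self.max_attempts - 1:
                    raise  # out of attempts for this call
                # Full jitter: wait a random time up to the capped backoff,
                # so many workers do not retry in lockstep.
                cap = min(self.max_delay, self.base_delay * 2 ** attempt)
                self.sleep(random.uniform(0, cap))
            else:
                self._consecutive_failures = 0
                return result
```

With something like this around the secret lookup, each worker would make at most a handful of calls per outage window and then fail fast until the cooldown expires, instead of retrying continuously. The state here is per process, so the total load still scales with the number of worker pods; sharing the breaker state across workers would be a separate design question.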
Thanks