Skip to content

Looping issue using Hashicorp Vault #9713

@benbenbang

Description

@benbenbang
  • Issue:
    We encountered an issue that is mainly caused by overloading calls from Airflow to Vault while Vault was experiencing downtime.

  • Context:
    A few days ago while Vault was down for 2 - 3 hours, all the worker pods in our Airflow cluster on K8s started retrying to pull credentials (URI) from Vault. This caused two things: the first one is that we found that airflow tried more than 500 calls per hour to call Vault and then lead to the second issue: our DevOps couldn't connect to Vault and figured out the root cause that brought down Vault. Thus Airflow became the target of blame...

  • Question:
    Is there any built-in retry strategy to prevent this kind of issue happens again? To generalize this issue, it can occur while using any secret manager, like AWS secrets manager, Google sm, or even RDS.
    If not and if you think it is worth being implemented, I would like to take this ticket to implement this kind of strategy to prevent this issue.

Thanks

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions