Add Retry Mechanism for Vault Connection Failures #30628
Closed
This pull request introduces a retry mechanism so that Vault connection failures are handled more gracefully. It uses the urllib3.util.Retry class to retry requests that fail with HTTP status codes that are likely to be transient: 412 (Precondition Failed), 500 (Internal Server Error), 502 (Bad Gateway), and 503 (Service Unavailable).
These changes are implemented within the Vault client file, using the hvac library, to enhance the robustness and reliability of the connection with Vault. The hvac library's Client class allows for the injection of a custom Session object, which we can configure with an HTTPAdapter instance to apply our desired retry behavior.
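To make the approach concrete, here is a minimal sketch of the session injection described above. It is not the exact code in this PR: the helper names `build_retry_session` and `make_vault_client`, and the values `total=3` and `backoff_factor=0.1`, are illustrative assumptions. Only the status code list comes from the description.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util import Retry

# Status codes named in this PR as likely to be transient.
RETRY_STATUS_CODES = (412, 500, 502, 503)


def build_retry_session(total=3, backoff_factor=0.1):
    """Return a requests.Session that retries the status codes above.

    `total` and `backoff_factor` are assumed defaults for illustration.
    """
    retry = Retry(
        total=total,
        backoff_factor=backoff_factor,
        status_forcelist=RETRY_STATUS_CODES,
        # Hand the final response back to the caller instead of raising
        # from urllib3, so hvac can map it to its own exception types.
        raise_on_status=False,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Apply the same retry policy to both schemes.
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


def make_vault_client(url, **client_kwargs):
    """Hypothetical helper: build an hvac client with the retrying session."""
    import hvac  # imported lazily so the session helper works without hvac

    return hvac.Client(url=url, session=build_retry_session(), **client_kwargs)
```

Note that with its default settings, `Retry` only retries status codes from `status_forcelist` for idempotent methods such as GET and PUT. POST requests are not retried unless the allowed methods are widened explicitly.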
As mentioned in the hvac documentation, thoughtful retrying of failed requests is crucial for a seamless experience with Vault, especially in the context of eventual consistency. Vault may return a 412 status code when data is not yet available on the node where the request was made. In addition, retrying 5xx status codes is generally advisable.
With these changes, Airflow's Vault integration retries transient failures instead of surfacing them on the first failed request, which should make it more resilient and fault-tolerant.