Databricks hook - retry on HTTP Status 429 as well #21852
Conversation
Don't you think this should have exponential back-off on 429 (at the very least, but also on 500 IMHO)? It is highly likely that you will only make the problem worse by retrying with a fixed retry_delay. We use tenacity for similar cases in other places, so maybe you should change this too, @alexott, to follow the same pattern?
Yes, I thought about it as well, but I'll need to think again. Let me mark this PR as a draft.
Look for tenacity in google providers :)
Or HTTP or SFTP. The same pattern was used there in a number of places.
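A minimal sketch of the tenacity-based retry pattern referred to above, as it is used in other providers (the function name and parameters here are illustrative, not the exact provider code):

```python
import requests
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


@retry(
    stop=stop_after_attempt(5),                           # give up after 5 attempts
    wait=wait_exponential(multiplier=1, min=1, max=60),   # exponentially increasing wait, 1-60 seconds
    retry=retry_if_exception_type(requests.exceptions.RequestException),
    reraise=True,                                         # re-raise the last error to the caller
)
def call_api(url: str) -> dict:
    """Perform a GET request, retrying transient failures with exponential backoff."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raises HTTPError on 4xx/5xx, which triggers a retry here
    return response.json()
```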
If I were the API endpoint's maintainer, I would very much prefer that a client does not retry when I tell them 429, at least not before my specified Retry-After.
The exponential backoff we use for that is actually the best of both worlds: it does not stop, but it also decreases the pressure.
Exponential backoff still retries too early in most situations. A client receiving 429 is supposed to wait at least until the date specified in the Retry-After header.
Sure. Retry-After should be the source of the first retry time, but exponential back-off after that does not hurt.

Just to add a bit more reasoning (I thought about it a bit). The problem is that Retry-After is only a hint, not a "source of truth". It relies on the server "knowing" what it is doing, which is not necessarily valid, so IMHO it is the client that needs to decide how to behave. One added value of exponential backoff is that it is still helpful in retriable conditions that are "unknown", i.e. 5xx; those do not carry a "Retry-After" to base your decision on. So I think exponential back-off with an initial timeout (if Retry-After is available, it should be the starting point) is the right approach. Additionally, if even with exponential back-off you receive a 429 whose "Retry-After" is longer than your next back-off step, the timeout should be re-adjusted to the received "Retry-After". But if the next step is already longer, we should continue with exponential back-off (because the server's information might already be outdated and not account for a mounting traffic spike).
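A hypothetical sketch of the strategy described above, assuming the code that raises the error attaches the parsed Retry-After value to the exception (the class and attribute names are illustrative, not Airflow code):

```python
from tenacity import wait_exponential
from tenacity.wait import wait_base


class wait_retry_after_or_exponential(wait_base):
    """Honor the server-provided Retry-After, but never wait less than the exponential step."""

    def __init__(self, multiplier: float = 1, max_wait: float = 600) -> None:
        self._exponential = wait_exponential(multiplier=multiplier, max=max_wait)

    def __call__(self, retry_state) -> float:
        exponential_delay = self._exponential(retry_state)
        exc = retry_state.outcome.exception() if retry_state.outcome else None
        # Hypothetical convention: the raising code stores the parsed Retry-After
        # value (in seconds) as `retry_after`; fall back to pure exponential
        # backoff when it is absent (e.g. plain 5xx responses).
        retry_after = getattr(exc, "retry_after", None)
        if retry_after is None:
            return exponential_delay
        # Use whichever delay is longer, per the reasoning above.
        return max(float(retry_after), exponential_delay)
```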
Force-pushed from c73e3fe to b0fae79
It now uses exponential backoff by default.
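Roughly, "exponential backoff by default" at the hook level could look like the sketch below: a default tenacity configuration that callers can override via a retry_args-style parameter (the helper and parameter names are illustrative, not the actual hook API):

```python
from typing import Callable, Optional

from tenacity import Retrying, retry_if_exception, stop_after_attempt, wait_exponential


def build_retry_args(
    user_retry_args: Optional[dict],
    is_retryable: Callable[[BaseException], bool],
) -> dict:
    """Return tenacity keyword arguments, defaulting to exponential backoff."""
    if user_retry_args:
        # Respect the caller's configuration, but keep the retry predicate consistent.
        return {**user_retry_args, "retry": retry_if_exception(is_retryable)}
    return {
        "stop": stop_after_attempt(3),            # give up after 3 attempts
        "wait": wait_exponential(min=1, max=32),  # exponentially increasing wait, capped at 32 seconds
        "retry": retry_if_exception(is_retryable),
        "reraise": True,                          # surface the last error to the caller
    }


# Usage sketch: wrap each API call in a Retrying loop built from these arguments.
# for attempt in Retrying(**build_retry_args(None, lambda exc: True)):
#     with attempt:
#         ...  # perform the HTTP request here
```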
Force-pushed from b0fae79 to 8f7911b
@potiuk I think that it's ready for review now... |
potiuk left a comment
LGTM. @uranusjr ?
The PR is likely OK to be merged with just a subset of tests for the default Python and Database versions, without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full test matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.
If the Databricks control plane receives too many API requests, it starts to return HTTP status 429, and the caller should retry the request. Previously, the Databricks hook retried only on 5xx status codes.
closes: #21559
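A minimal sketch of the change being described: treat HTTP 429 as retryable in addition to 5xx and transport-level errors (the helper name is illustrative; see the provider's Databricks hook for the actual implementation):

```python
from requests import exceptions as requests_exceptions


def is_retryable_error(exception: BaseException) -> bool:
    """Return True for errors worth retrying: connection problems, 429, and 5xx responses."""
    if isinstance(exception, (requests_exceptions.ConnectionError, requests_exceptions.Timeout)):
        return True
    if isinstance(exception, requests_exceptions.HTTPError) and exception.response is not None:
        status = exception.response.status_code
        # Previously only 5xx was considered retryable; 429 ("Too Many Requests") now is too.
        return status == 429 or status >= 500
    return False
```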