
Conversation

@alexott (Contributor) commented Feb 27, 2022

If the Databricks control plane receives too many API requests, it starts returning HTTP status 429 and the caller should retry the request. But the Databricks hook retried only on 5xx status codes.

closes: #21559
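
For context, the change amounts to widening the retry predicate so that a 429 response is treated as retryable alongside 5xx. A minimal sketch of that idea, assuming a hypothetical helper named `_retryable_error` (the hook's real code may differ):

```python
import requests


def _retryable_error(exception: requests.exceptions.RequestException) -> bool:
    """Decide whether a failed Databricks API call should be retried (sketch only)."""
    if isinstance(exception, (requests.ConnectionError, requests.Timeout)):
        # Network-level failures are always worth another attempt.
        return True
    if exception.response is None:
        return False
    status = exception.response.status_code
    # Previously only 5xx was retried; 429 (Too Many Requests) is retryable too.
    return status >= 500 or status == 429
```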

@potiuk (Member) commented Feb 27, 2022

Don't you think this should have exponential back-off on 429 (at the very least, but IMHO also on 500)? Retrying with a fixed retry_delay is highly likely to only make the problem worse. We use tenacity for similar cases in other places, so maybe you should change it too, @alexott, to follow the same pattern?
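
The tenacity pattern referred to here looks roughly like the sketch below. The retry parameters, function names, and the predicate are illustrative assumptions, not the values or code the Databricks hook actually uses:

```python
import requests
from tenacity import Retrying, retry_if_exception, stop_after_attempt, wait_exponential


def _should_retry(exc: BaseException) -> bool:
    """Same idea as the predicate sketched earlier: retry on connection problems, 429 and 5xx."""
    if isinstance(exc, (requests.ConnectionError, requests.Timeout)):
        return True
    if isinstance(exc, requests.HTTPError) and exc.response is not None:
        status = exc.response.status_code
        return status == 429 or status >= 500
    return False


def call_api(url: str, token: str) -> dict:
    """Call an API endpoint, retrying transient failures with exponential back-off."""
    for attempt in Retrying(
        stop=stop_after_attempt(5),                          # give up after 5 tries
        wait=wait_exponential(multiplier=1, min=1, max=60),  # exponential back-off, capped at 60s
        retry=retry_if_exception(_should_retry),
        reraise=True,                                        # surface the last exception on failure
    ):
        with attempt:
            response = requests.get(
                url, headers={"Authorization": f"Bearer {token}"}, timeout=30
            )
            response.raise_for_status()
            return response.json()
```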

@alexott (Contributor, Author) commented Feb 27, 2022

Yes, I thought about it as well, but I'll need to think it through again. Let me mark this PR as a draft.

@potiuk (Member) commented Feb 27, 2022

Look for tenacity in the Google providers :)

@potiuk (Member) commented Feb 27, 2022

Or HTTP or SFTP. The same pattern was used there in a number of places.

@alexott changed the title from "Databricks hook - retry on HTTP Status 429 as well" to "[DRAFT] Databricks hook - retry on HTTP Status 429 as well" Feb 27, 2022
@uranusjr (Member) commented Mar 1, 2022

If I were the API endpoint's maintainer, I would very much prefer that a client does not retry when I tell them 429, at least not before my specified Retry-After. 429 tells you to stop, not to try harder.

@potiuk (Member) commented Mar 1, 2022

> If I were the API endpoint's maintainer, I would very much prefer that a client does not retry when I tell them 429, at least not before my specified Retry-After. 429 tells you to stop, not to try harder.

The exponential back-off we use for that is actually the best of both worlds: it does not stop, but it also decreases the pressure.

@uranusjr (Member) commented Mar 1, 2022

Exponential backoff still tries too early in most situations. A client receiving 429 is supposed to wait at least until the date specified in the Retry-After response header before any retries.

@potiuk (Member) commented Mar 1, 2022

> Exponential backoff still tries too early in most situations. A client receiving 429 is supposed to wait at least until the date specified in the Retry-After response header before any retries.

Sure. Retry-After should set the time of the first retry, but exponential back-off after that does not hurt.

@potiuk (Member) commented Mar 1, 2022

> Exponential backoff still tries too early in most situations. A client receiving 429 is supposed to wait at least until the date specified in the Retry-After response header before any retries.

> Sure. Retry-After should set the time of the first retry, but exponential back-off after that does not hurt.

Just to add a bit more reasoning (I thought about it a bit more).

The problem is that Retry-After is only a hint, not a "source of truth". It relies on the server "knowing" what it is doing, which is not necessarily valid:
a) it might be based on past information (which might be outdated and might not properly account for a mounting traffic spike) - this happens often
b) it might simply not be there. Often you will not get 429 but 5XX in similar situations, because it's not only the server that gets flooded but also some gateways along the way, or the server might simply time out or run out of memory or other resources.

So IMHO it needs to be the client that decides how to behave. One added value of exponential backoff is that it is still helpful in all retriable conditions that are "unknown", i.e. 5XX. Those do not carry a Retry-After to base your decision on.

So I think exponential back-off with some initial timeout (if Retry-After is available, it should be the starting point) is the right approach. Additionally, if even with exponential back-off you get a 429 whose Retry-After is longer than your next back-off step, the delay should be re-adjusted to the Retry-After received. But if the next back-off step is longer, we should continue with the exponential back-off (because the server's information might already be outdated and not account for the mounting traffic spike).
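
A sketch of that policy in code (illustrative only, not what the PR implements; the function name and parameters are made up for this example):

```python
import email.utils
import time
from typing import Optional

import requests


def next_retry_delay(
    attempt: int,
    response: Optional[requests.Response],
    base: float = 1.0,
    cap: float = 600.0,
) -> float:
    """Wait time before the next retry: the larger of exponential back-off and Retry-After."""
    backoff = min(cap, base * (2 ** attempt))  # exponential back-off step, capped

    retry_after = 0.0
    if response is not None and "Retry-After" in response.headers:
        value = response.headers["Retry-After"]
        if value.isdigit():
            # delta-seconds form, e.g. "Retry-After: 120"
            retry_after = float(value)
        else:
            # HTTP-date form, e.g. "Retry-After: Fri, 31 Dec 1999 23:59:59 GMT"
            parsed = email.utils.parsedate_to_datetime(value)
            retry_after = max(0.0, parsed.timestamp() - time.time())

    # Honour the server's hint when it asks for a longer pause than the next back-off step;
    # otherwise keep the (longer) exponential back-off.
    return max(backoff, retry_after)
```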

@alexott force-pushed the databricks-retry-on-429 branch from c73e3fe to b0fae79 on March 6, 2022 16:16
@alexott requested a review from mik-laj as a code owner on March 6, 2022 16:16
It now uses exponential backoff by default
@alexott force-pushed the databricks-retry-on-429 branch from b0fae79 to 8f7911b on March 6, 2022 16:43
@alexott changed the title from "[DRAFT] Databricks hook - retry on HTTP Status 429 as well" to "Databricks hook - retry on HTTP Status 429 as well" Mar 6, 2022
@alexott (Contributor, Author) commented Mar 6, 2022

@potiuk I think that it's ready for review now...

@potiuk (Member) left a comment

LGTM. @uranusjr ?

@github-actions bot added the "okay to merge" label Mar 8, 2022
@github-actions bot commented Mar 8, 2022

The PR is likely OK to be merged with just a subset of tests for the default Python and database versions, without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full test matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.


Labels: area:providers, okay to merge

Closes: Databricks hook: Retry also on HTTP Status 429 - rate limit exceeded (#21559)