Refresh DruidLeaderClient cache selectively for non-200 responses #14092

Merged
gianm merged 8 commits into apache:master from abhishek-chouhan:master on Apr 20, 2023

Conversation

@abhishek-chouhan
Contributor

@abhishek-chouhan abhishek-chouhan commented Apr 14, 2023

Fixes #14090.

Description

Introduces retries for non-200 responses, along with cache invalidation. The caller can still control the behavior by passing in the expected responses (in addition to 200), in which case the response is returned without any internal retries. The internal retries with cache invalidation are best-effort; if the failures persist, the failed response is returned to the caller. This keeps the existing behavior of the API as is.
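The retry-with-cache-invalidation flow described above (narrowed during review to 503/504) can be sketched roughly as follows. This is a minimal, self-contained illustration with hypothetical class and method names, not the actual DruidLeaderClient code:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

// Hypothetical sketch of best-effort retry with leader-cache invalidation.
public class LeaderRetrySketch
{
  static final int MAX_RETRIES = 5;

  // `send` simulates one request to the currently cached leader;
  // `invalidateCache` simulates dropping the cached leader location so the
  // next request re-discovers the leader.
  static int goWithRetries(IntSupplier send, Runnable invalidateCache)
  {
    int status = send.getAsInt();
    int attempts = 0;
    while ((status == 503 || status == 504) && attempts < MAX_RETRIES) {
      // Warn so that retries are visible in logs (a point raised in review).
      System.err.println("Got " + status + "; invalidating leader cache and retrying");
      invalidateCache.run();
      status = send.getAsInt();
      attempts++;
    }
    // Best effort: if failures persist, the failed response goes back to the caller.
    return status;
  }

  public static void main(String[] args)
  {
    // Simulate a stale cached leader: the first two requests return 503,
    // the third (after re-discovery) succeeds.
    AtomicInteger calls = new AtomicInteger();
    int status = goWithRetries(
        () -> calls.incrementAndGet() < 3 ? 503 : 200,
        () -> { /* pretend to refresh the cached leader location */ }
    );
    System.out.println("final status: " + status);  // final status: 200
  }
}
```

Note that in this sketch a non-200 status other than 503/504 (for example a 401) is returned immediately, with no retry or cache invalidation.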

Release note


Key changes in this PR
  • DruidLeaderClient

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@abhishekagarwal87
Contributor

Instead of each caller passing through an acceptable response code, why not just do a retry (along with cache invalidation) on any 5xx? I think that's what happens after your change anyway. We wouldn't want to invalidate the cache if there is a 401. Whether the client needs to find a new leader doesn't seem to be something that a class like LookupReferenceManager should decide. And if it's just retries, we can do that for 503 and 504 no matter where the request comes from.

503 and 504 are more likely to be thrown by the proxy than by the leader service. Is there anything else in the response from the proxy that could be used to detect such a setup, so we could use that for a targeted retry? If not, we can still retry (with cache invalidation for 503/504).

@abhishek-chouhan
Contributor Author

abhishek-chouhan commented Apr 18, 2023

@abhishekagarwal87 Thanks for taking a look at this.

why not just do a retry (along with cache invalidation) on any 5xx?

I was in two minds about this as well. It seemed like giving the callers an option to override this behavior would provide flexibility: if in the future they want to handle a 5xx selectively, without retry, they would have this option. Otherwise there would be no way to handle or expect a 5xx response from the node; internally the client would always treat it as inappropriate and make additional requests before handing it back to the caller. I do agree that it makes things a bit more complicated for classes like LookupReferenceManager.
Let me know what you think. If we feel this would not be the case, we can make changes to retry for 503/504. Or we could make this the default behavior, but then provide another API with a boolean that disables it for a specific call.

@abhishek-chouhan
Contributor Author

@abhishekagarwal87 Made changes so that we do implicit retries for 503/504. Added a boolean flag to disable retries for 503/504 if needed for a particular request.

@abhishekagarwal87
Contributor

@abhishek-chouhan - We can get rid of the function signature with the boolean flag and always retry for 503/504. Anyway, the false flag is not being used anywhere. We can add it later if we ever need it.

@abhishek-chouhan
Contributor Author

Done!

));

request = withUrl(request, redirectUrl);
} else if (HttpResponseStatus.SERVICE_UNAVAILABLE.equals(responseStatus)
Contributor

I missed this, but if there is a retry happening in response to 503/504, there will be no way to see that happening in the logs. We should be logging a warning here. What do you think?

Contributor Author

Makes sense. Added a log.

@gianm
Contributor

gianm commented Apr 20, 2023

A note: there's also the newer ServiceClient and ServiceLocator (with high-level clients like OverlordClient and CoordinatorServiceClient) that I don't think have this problem. They use DruidNodeDiscovery.Listener rather than periodic getAllNodes, so they update continuously as the services come and go.

#12696 introduced these and discusses them in more detail. The PR calls out some issues with DruidLeaderClient:

DruidLeaderClient does retries in its "go" method, but only retries
exactly 5 times, does not sleep between retries, and does not retry
retryable HTTP codes like 502, 503, 504. (It only retries IOExceptions.)
ServiceClient handles retries in a more reasonable way.

DruidLeaderClient's methods are all synchronous, whereas ServiceClient
methods are asynchronous. This is used in one place so far: the
SpecificTaskServiceLocator, so we don't need to block a thread trying
to locate a task. It can be used in other places in the future.

So, we can likely improve further by migrating things to ServiceClient.
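To make the contrast concrete, a backoff-style retry policy in the spirit of what the quote describes for ServiceClient might look like the following. This is a hedged sketch with made-up names and parameters (backoff base, cap, attempt limit are all assumptions), not ServiceClient's actual implementation:

```java
import java.util.function.IntSupplier;

// Hypothetical sketch of retries with sleep/backoff and retryable-status
// handling, in the spirit of the ServiceClient behavior quoted above.
public class BackoffRetrySketch
{
  // 502/503/504 are the retryable HTTP codes called out in the quote.
  static boolean isRetryable(int status)
  {
    return status == 502 || status == 503 || status == 504;
  }

  // Exponential backoff capped at 30 seconds: 100 ms, 200 ms, 400 ms, ...
  static long backoffMillis(int attempt)
  {
    return Math.min(30_000L, 100L << Math.min(attempt, 18));
  }

  static int callWithBackoff(IntSupplier send, int maxAttempts) throws InterruptedException
  {
    int status = send.getAsInt();
    for (int attempt = 0; isRetryable(status) && attempt < maxAttempts; attempt++) {
      // Unlike the DruidLeaderClient "go" method, sleep between retries.
      Thread.sleep(backoffMillis(attempt));
      status = send.getAsInt();
    }
    return status;
  }

  public static void main(String[] args) throws InterruptedException
  {
    // A request that fails once with 502, then succeeds.
    int[] calls = {0};
    int status = callWithBackoff(() -> ++calls[0] < 2 ? 502 : 200, 5);
    System.out.println("final status: " + status);  // final status: 200
  }
}
```

The real ServiceClient is also asynchronous, which a synchronous sketch like this doesn't capture; the point here is only the sleep-between-retries and retryable-status handling that DruidLeaderClient lacked.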

@gianm gianm merged commit 895abd8 into apache:master Apr 20, 2023
@abhishek-chouhan
Contributor Author

Thanks for the reviews and commit @abhishekagarwal87 @gianm !

@abhishekagarwal87
Contributor

Thank you for your contribution @abhishek-chouhan. Looking forward to many more 😉

@abhishekagarwal87 abhishekagarwal87 added this to the 27.0 milestone Jul 19, 2023

Successfully merging this pull request may close these issues.

DruidLeaderClient should refresh cache for non-200 responses
