Refresh DruidLeaderClient cache selectively for non-200 responses #14092

Merged
gianm merged 8 commits into apache:master from abhishek-chouhan:master on Apr 20, 2023

Conversation

@abhishek-chouhan
Contributor

@abhishek-chouhan abhishek-chouhan commented Apr 14, 2023

Fixes #14090.

Description

Introduces retries for non-200 responses, along with cache invalidation. The caller can still control the behavior by passing in the expected responses (in addition to 200), in which case the response is returned without any internal retries. The internal retries with cache invalidation are best-effort; if the failures persist, the failed response is returned to the caller. This keeps the existing behavior of the API as is.
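The retry-with-cache-invalidation flow described above (narrowed during review to 503/504) can be sketched roughly as follows. This is a minimal, self-contained illustration with hypothetical class and method names, not the actual DruidLeaderClient code:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

// Hypothetical sketch of best-effort retry with leader-cache invalidation.
public class LeaderRetrySketch
{
  static final int MAX_RETRIES = 5;

  // `send` simulates one request to the currently cached leader;
  // `invalidateCache` simulates dropping the cached leader location so the
  // next request re-discovers the leader.
  static int goWithRetries(IntSupplier send, Runnable invalidateCache)
  {
    int status = send.getAsInt();
    int attempts = 0;
    while ((status == 503 || status == 504) && attempts < MAX_RETRIES) {
      // Warn so that retries are visible in logs (a point raised in review).
      System.err.println("Got " + status + "; invalidating leader cache and retrying");
      invalidateCache.run();
      status = send.getAsInt();
      attempts++;
    }
    // Best effort: if failures persist, the failed response goes back to the caller.
    return status;
  }

  public static void main(String[] args)
  {
    // Simulate a stale cached leader: the first two requests return 503,
    // the third (after re-discovery) succeeds.
    AtomicInteger calls = new AtomicInteger();
    int status = goWithRetries(
        () -> calls.incrementAndGet() < 3 ? 503 : 200,
        () -> { /* pretend to refresh the cached leader location */ }
    );
    System.out.println("final status: " + status);  // final status: 200
  }
}
```

Note that in this sketch a non-200 status other than 503/504 (for example a 401) is returned immediately, with no retry or cache invalidation.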

Release note


Key changes in this PR
  • DruidLeaderClient

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@abhishekagarwal87
Contributor

Instead of each caller passing through an acceptable response code, why not just do a retry (along with cache invalidation) on any 5xx? I think that's what happens after your change anyway. We wouldn't want to invalidate the cache if there is a 401. Whether the client needs to find a new leader doesn't seem to be something that a class like LookupReferenceManager should decide. And if it's just retries, we can do that for 503 and 504 no matter where the request comes from.

503 and 504 are more likely to be thrown by the proxy than by the leader service. Is there anything else in the response from the proxy that could be used to detect such a setup, so we could use that for a targeted retry? If not, we can still retry (with cache invalidation for 503/504).

@abhishek-chouhan
Contributor Author

abhishek-chouhan commented Apr 18, 2023

@abhishekagarwal87 Thanks for taking a look at this.

why not just do a retry (along with cache invalidation) on any 5xx?

I was in two minds about this as well. It seemed like giving the callers an option to override this behavior would provide flexibility: if in the future they want to handle a 5xx selectively, without retry, they would have this option. Otherwise there would be no way to handle or expect a 5xx response from the node; internally the client would always treat it as inappropriate and make additional requests before handing it back to the caller. I do agree that it makes things a bit more complicated for classes like LookupReferenceManager.
Let me know what you think. If we feel this would not be the case, we can make changes to retry for 503/504. Or we could make this the default behavior, but then provide another API with a boolean that disables it for a specific call.

@abhishek-chouhan
Contributor Author

@abhishekagarwal87 Made changes so that we do implicit retries for 503/504. Added a boolean flag to disable retries for 503/504 if needed for a particular request.

@abhishekagarwal87
Contributor

@abhishek-chouhan - We can get rid of the function signature with the boolean flag and always retry for 503/504. Anyway, the false flag is not being used anywhere. We can add it later if we ever need it.

@abhishek-chouhan
Contributor Author

Done!

));

request = withUrl(request, redirectUrl);
} else if (HttpResponseStatus.SERVICE_UNAVAILABLE.equals(responseStatus)
Contributor

I missed this, but if there is a retry happening in response to 503/504, there will be no way to see that happening in the logs. We should be logging a warning here. What do you think?

Contributor Author

Makes sense. Added a log.

@gianm
Contributor

gianm commented Apr 20, 2023

A note: there's also the newer ServiceClient and ServiceLocator (with high-level clients like OverlordClient and CoordinatorServiceClient) that I don't think have this problem. They use DruidNodeDiscovery.Listener rather than periodic getAllNodes, so they update continuously as the services come and go.

#12696 introduced these and discusses them in more detail. The PR calls out some issues with DruidLeaderClient:

DruidLeaderClient does retries in its "go" method, but only retries
exactly 5 times, does not sleep between retries, and does not retry
retryable HTTP codes like 502, 503, 504. (It only retries IOExceptions.)
ServiceClient handles retries in a more reasonable way.

DruidLeaderClient's methods are all synchronous, whereas ServiceClient
methods are asynchronous. This is used in one place so far: the
SpecificTaskServiceLocator, so we don't need to block a thread trying
to locate a task. It can be used in other places in the future.

So, we can likely improve further by migrating things to ServiceClient.
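To make the contrast concrete, a backoff-style retry policy in the spirit of what the quote describes for ServiceClient might look like the following. This is a hedged sketch with made-up names and parameters (backoff base, cap, attempt limit are all assumptions), not ServiceClient's actual implementation:

```java
import java.util.function.IntSupplier;

// Hypothetical sketch of retries with sleep/backoff and retryable-status
// handling, in the spirit of the ServiceClient behavior quoted above.
public class BackoffRetrySketch
{
  // 502/503/504 are the retryable HTTP codes called out in the quote.
  static boolean isRetryable(int status)
  {
    return status == 502 || status == 503 || status == 504;
  }

  // Exponential backoff capped at 30 seconds: 100 ms, 200 ms, 400 ms, ...
  static long backoffMillis(int attempt)
  {
    return Math.min(30_000L, 100L << Math.min(attempt, 18));
  }

  static int callWithBackoff(IntSupplier send, int maxAttempts) throws InterruptedException
  {
    int status = send.getAsInt();
    for (int attempt = 0; isRetryable(status) && attempt < maxAttempts; attempt++) {
      // Unlike the DruidLeaderClient "go" method, sleep between retries.
      Thread.sleep(backoffMillis(attempt));
      status = send.getAsInt();
    }
    return status;
  }

  public static void main(String[] args) throws InterruptedException
  {
    // A request that fails once with 502, then succeeds.
    int[] calls = {0};
    int status = callWithBackoff(() -> ++calls[0] < 2 ? 502 : 200, 5);
    System.out.println("final status: " + status);  // final status: 200
  }
}
```

The real ServiceClient is also asynchronous, which a synchronous sketch like this doesn't capture; the point here is only the sleep-between-retries and retryable-status handling that DruidLeaderClient lacked.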

@gianm gianm merged commit 895abd8 into apache:master Apr 20, 2023
@abhishek-chouhan
Contributor Author

Thanks for the reviews and commit @abhishekagarwal87 @gianm !

@abhishekagarwal87
Contributor

Thank you for your contribution @abhishek-chouhan. Looking forward to many more 😉

@abhishekagarwal87 abhishekagarwal87 added this to the 27.0 milestone Jul 19, 2023

Successfully merging this pull request may close these issues.

DruidLeaderClient should refresh cache for non-200 responses
