
Mid-level service client and updated high-level clients. #12696

Merged
gianm merged 7 commits into apache:master from gianm:rpc-service-client
Jul 5, 2022

Conversation


@gianm gianm commented Jun 23, 2022

Our servers talk to each other over HTTP. We have a low-level HTTP
client (HttpClient) that is super-asynchronous and super-customizable
through its handlers. It's also proven to be quite robust: we use it
for Broker -> Historical communication over the wide variety of query
types and workloads we support.

But the low-level client has no facilities for service location or
retries, which means we have a variety of high-level clients that
implement these in their own ways. Some high-level clients do a better
job than others. This patch adds a mid-level ServiceClient that makes
it easier for high-level clients to be built correctly and harmoniously,
and migrates some of the high-level logic to use ServiceClients.

Main changes:

  1. Add ServiceClient to the org.apache.druid.rpc package. That package
    also contains supporting pieces such as the ServiceLocator and RetryPolicy
    interfaces, and a DiscoveryServiceLocator based on
    DruidNodeDiscoveryProvider.

  2. Add high-level OverlordClient in org.apache.druid.rpc.indexing.

  3. Add an indexing task client creator in TaskServiceClients. It uses
    SpecificTaskServiceLocator to locate tasks. This improves on
    ClientInfoTaskProvider by caching task locations for up to 30 seconds
    across calls, reducing load on the Overlord.

  4. Rework ParallelIndexSupervisorTaskClient to use a ServiceClient
    instead of extending IndexTaskClient.

  5. Rework RemoteTaskActionClient to use a ServiceClient instead of
    DruidLeaderClient.

  6. Rework LocalIntermediaryDataManager, TaskMonitor, and
    ParallelIndexSupervisorTask. As a result, MiddleManager, Peon, and
    Overlord no longer need IndexingServiceClient (which internally used
    DruidLeaderClient).
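The task-location caching in item 3 can be sketched roughly as follows. This is a hedged illustration only: the class and method names here (TtlLocationCache, locate) are hypothetical and are not the actual SpecificTaskServiceLocator code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch: cache each task's last-known location for a TTL
// (30 seconds in the description above), so repeated lookups within the
// window skip the round trip to the Overlord.
class TtlLocationCache
{
  private static final class Entry
  {
    final String location;
    final long fetchedAtMillis;

    Entry(String location, long fetchedAtMillis)
    {
      this.location = location;
      this.fetchedAtMillis = fetchedAtMillis;
    }
  }

  private final long ttlMillis;
  private final LongSupplier clock; // injectable so tests can control time
  private final Map<String, Entry> cache = new ConcurrentHashMap<>();

  TtlLocationCache(long ttlMillis, LongSupplier clock)
  {
    this.ttlMillis = ttlMillis;
    this.clock = clock;
  }

  /** Returns the cached location if still fresh; otherwise fetches and re-caches. */
  String locate(String taskId, Supplier<String> fetchFromOverlord)
  {
    final long now = clock.getAsLong();
    Entry entry = cache.get(taskId);
    if (entry == null || now - entry.fetchedAtMillis >= ttlMillis) {
      entry = new Entry(fetchFromOverlord.get(), now);
      cache.put(taskId, entry);
    }
    return entry.location;
  }
}
```

Within the TTL window, repeated calls for the same task return the cached location without invoking the fetcher; after the window expires, the next call refreshes it.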

There are some concrete benefits over the prior logic, namely:

  • DruidLeaderClient does retries in its "go" method, but only retries
    exactly 5 times, does not sleep between retries, and does not retry
    retryable HTTP codes like 502, 503, 504. (It only retries IOExceptions.)
    ServiceClient handles retries in a more reasonable way.

  • DruidLeaderClient's methods are all synchronous, whereas ServiceClient
    methods are asynchronous. This is used in one place so far: the
    SpecificTaskServiceLocator, so we don't need to block a thread trying
    to locate a task. It can be used in other places in the future.

  • HttpIndexingServiceClient does not properly handle all server errors.
    In some cases, it tries to parse a server error as a successful
    response (for example: in getTaskStatus).

  • IndexTaskClient currently makes an Overlord call on every task-to-task
    HTTP request, as a way to find where the target task is. ServiceClient,
    through SpecificTaskServiceLocator, caches these target locations
    for a period of time.
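The first bullet can be made concrete with a small sketch. The names here (SimpleRetryPolicy and its methods) are hypothetical, not Druid's actual RetryPolicy interface; the point is the behavior the old DruidLeaderClient lacked — treating 502/503/504 as retryable and sleeping with backoff between attempts.

```java
// Hedged sketch of a retry policy: retryable HTTP codes, exponential
// backoff between attempts, and a cap on total attempts.
class SimpleRetryPolicy
{
  private final int maxAttempts;
  private final long baseSleepMillis;
  private final long maxSleepMillis;

  SimpleRetryPolicy(int maxAttempts, long baseSleepMillis, long maxSleepMillis)
  {
    this.maxAttempts = maxAttempts;
    this.baseSleepMillis = baseSleepMillis;
    this.maxSleepMillis = maxSleepMillis;
  }

  /** HTTP codes that indicate a transient condition worth retrying. */
  boolean isRetryableHttpCode(int code)
  {
    return code == 502 || code == 503 || code == 504;
  }

  /** attempt is 1-based; returns base * 2^(attempt-1), capped at maxSleepMillis. */
  long sleepMillis(int attempt)
  {
    final long uncapped = baseSleepMillis << Math.min(attempt - 1, 30);
    return Math.min(uncapped, maxSleepMillis);
  }

  boolean shouldRetry(int attempt, int httpCode)
  {
    return attempt < maxAttempts && isRetryableHttpCode(httpCode);
  }
}
```

A caller loops: issue the request, and on a retryable code sleep for sleepMillis(attempt) before trying again, until shouldRetry returns false.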

@samarthjain samarthjain left a comment


Cursory glance at first. Will dig deeper in a bit.

if (cancelIfInterrupted) {
  future.cancel(true);
}

Contributor


Interrupt status probably needs to be set again by calling Thread.currentThread().interrupt()

Contributor Author


The InterruptedException is re-thrown here, so it's OK that we don't set the flag. According to https://docs.oracle.com/javase/tutorial/essential/concurrency/interrupt.html it is preferred not to set the interrupt flag when throwing InterruptedException.

  return FutureUtils.get(future, cancelIfInterrupted);
}
catch (InterruptedException e) {
  throw new RuntimeException(e);
Contributor


Interrupt status probably needs to be set again by calling Thread.currentThread().interrupt()

Contributor Author


Good call, I added a line to set the flag here.
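The pattern the reviewer is asking for can be illustrated as follows. This is a generic sketch, not the actual Druid FutureUtils code: when InterruptedException is wrapped in an unchecked exception instead of being re-thrown, the interrupt flag must be restored first so callers can still observe the interruption.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Illustrative helper (hypothetical name): unchecked get that preserves
// the thread's interrupt status when it swallows the checked exception.
class UncheckedFutures
{
  static <T> T getUnchecked(Future<T> future)
  {
    try {
      return future.get();
    }
    catch (InterruptedException e) {
      // We are not re-throwing the checked exception, so restore the flag.
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    }
    catch (ExecutionException e) {
      throw new RuntimeException(e.getCause());
    }
  }
}
```

This contrasts with the earlier case in this thread, where the InterruptedException itself is re-thrown and the flag should be left unset.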


@maytasm maytasm left a comment


LGTM. Reviewed on a high level + key changed/added classes


@cryptoe cryptoe left a comment


Minor comments. LGTM otherwise.

throw new RuntimeException(e);
}
}
void report(SubTaskReport report);

@cryptoe cryptoe Jul 4, 2022


Super Nit: as this is an interface, IMHO we should add documentation to this method

Contributor Author


Sorry, I missed this comment prior to committing. I agree. However, the old method didn't have a javadoc either, so I don't think I made things worse.

/**
* Production implementation of {@link ServiceClient}.
*/
public class ServiceClientImpl implements ServiceClient
Contributor


Should we implement this as an AutoCloseable?
In case of a close/shutdown or an error condition:

  1. we should not schedule new async requests (infinite retry case)
  2. wait for existing requests to either time out or abort them. I guess that would depend on whether the shutdown is graceful or not.

Contributor Author


Sorry, I missed this comment prior to committing. In the current design, the way you do this is to close the ServiceLocator. The ServiceLocator is marked Closeable, and once it's closed, any ServiceClients using it will stop doing retries and stop allowing new requests. The ServiceClient is meant to be stateless.
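The lifecycle split described here — a stateless client consulting a Closeable locator — can be sketched roughly as below. The class and method names (ClosableTaskLocator, locate, setLocation) are hypothetical, not the actual ServiceLocator interface.

```java
import java.io.Closeable;
import java.util.Optional;

// Simplified sketch: lifecycle lives in the locator. Once it is closed,
// it stops reporting locations, and a client consulting it can stop
// retrying and reject new requests.
class ClosableTaskLocator implements Closeable
{
  private volatile String location = null;
  private volatile boolean closed = false;

  void setLocation(String loc)
  {
    location = loc;
  }

  boolean isClosed()
  {
    return closed;
  }

  /** Empty means "unknown right now"; callers check isClosed() to decide whether to keep retrying. */
  Optional<String> locate()
  {
    return closed ? Optional.empty() : Optional.ofNullable(location);
  }

  @Override
  public void close()
  {
    closed = true;
  }
}
```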

@gianm gianm merged commit 2b33018 into apache:master Jul 5, 2022
@gianm gianm deleted the rpc-service-client branch July 5, 2022 16:43
@abhishekagarwal87 abhishekagarwal87 added this to the 24.0.0 milestone Aug 26, 2022
gianm added a commit to gianm/druid that referenced this pull request Jul 13, 2023
Continuing the work from apache#12696, this patch removes HttpIndexingServiceClient
and the IndexingService flavor of DruidLeaderClient completely. All remaining
usages are migrated to OverlordClient.

Supporting changes include:

1) Add a variety of methods to OverlordClient.

2) Update MetadataTaskStorage to skip the complete-task lookup when
   the caller requests zero completed tasks. This helps performance of
   the "get active tasks" APIs, which don't want to see complete ones.
gianm added a commit that referenced this pull request Jul 25, 2023
* Use OverlordClient for all Overlord RPCs.

Continuing the work from #12696, this patch removes HttpIndexingServiceClient
and the IndexingService flavor of DruidLeaderClient completely. All remaining
usages are migrated to OverlordClient.

Supporting changes include:

1) Add a variety of methods to OverlordClient.

2) Update MetadataTaskStorage to skip the complete-task lookup when
   the caller requests zero completed tasks. This helps performance of
   the "get active tasks" APIs, which don't want to see complete ones.

* Use less forbidden APIs.

* Fixes from CI.

* Add test coverage.

* Two more tests.

* Fix test.

* Updates from CR.

* Remove unthrown exceptions.

* Refactor to improve testability and test coverage.

* Add isNil tests.

* Remove unnecessary "deserialize" methods.
gianm added a commit to gianm/druid that referenced this pull request Jul 25, 2023
Continuing the work from apache#12696, this patch merges the MSQ
CoordinatorServiceClient into the core CoordinatorClient, yielding a single
interface that serves both needs and is based on the ServiceClient RPC
system rather than DruidLeaderClient.

Also removes the backwards-compatibility code for the handoff API in
CoordinatorBasedSegmentHandoffNotifier, because the new API was added
in 0.14.0. That's long enough ago that we don't need backwards
compatibility for rolling updates.
gianm added a commit that referenced this pull request Jul 27, 2023
* Merge core CoordinatorClient with MSQ CoordinatorServiceClient.

Continuing the work from #12696, this patch merges the MSQ
CoordinatorServiceClient into the core CoordinatorClient, yielding a single
interface that serves both needs and is based on the ServiceClient RPC
system rather than DruidLeaderClient.

Also removes the backwards-compatibility code for the handoff API in
CoordinatorBasedSegmentHandoffNotifier, because the new API was added
in 0.14.0. That's long enough ago that we don't need backwards
compatibility for rolling updates.

* Fixups.

* Trigger GHA.

* Remove unnecessary retrying in DruidInputSource. Add "about an hour"
retry policy and h

* EasyMock
gianm added a commit to gianm/druid that referenced this pull request Jul 28, 2023
Continuing the work from apache#12696. Also in this patch:

1) Extract the nice request builder stuff from the SeekableStream async
   client, and put it in a more broadly useful ServiceCallBuilder.

2) Slight behavior change to ServiceClient#request; switch to unchecked
   get instead of checked get. Unchecked get is more commonly wanted.
   Callers can get checked get if they need it by using asyncRequest.
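The checked-vs-unchecked split described in item 2 can be sketched generically. The names here (TwoFlavorClient, request, asyncRequest) are hypothetical stand-ins, not the actual ServiceClient API: the async method hands back the future so callers who want checked exceptions can call get() themselves, while the blocking method does an unchecked get for the common case.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hedged sketch of two request flavors built on one async core.
class TwoFlavorClient
{
  private final ExecutorService exec = Executors.newSingleThreadExecutor();

  /** Async flavor: callers hold the future and may use checked get() on it. */
  Future<String> asyncRequest(Callable<String> call)
  {
    return exec.submit(call);
  }

  /** Blocking flavor: unchecked get, so callers need not handle checked exceptions. */
  String request(Callable<String> call)
  {
    try {
      return asyncRequest(call).get();
    }
    catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    }
    catch (ExecutionException e) {
      throw new RuntimeException(e.getCause());
    }
  }

  void shutdown()
  {
    exec.shutdown();
  }
}
```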