Mid-level service client and updated high-level clients (#12696)
gianm merged 7 commits into apache:master
Conversation
Our servers talk to each other over HTTP. We have a low-level HTTP client (HttpClient) that is super-asynchronous and super-customizable through its handlers. It's also proven to be quite robust: we use it for Broker -> Historical communication over the wide variety of query types and workloads we support. But the low-level client has no facilities for service location or retries, which means we have a variety of high-level clients that implement these in their own ways. Some high-level clients do a better job than others. This patch adds a mid-level ServiceClient that makes it easier for high-level clients to be built correctly and harmoniously, and migrates some of the high-level logic to use ServiceClients.

Main changes:

1) Add ServiceClient in the org.apache.druid.rpc package. That package also contains supporting pieces like the ServiceLocator and RetryPolicy interfaces, and a DiscoveryServiceLocator based on DruidNodeDiscoveryProvider.
2) Add a high-level OverlordClient in org.apache.druid.rpc.indexing.
3) Add an indexing task client creator in TaskServiceClients. It uses SpecificTaskServiceLocator to find the tasks. This improves on ClientInfoTaskProvider by caching task locations for up to 30 seconds across calls, reducing load on the Overlord.
4) Rework ParallelIndexSupervisorTaskClient to use a ServiceClient instead of extending IndexTaskClient.
5) Rework RemoteTaskActionClient to use a ServiceClient instead of DruidLeaderClient.
6) Rework LocalIntermediaryDataManager, TaskMonitor, and ParallelIndexSupervisorTask. As a result, MiddleManager, Peon, and Overlord no longer need IndexingServiceClient (which internally used DruidLeaderClient).

There are some concrete benefits over the prior logic, namely:

- DruidLeaderClient does retries in its "go" method, but retries exactly 5 times, does not sleep between retries, and does not retry retryable HTTP codes like 502, 503, and 504 (it only retries IOExceptions). ServiceClient handles retries in a more reasonable way.
- DruidLeaderClient's methods are all synchronous, whereas ServiceClient methods are asynchronous. This is used in one place so far, SpecificTaskServiceLocator, so we don't need to block a thread trying to locate a task. It can be used in other places in the future.
- HttpIndexingServiceClient does not properly handle all server errors. In some cases it tries to parse a server error as a successful response (for example, in getTaskStatus).
- IndexTaskClient currently makes an Overlord call on every task-to-task HTTP request as a way to find where the target task is. ServiceClient, through SpecificTaskServiceLocator, caches these target locations for a period of time.
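The retry-behavior contrast described above can be sketched roughly as follows. This is an illustrative sketch, not Druid's actual RetryPolicy API: the class and method names are invented, but the idea matches the description, i.e. treat 502, 503, and 504 as retryable and sleep with capped exponential backoff between attempts rather than retrying a fixed 5 times with no sleep.

```java
import java.util.Set;

// Hypothetical sketch of the retry behavior described above; names are
// invented for illustration and are not Druid's actual API.
public class RetrySketch {
  // Gateway/availability errors that are generally safe to retry.
  private static final Set<Integer> RETRYABLE = Set.of(502, 503, 504);

  /** Whether a response with this HTTP status should be retried. */
  public static boolean isRetryable(int httpStatus) {
    return RETRYABLE.contains(httpStatus);
  }

  /** Backoff before attempt n (0-based): exponential, capped at 30 seconds. */
  public static long backoffMillis(int attempt) {
    return Math.min(30_000L, 100L * (1L << Math.min(attempt, 16)));
  }
}
```

A caller would loop: issue the request, check `isRetryable` on failure, sleep `backoffMillis(attempt)`, and try again until an attempt budget is exhausted.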
samarthjain
left a comment
Cursory glance at first. Will dig deeper in a bit.
if (cancelIfInterrupted) {
  future.cancel(true);
}
Interrupt status probably needs to be set again by calling Thread.currentThread().interrupt()
The InterruptedException is re-thrown here, so it's ok that we don't set the flag. According to https://docs.oracle.com/javase/tutorial/essential/concurrency/interrupt.html it is preferred to not set the interrupt flag when throwing InterruptedException.
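The convention from that tutorial can be illustrated with a small sketch (a hypothetical helper, not the actual FutureUtils code): when InterruptedException is propagated as a checked exception, the interrupt flag is deliberately left clear, because the exception itself tells the caller about the interruption.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Illustrative only: when InterruptedException propagates to the caller as a
// checked exception, there is no need to re-set the interrupt flag; the
// exception itself communicates the interruption.
public class GetWithCancel {
  public static <T> T get(Future<T> future, boolean cancelIfInterrupted)
      throws InterruptedException, ExecutionException {
    try {
      return future.get();
    }
    catch (InterruptedException e) {
      if (cancelIfInterrupted) {
        future.cancel(true);
      }
      throw e; // propagate; do not call Thread.currentThread().interrupt()
    }
  }
}
```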
  return FutureUtils.get(future, cancelIfInterrupted);
}
catch (InterruptedException e) {
  throw new RuntimeException(e);
Interrupt status probably needs to be set again by calling Thread.currentThread().interrupt()
Good call, I added a line to set the flag here.
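The fix agreed on here follows the standard wrap-and-restore idiom. A minimal sketch (illustrative names, not the actual Druid code): when InterruptedException is swallowed into an unchecked exception instead of being rethrown, the interrupt flag must be restored so callers up the stack can still observe it.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Sketch of the pattern discussed above, with invented names: wrapping
// InterruptedException in an unchecked exception hides the interruption,
// so the flag is restored first.
public class UncheckedGet {
  public static <T> T getUnchecked(Future<T> future) {
    try {
      return future.get();
    }
    catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // restore the flag before wrapping
      throw new RuntimeException(e);
    }
    catch (ExecutionException e) {
      throw new RuntimeException(e.getCause());
    }
  }
}
```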
maytasm
left a comment
LGTM. Reviewed on a high level + key changed/added classes
cryptoe
left a comment
Minor comments. LGTM otherwise.
  throw new RuntimeException(e);
}
}

void report(SubTaskReport report);
Super Nit: as this is an interface, IMHO we should add documentation to this method
Sorry, I missed this comment prior to committing. I agree. However, the old method didn't have a javadoc either, so I don't think I made things worse.
/**
 * Production implementation of {@link ServiceClient}.
 */
public class ServiceClientImpl implements ServiceClient
Should we make this AutoCloseable? In case of a close/shutdown or an error condition:
- we should not schedule new async requests (the infinite-retry case)
- we should wait for existing requests to time out, or abort them; I guess that would depend on whether the shutdown is graceful or not.
Sorry, I missed this comment prior to committing. In the current design, the way you do this is to close the ServiceLocator. The ServiceLocator is marked Closeable, and once it's closed, any ServiceClients using it will stop doing retries and stop allowing new requests. The ServiceClient is meant to be stateless.
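That design can be sketched as follows, using a simplified, hypothetical locator rather than Druid's actual ServiceLocator interface: lifecycle state lives in the locator, and once it is closed it reports no locations, which a stateless client can interpret as "stop retrying and reject new requests".

```java
import java.io.Closeable;
import java.util.Set;

// Simplified sketch of the design described above (invented names): the
// locator owns lifecycle state; a stateless client checks it before each
// attempt, so closing the locator halts retries and new requests.
public class LocatorSketch implements Closeable {
  private final Set<String> locations;
  private volatile boolean closed = false;

  public LocatorSketch(Set<String> locations) {
    this.locations = locations;
  }

  /** Current service locations; empty once closed, which halts retries. */
  public Set<String> locate() {
    return closed ? Set.of() : locations;
  }

  public boolean isClosed() {
    return closed;
  }

  @Override
  public void close() {
    closed = true;
  }
}
```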
Use OverlordClient for all Overlord RPCs. Continuing the work from apache#12696, this patch removes HttpIndexingServiceClient and the IndexingService flavor of DruidLeaderClient completely. All remaining usages are migrated to OverlordClient. Supporting changes include:

1) Add a variety of methods to OverlordClient.
2) Update MetadataTaskStorage to skip the complete-task lookup when the caller requests zero completed tasks. This helps performance of the "get active tasks" APIs, which don't want to see complete ones.

Follow-up commits:
* Use less forbidden APIs.
* Fixes from CI.
* Add test coverage.
* Two more tests.
* Fix test.
* Updates from CR.
* Remove unthrown exceptions.
* Refactor to improve testability and test coverage.
* Add isNil tests.
* Remove unnecessary "deserialize" methods.
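The MetadataTaskStorage change in point 2 boils down to a short-circuit guard. A hedged sketch with invented names (the real method deals in task-lookup specs, not a plain int): when the caller asks for zero completed tasks, the metadata-store query is never issued at all.

```java
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical sketch of the optimization in point 2: skip the potentially
// expensive completed-task lookup entirely when zero completed tasks are
// requested. Names are illustrative, not MetadataTaskStorage's API.
public class TaskLookupSketch {
  public static List<String> getCompletedTasks(
      int maxCompletedTasks,
      IntFunction<List<String>> completedTaskQuery // e.g. a metadata-store query
  ) {
    if (maxCompletedTasks == 0) {
      return List.of(); // short-circuit: no metadata-store query at all
    }
    return completedTaskQuery.apply(maxCompletedTasks);
  }
}
```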
Merge core CoordinatorClient with MSQ CoordinatorServiceClient. Continuing the work from apache#12696, this patch merges the MSQ CoordinatorServiceClient into the core CoordinatorClient, yielding a single interface that serves both needs and is based on the ServiceClient RPC system rather than DruidLeaderClient. Also removes the backwards-compatibility code for the handoff API in CoordinatorBasedSegmentHandoffNotifier, because the new API was added in 0.14.0. That's long enough ago that we don't need backwards compatibility for rolling updates.

Follow-up commits:
* Fixups.
* Trigger GHA.
* Remove unnecessary retrying in DruidInputSource. Add "about an hour" retry policy and h
* EasyMock
Continuing the work from apache#12696. Also in this patch:

1) Extract the nice request-builder functionality from the SeekableStream async client and put it in a more broadly useful ServiceCallBuilder.
2) Slight behavior change to ServiceClient#request: switch to unchecked get instead of checked get. Unchecked get is more commonly wanted. Callers can still get checked get if they need it by using asyncRequest.
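The checked-vs-unchecked split in point 2 can be sketched like this (illustrative names and a stubbed response; the real ServiceClient deals in request objects and response handlers): the blocking method throws only unchecked exceptions, which most callers want, while the async method hands back the future so callers who need checked get, timeouts, or composition still have them.

```java
import java.util.concurrent.CompletableFuture;

// Sketch of the API shape described above, with invented names.
public class ClientSketch {
  /** Async variant: callers can use future.get() (checked) if they want. */
  public CompletableFuture<String> asyncRequest(String path) {
    return CompletableFuture.completedFuture("response:" + path); // stub
  }

  /** Blocking variant: join() throws unchecked CompletionException on failure. */
  public String request(String path) {
    return asyncRequest(path).join();
  }
}
```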