
Mid-level service client and updated high-level clients. #12696

Merged
gianm merged 7 commits into apache:master from gianm:rpc-service-client
Jul 5, 2022

Conversation


@gianm gianm commented Jun 23, 2022

Our servers talk to each other over HTTP. We have a low-level HTTP
client (HttpClient) that is super-asynchronous and super-customizable
through its handlers. It's also proven to be quite robust: we use it
for Broker -> Historical communication over the wide variety of query
types and workloads we support.

But the low-level client has no facilities for service location or
retries, which means we have a variety of high-level clients that
implement these in their own ways. Some high-level clients do a better
job than others. This patch adds a mid-level ServiceClient that makes
it easier for high-level clients to be built correctly and harmoniously,
and migrates some of the high-level logic to use ServiceClients.

Main changes:

  1. Add ServiceClient to the org.apache.druid.rpc package. That package
    also contains supporting pieces such as the ServiceLocator and RetryPolicy
    interfaces, and a DiscoveryServiceLocator based on
    DruidNodeDiscoveryProvider.

  2. Add high-level OverlordClient in org.apache.druid.rpc.indexing.

  3. Add an indexing task client creator in TaskServiceClients. It uses
    SpecificTaskServiceLocator to locate tasks. This improves on
    ClientInfoTaskProvider by caching task locations for up to 30 seconds
    across calls, reducing load on the Overlord.

  4. Rework ParallelIndexSupervisorTaskClient to use a ServiceClient
    instead of extending IndexTaskClient.

  5. Rework RemoteTaskActionClient to use a ServiceClient instead of
    DruidLeaderClient.

  6. Rework LocalIntermediaryDataManager, TaskMonitor, and
    ParallelIndexSupervisorTask. As a result, MiddleManager, Peon, and
    Overlord no longer need IndexingServiceClient (which internally used
    DruidLeaderClient).
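The task-location caching in item 3 can be sketched roughly as follows. This is a hedged illustration only: the class and method names here (TtlLocationCache, locate) are hypothetical and are not the actual SpecificTaskServiceLocator code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch: cache each task's last-known location for a TTL
// (30 seconds in the description above), so repeated lookups within the
// window skip the round trip to the Overlord.
class TtlLocationCache
{
  private static final class Entry
  {
    final String location;
    final long fetchedAtMillis;

    Entry(String location, long fetchedAtMillis)
    {
      this.location = location;
      this.fetchedAtMillis = fetchedAtMillis;
    }
  }

  private final long ttlMillis;
  private final LongSupplier clock; // injectable so tests can control time
  private final Map<String, Entry> cache = new ConcurrentHashMap<>();

  TtlLocationCache(long ttlMillis, LongSupplier clock)
  {
    this.ttlMillis = ttlMillis;
    this.clock = clock;
  }

  /** Returns the cached location if still fresh; otherwise fetches and re-caches. */
  String locate(String taskId, Supplier<String> fetchFromOverlord)
  {
    final long now = clock.getAsLong();
    Entry entry = cache.get(taskId);
    if (entry == null || now - entry.fetchedAtMillis >= ttlMillis) {
      entry = new Entry(fetchFromOverlord.get(), now);
      cache.put(taskId, entry);
    }
    return entry.location;
  }
}
```

Within the TTL window, repeated calls for the same task return the cached location without invoking the fetcher; after the window expires, the next call refreshes it.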

There are some concrete benefits over the prior logic, namely:

  • DruidLeaderClient does retries in its "go" method, but only retries
    exactly 5 times, does not sleep between retries, and does not retry
    retryable HTTP codes like 502, 503, 504. (It only retries IOExceptions.)
    ServiceClient handles retries in a more reasonable way.

  • DruidLeaderClient's methods are all synchronous, whereas ServiceClient
    methods are asynchronous. This is used in one place so far: the
    SpecificTaskServiceLocator, so we don't need to block a thread trying
    to locate a task. It can be used in other places in the future.

  • HttpIndexingServiceClient does not properly handle all server errors.
    In some cases, it tries to parse a server error as a successful
    response (for example: in getTaskStatus).

  • IndexTaskClient currently makes an Overlord call on every task-to-task
    HTTP request, as a way to find where the target task is. ServiceClient,
    through SpecificTaskServiceLocator, caches these target locations
    for a period of time.
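The first bullet can be made concrete with a small sketch. The names here (SimpleRetryPolicy and its methods) are hypothetical, not Druid's actual RetryPolicy interface; the point is the behavior the old DruidLeaderClient lacked — treating 502/503/504 as retryable and sleeping with backoff between attempts.

```java
// Hedged sketch of a retry policy: retryable HTTP codes, exponential
// backoff between attempts, and a cap on total attempts.
class SimpleRetryPolicy
{
  private final int maxAttempts;
  private final long baseSleepMillis;
  private final long maxSleepMillis;

  SimpleRetryPolicy(int maxAttempts, long baseSleepMillis, long maxSleepMillis)
  {
    this.maxAttempts = maxAttempts;
    this.baseSleepMillis = baseSleepMillis;
    this.maxSleepMillis = maxSleepMillis;
  }

  /** HTTP codes that indicate a transient condition worth retrying. */
  boolean isRetryableHttpCode(int code)
  {
    return code == 502 || code == 503 || code == 504;
  }

  /** attempt is 1-based; returns base * 2^(attempt-1), capped at maxSleepMillis. */
  long sleepMillis(int attempt)
  {
    final long uncapped = baseSleepMillis << Math.min(attempt - 1, 30);
    return Math.min(uncapped, maxSleepMillis);
  }

  boolean shouldRetry(int attempt, int httpCode)
  {
    return attempt < maxAttempts && isRetryableHttpCode(httpCode);
  }
}
```

A caller loops: issue the request, and on a retryable code sleep for sleepMillis(attempt) before trying again, until shouldRetry returns false.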

@samarthjain samarthjain left a comment


Cursory glance at first. Will dig deeper in a bit.

if (cancelIfInterrupted) {
  future.cancel(true);
}

Contributor


Interrupt status probably needs to be set again by calling Thread.currentThread().interrupt()

Contributor Author


The InterruptedException is re-thrown here, so it's OK that we don't set the flag. According to https://docs.oracle.com/javase/tutorial/essential/concurrency/interrupt.html it is preferred not to set the interrupt flag when throwing InterruptedException.

  return FutureUtils.get(future, cancelIfInterrupted);
}
catch (InterruptedException e) {
  throw new RuntimeException(e);
Contributor


Interrupt status probably needs to be set again by calling Thread.currentThread().interrupt()

Contributor Author


Good call, I added a line to set the flag here.
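The pattern the reviewer is asking for can be illustrated as follows. This is a generic sketch, not the actual Druid FutureUtils code: when InterruptedException is wrapped in an unchecked exception instead of being re-thrown, the interrupt flag must be restored first so callers can still observe the interruption.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Illustrative helper (hypothetical name): unchecked get that preserves
// the thread's interrupt status when it swallows the checked exception.
class UncheckedFutures
{
  static <T> T getUnchecked(Future<T> future)
  {
    try {
      return future.get();
    }
    catch (InterruptedException e) {
      // We are not re-throwing the checked exception, so restore the flag.
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    }
    catch (ExecutionException e) {
      throw new RuntimeException(e.getCause());
    }
  }
}
```

This contrasts with the earlier case in this thread, where the InterruptedException itself is re-thrown and the flag should be left unset.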


@maytasm maytasm left a comment


LGTM. Reviewed on a high level + key changed/added classes


@cryptoe cryptoe left a comment


Minor comments. LGTM otherwise.

throw new RuntimeException(e);
}
}
void report(SubTaskReport report);

@cryptoe cryptoe Jul 4, 2022


Super Nit: as this is an interface, IMHO we should add documentation to this method

Contributor Author


Sorry, I missed this comment prior to committing. I agree. However, the old method didn't have a javadoc either, so I don't think I made things worse.

/**
* Production implementation of {@link ServiceClient}.
*/
public class ServiceClientImpl implements ServiceClient
Contributor


Should we implement this as an AutoCloseable?
In case of a close/shutdown or an error condition:

  1. we should not schedule new async requests (infinite retry case)
  2. wait for existing requests to either time out or abort them. I guess that would depend on whether the shutdown is graceful or not.

Contributor Author


Sorry, I missed this comment prior to committing. In the current design, the way you do this is to close the ServiceLocator. The ServiceLocator is marked Closeable, and once it's closed, any ServiceClients using it will stop doing retries and stop allowing new requests. The ServiceClient is meant to be stateless.
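The lifecycle split described here — a stateless client consulting a Closeable locator — can be sketched roughly as below. The class and method names (ClosableTaskLocator, locate, setLocation) are hypothetical, not the actual ServiceLocator interface.

```java
import java.io.Closeable;
import java.util.Optional;

// Simplified sketch: lifecycle lives in the locator. Once it is closed,
// it stops reporting locations, and a client consulting it can stop
// retrying and reject new requests.
class ClosableTaskLocator implements Closeable
{
  private volatile String location = null;
  private volatile boolean closed = false;

  void setLocation(String loc)
  {
    location = loc;
  }

  boolean isClosed()
  {
    return closed;
  }

  /** Empty means "unknown right now"; callers check isClosed() to decide whether to keep retrying. */
  Optional<String> locate()
  {
    return closed ? Optional.empty() : Optional.ofNullable(location);
  }

  @Override
  public void close()
  {
    closed = true;
  }
}
```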

@gianm gianm merged commit 2b33018 into apache:master Jul 5, 2022
@gianm gianm deleted the rpc-service-client branch July 5, 2022 16:43
@abhishekagarwal87 abhishekagarwal87 added this to the 24.0.0 milestone Aug 26, 2022
gianm added a commit to gianm/druid that referenced this pull request Jul 13, 2023
Continuing the work from apache#12696, this patch removes HttpIndexingServiceClient
and the IndexingService flavor of DruidLeaderClient completely. All remaining
usages are migrated to OverlordClient.

Supporting changes include:

1) Add a variety of methods to OverlordClient.

2) Update MetadataTaskStorage to skip the complete-task lookup when
   the caller requests zero completed tasks. This helps performance of
   the "get active tasks" APIs, which don't want to see complete ones.
gianm added a commit that referenced this pull request Jul 25, 2023
* Use OverlordClient for all Overlord RPCs.

Continuing the work from #12696, this patch removes HttpIndexingServiceClient
and the IndexingService flavor of DruidLeaderClient completely. All remaining
usages are migrated to OverlordClient.

Supporting changes include:

1) Add a variety of methods to OverlordClient.

2) Update MetadataTaskStorage to skip the complete-task lookup when
   the caller requests zero completed tasks. This helps performance of
   the "get active tasks" APIs, which don't want to see complete ones.

* Use less forbidden APIs.

* Fixes from CI.

* Add test coverage.

* Two more tests.

* Fix test.

* Updates from CR.

* Remove unthrown exceptions.

* Refactor to improve testability and test coverage.

* Add isNil tests.

* Remove unnecessary "deserialize" methods.
gianm added a commit to gianm/druid that referenced this pull request Jul 25, 2023
Continuing the work from apache#12696, this patch merges the MSQ
CoordinatorServiceClient into the core CoordinatorClient, yielding a single
interface that serves both needs and is based on the ServiceClient RPC
system rather than DruidLeaderClient.

Also removes the backwards-compatibility code for the handoff API in
CoordinatorBasedSegmentHandoffNotifier, because the new API was added
in 0.14.0. That's long enough ago that we don't need backwards
compatibility for rolling updates.
gianm added a commit that referenced this pull request Jul 27, 2023
* Merge core CoordinatorClient with MSQ CoordinatorServiceClient.

Continuing the work from #12696, this patch merges the MSQ
CoordinatorServiceClient into the core CoordinatorClient, yielding a single
interface that serves both needs and is based on the ServiceClient RPC
system rather than DruidLeaderClient.

Also removes the backwards-compatibility code for the handoff API in
CoordinatorBasedSegmentHandoffNotifier, because the new API was added
in 0.14.0. That's long enough ago that we don't need backwards
compatibility for rolling updates.

* Fixups.

* Trigger GHA.

* Remove unnecessary retrying in DruidInputSource. Add "about an hour"
retry policy and h

* EasyMock
gianm added a commit to gianm/druid that referenced this pull request Jul 28, 2023
Continuing the work from apache#12696. Also in this patch:

1) Extract the nice request builder stuff from the SeekableStream async
   client, and put it in a more broadly useful ServiceCallBuilder.

2) Slight behavior change to ServiceClient#request; switch to unchecked
   get instead of checked get. Unchecked get is more commonly wanted.
   Callers can get checked get if they need it by using asyncRequest.
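The checked-vs-unchecked split described in item 2 can be sketched generically. The names here (TwoFlavorClient, request, asyncRequest) are hypothetical stand-ins, not the actual ServiceClient API: the async method hands back the future so callers who want checked exceptions can call get() themselves, while the blocking method does an unchecked get for the common case.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hedged sketch of two request flavors built on one async core.
class TwoFlavorClient
{
  private final ExecutorService exec = Executors.newSingleThreadExecutor();

  /** Async flavor: callers hold the future and may use checked get() on it. */
  Future<String> asyncRequest(Callable<String> call)
  {
    return exec.submit(call);
  }

  /** Blocking flavor: unchecked get, so callers need not handle checked exceptions. */
  String request(Callable<String> call)
  {
    try {
      return asyncRequest(call).get();
    }
    catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    }
    catch (ExecutionException e) {
      throw new RuntimeException(e.getCause());
    }
  }

  void shutdown()
  {
    exec.shutdown();
  }
}
```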