RTR, HRTR: Fix incorrect maxLazyWorkers check in markLazyWorkers. #14545
Conversation
Recently apache#14532 fixed a problem when maxLazyWorkers == 0 and lazyWorkers starts out empty. Unfortunately, even after that patch, there remained a more general version of this problem when maxLazyWorkers == lazyWorkers.size(). This patch fixes it. I'm not sure if this would actually happen in production, because the provisioning strategies do try to avoid calling markWorkersLazy until previously-initiated terminations have finished. Nevertheless, it still seems like a good thing to fix.
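The fixed guard can be sketched as follows. This is a minimal, hypothetical simplification of the flow the description implies; `LazyMarker` and its fields are illustrative stand-ins, not Druid's actual classes:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical simplification of the markWorkersLazy flow described above.
class LazyMarker
{
  private final List<String> lazyWorkers = new ArrayList<>();
  private final List<String> allWorkers;

  LazyMarker(List<String> allWorkers)
  {
    this.allWorkers = allWorkers;
  }

  // Before the fix, only maxLazyWorkers < 1 was checked up front. If
  // lazyWorkers had already reached maxLazyWorkers, the marking loop could
  // still run. Checking lazyWorkers.size() >= maxLazyWorkers covers the
  // general case, including maxLazyWorkers == 0 with an empty lazy list.
  List<String> markWorkersLazy(int maxLazyWorkers)
  {
    if (lazyWorkers.size() >= maxLazyWorkers) {
      return getLazyWorkers();
    }
    for (String worker : allWorkers) {
      if (!lazyWorkers.contains(worker)) {
        lazyWorkers.add(worker);
        if (lazyWorkers.size() == maxLazyWorkers) {
          break;
        }
      }
    }
    return getLazyWorkers();
  }

  List<String> getLazyWorkers()
  {
    return Collections.unmodifiableList(lazyWorkers);
  }
}
```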
```diff
-    if (maxLazyWorkers < 1) {
-      return Collections.emptyList();
+    if (lazyWorkers.size() >= maxLazyWorkers) {
+      return getLazyWorkers();
```
I think we should return a sub-list here of size maxLazyWorkers. Otherwise, the strategy could potentially terminate all the lazy workers and not just those which are in excess.
This would probably be okay if the value of maxLazyWorkers passed was always the same. In that case, we would only ever mark a worker as lazy if it was in excess of maxLazyWorkers. But if the value of maxLazyWorkers decreases in subsequent invocations of this method, we would do some extra terminations.
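The reviewer's suggestion could be sketched like this. The helper name and signature are hypothetical, not Druid's actual API; it just caps the workers offered for termination at `maxLazyWorkers`:

```java
import java.util.List;

// Sketch of the reviewer's suggestion: cap the list of workers returned
// for termination at maxLazyWorkers, so that a smaller limit in a later
// invocation cannot cause extra terminations. Names are illustrative only.
class LazyWorkerCap
{
  static <T> List<T> capLazyWorkers(List<T> lazyWorkers, int maxLazyWorkers)
  {
    if (lazyWorkers.size() <= maxLazyWorkers) {
      return lazyWorkers;
    }
    // Workers beyond the cap stay marked lazy but are not offered for
    // termination this round.
    return lazyWorkers.subList(0, maxLazyWorkers);
  }
}
```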
I think we do want to terminate all the lazy workers, because nothing ever goes back and marks lazy workers as non-lazy. It is a permanent status unless there is an Overlord (OL) leader change. So, if we don't terminate them, they will just sit around not getting any work assigned to them.
Hmm, I see. I guess it makes sense to kill it then.
There is however HttpRemoteTaskRunner.syncMonitoring where we reset workers which seem to be acting weird. That flow does remove items from the lazy list, to do another fresh retry of syncing to the worker. In the absence of the reset logic, the lazy flow seems a little self-fulfilling and would lead to eager terminations. If we mark a worker as lazy, and never assign it anything, it will always be lazy (going by the logic in ProvisioningUtil.createLazyWorkerPredicate(config) as an example).
Would it be better to have a timeout after which we retry submitting tasks to the worker, and only truly mark the worker for termination after a few repeated retries?
That way, lazy becomes more of a temporary state, and a repeatedly lazy server finally gets blacklisted.
(HttpRemoteTaskRunner also already has a list of blacklistedWorkers which seems to be doing its own thing. 😅).
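The retry-then-terminate idea from this comment could be sketched as a small strike counter. This is purely illustrative of the suggestion, not Druid's actual behavior; all names here are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the reviewer's suggestion: treat "lazy" as a temporary state,
// retry task submission a few times, and only mark the worker for
// termination after repeated strikes.
class LazyRetryTracker
{
  private final int maxStrikes;
  private final Map<String, Integer> lazyStrikes = new HashMap<>();

  LazyRetryTracker(int maxStrikes)
  {
    this.maxStrikes = maxStrikes;
  }

  // Record that a worker looked lazy again; return true once it has been
  // lazy often enough to be truly marked for termination.
  boolean recordLazyAndCheckTerminate(String worker)
  {
    int strikes = lazyStrikes.merge(worker, 1, Integer::sum);
    return strikes >= maxStrikes;
  }

  // A successful task assignment resets the worker's strikes.
  void recordTaskAssigned(String worker)
  {
    lazyStrikes.remove(worker);
  }
}
```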
Ah, yeah, okay, nothing except the reset on desync, of course 🙂. And also a worker going completely offline and then coming back: that would also get it removed from the lazy list. So, it can happen, but not during normal operation for a stable worker.
The original idea behind "lazy" workers is that they should be terminated ASAP once in the lazy state. The terminology is mostly a joke about how those workers are lazy and haven't done any work in a while. Before being marked "lazy", they would have been in the state "idle" (not running any tasks) for the workerIdleTimeout.
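The idle-timeout logic described here can be sketched as a predicate, in the spirit of `ProvisioningUtil.createLazyWorkerPredicate` mentioned above. `WorkerState` and the method signature are hypothetical stand-ins for Druid's actual types:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Predicate;

// Illustrative sketch of the idle-timeout check: a worker becomes a "lazy"
// candidate once it runs no tasks and its last activity is older than
// workerIdleTimeout. WorkerState is a hypothetical stand-in for Druid's
// actual worker holder type.
class IdleCheck
{
  record WorkerState(int runningTasks, Instant lastActivityTime) {}

  static Predicate<WorkerState> lazyWorkerPredicate(Duration workerIdleTimeout, Instant now)
  {
    return w -> w.runningTasks() == 0
                && w.lastActivityTime().plus(workerIdleTimeout).isBefore(now);
  }
}
```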
On blacklistedWorkers, that's a different thing, meant to catch workers that are having problems running tasks, so we don't schedule tasks on them and have them keep failing.
Makes sense. Thanks for the clarification and for indulging me 🙂.