
RTR, HRTR: Fix incorrect maxLazyWorkers check in markLazyWorkers. #14545

Merged

gianm merged 1 commit into apache:master from gianm:xrtr-refine-lazy on Jul 7, 2023

Conversation

@gianm (Contributor) commented Jul 7, 2023

Recently #14532 fixed a problem when maxLazyWorkers == 0 and lazyWorkers starts out empty. Unfortunately, even after that patch, there remained a more general version of this problem when maxLazyWorkers == lazyWorkers.size(). This patch fixes it.

I'm not sure if this would actually happen in production, because the provisioning strategies do try to avoid calling markWorkersLazy until previously-initiated terminations have finished. Nevertheless, it still seems like a good thing to fix.

@gianm added the Bug label on Jul 7, 2023
The relevant change from the diff:

```diff
-    if (maxLazyWorkers < 1) {
-      return Collections.emptyList();
+    if (lazyWorkers.size() >= maxLazyWorkers) {
+      return getLazyWorkers();
```
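For readers following along, here is a minimal standalone sketch of the corrected loop. The field and method names echo the diff above, but everything else (the types, the worker map) is a hypothetical stand-in, not the actual RemoteTaskRunner code:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Standalone sketch with hypothetical types and fields (not the real Druid
// source). It shows why the guard must compare lazyWorkers.size() against
// maxLazyWorkers: the old "maxLazyWorkers < 1" check only covered the case
// where the lazy set started out empty, while this version also stops when
// the lazy set is already at (or over) the cap on entry.
class MarkWorkersLazySketch
{
  private final Map<String, String> workers = new LinkedHashMap<>();     // workerId -> worker state
  private final Map<String, String> lazyWorkers = new LinkedHashMap<>(); // workerId -> worker state

  Collection<String> markWorkersLazy(Predicate<String> isLazyWorker, int maxLazyWorkers)
  {
    for (Map.Entry<String, String> entry : workers.entrySet()) {
      // Fixed guard: stop as soon as we already have maxLazyWorkers
      // (or more) lazy workers, including on the very first iteration.
      if (lazyWorkers.size() >= maxLazyWorkers) {
        break;
      }
      if (isLazyWorker.test(entry.getValue())) {
        lazyWorkers.put(entry.getKey(), entry.getValue());
      }
    }
    return getLazyWorkers();
  }

  List<String> getLazyWorkers()
  {
    return new ArrayList<>(lazyWorkers.keySet());
  }
}
```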
Contributor:

I think we should return a sub-list here of size maxLazyWorkers. Otherwise, the strategy could potentially terminate all the lazy workers and not just those which are in excess.

This would probably be okay if the value of maxLazyWorkers passed was always the same. In that case, we would only ever mark a worker as lazy if it was in excess of maxLazyWorkers. But if the value of maxLazyWorkers decreases in subsequent invocations of this method, we would do some extra terminations.
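To make the suggested alternative concrete, a minimal hypothetical sketch (the generic worker type and helper name are illustrative, not the merged code):

```java
import java.util.List;

// Hypothetical illustration of the sub-list alternative: cap the returned
// collection at maxLazyWorkers so the provisioning strategy never terminates
// more workers than the current cap allows.
class SubListAlternative
{
  static <W> List<W> markWorkersLazyCapped(List<W> lazyWorkers, int maxLazyWorkers)
  {
    // subList would throw if maxLazyWorkers exceeds the list size, so clamp it.
    return lazyWorkers.subList(0, Math.min(maxLazyWorkers, lazyWorkers.size()));
  }
}
```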

Contributor Author (gianm):

I think we do want to terminate all the lazy workers, because nothing ever goes back and marks lazy workers as non-lazy. It is a permanent status unless there is an Overlord (OL) leader change. So, if we don't terminate them, they will just sit around not getting any work assigned to them.

Contributor:

Hmm, I see. I guess it makes sense to kill it then.

There is, however, HttpRemoteTaskRunner.syncMonitoring, where we reset workers that seem to be acting weird. That flow does remove items from the lazy list, in order to do a fresh retry of syncing to the worker. In the absence of the reset logic, the lazy flow seems a little self-fulfilling and would lead to eager terminations. If we mark a worker as lazy and never assign it anything, it will always be lazy (going by the logic in ProvisioningUtil.createLazyWorkerPredicate(config), for example).

Would it be better to have a timeout after which we retry submitting tasks to the worker, and only mark the worker for termination after a few repeated retries? That way, lazy becomes more of a temporary state, and a repeatedly lazy server finally gets blacklisted.

(HttpRemoteTaskRunner also already has a list of blacklistedWorkers which seems to be doing its own thing. 😅).
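To make the self-fulfilling concern concrete, here is a rough standalone sketch of the kind of predicate being described. The WorkerInfo type is a hypothetical stand-in (not Druid's ImmutableWorkerInfo), and the real logic lives in ProvisioningUtil.createLazyWorkerPredicate:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Predicate;

// Illustrative only. The self-fulfilling part: a worker marked lazy receives
// no new tasks, so its last-completed-task time never advances, and it keeps
// satisfying the predicate on every subsequent check.
record WorkerInfo(int runningTaskCount, Instant lastCompletedTaskTime) {}

class LazyPredicateSketch
{
  static Predicate<WorkerInfo> createLazyWorkerPredicate(Duration workerIdleTimeout)
  {
    return w -> w.runningTaskCount() == 0
                && w.lastCompletedTaskTime().isBefore(Instant.now().minus(workerIdleTimeout));
  }
}
```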

Contributor Author (gianm):

Ah, yeah, okay, nothing except the reset on desync, of course 🙂. And also a worker going completely offline and then coming back; that would also get it removed from the lazy list. So, it can happen, but not during normal operation for a stable worker.

The original idea behind "lazy" workers is that they should be terminated ASAP once in the lazy state. The terminology is mostly a joke about how those workers are lazy and haven't done any work in a while. Before being marked "lazy", they would have been in the state "idle" (not running any tasks) for the workerIdleTimeout.

On blacklistedWorkers, that's a different thing, meant to catch workers that are having problems running tasks, so we don't schedule tasks on them and have them keep failing.

Contributor:

Makes sense. Thanks for the clarification and for indulging me 🙂.

@gianm merged commit 021a01d into apache:master on Jul 7, 2023
@gianm deleted the xrtr-refine-lazy branch on July 7, 2023 at 17:08
@AmatyaAvadhanula added this to the 27.0 milestone on Jul 19, 2023
sergioferragut pushed a commit to sergioferragut/druid that referenced this pull request on Jul 21, 2023: RTR, HRTR: Fix incorrect maxLazyWorkers check in markLazyWorkers. (apache#14545)