RTR, HRTR: Fix incorrect maxLazyWorkers check in markLazyWorkers. #14545
Conversation
Recently apache#14532 fixed a problem when maxLazyWorkers == 0 and lazyWorkers starts out empty. Unfortunately, even after that patch, there remained a more general version of this problem when maxLazyWorkers == lazyWorkers.size(). This patch fixes it. I'm not sure if this would actually happen in production, because the provisioning strategies do try to avoid calling markWorkersLazy until previously-initiated terminations have finished. Nevertheless, it still seems like a good thing to fix.
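The fixed guard can be sketched as follows. This is a minimal, hypothetical simplification of the flow the description implies; `LazyMarker` and its fields are illustrative stand-ins, not Druid's actual classes:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical simplification of the markWorkersLazy flow described above.
class LazyMarker
{
  private final List<String> lazyWorkers = new ArrayList<>();
  private final List<String> allWorkers;

  LazyMarker(List<String> allWorkers)
  {
    this.allWorkers = allWorkers;
  }

  // Before the fix, only maxLazyWorkers < 1 was checked up front. If
  // lazyWorkers had already reached maxLazyWorkers, the marking loop could
  // still run. Checking lazyWorkers.size() >= maxLazyWorkers covers the
  // general case, including maxLazyWorkers == 0 with an empty lazy list.
  List<String> markWorkersLazy(int maxLazyWorkers)
  {
    if (lazyWorkers.size() >= maxLazyWorkers) {
      return getLazyWorkers();
    }
    for (String worker : allWorkers) {
      if (!lazyWorkers.contains(worker)) {
        lazyWorkers.add(worker);
        if (lazyWorkers.size() == maxLazyWorkers) {
          break;
        }
      }
    }
    return getLazyWorkers();
  }

  List<String> getLazyWorkers()
  {
    return Collections.unmodifiableList(lazyWorkers);
  }
}
```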
```diff
-    if (maxLazyWorkers < 1) {
-      return Collections.emptyList();
+    if (lazyWorkers.size() >= maxLazyWorkers) {
+      return getLazyWorkers();
```
I think we should return a sub-list here of size maxLazyWorkers. Otherwise, the strategy could potentially terminate all the lazy workers and not just those which are in excess.
This would probably be okay if the value of maxLazyWorkers passed was always the same. In that case, we would only ever mark a worker as lazy if it was in excess of maxLazyWorkers. But if the value of maxLazyWorkers decreases in subsequent invocations of this method, we would do some extra terminations.
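The reviewer's suggestion could be sketched like this. The helper name and signature are hypothetical, not Druid's actual API; it just caps the workers offered for termination at `maxLazyWorkers`:

```java
import java.util.List;

// Sketch of the reviewer's suggestion: cap the list of workers returned
// for termination at maxLazyWorkers, so that a smaller limit in a later
// invocation cannot cause extra terminations. Names are illustrative only.
class LazyWorkerCap
{
  static <T> List<T> capLazyWorkers(List<T> lazyWorkers, int maxLazyWorkers)
  {
    if (lazyWorkers.size() <= maxLazyWorkers) {
      return lazyWorkers;
    }
    // Workers beyond the cap stay marked lazy but are not offered for
    // termination this round.
    return lazyWorkers.subList(0, maxLazyWorkers);
  }
}
```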
I think we do want to terminate all the lazy workers, because nothing ever goes back and marks lazy workers as non-lazy. It is a permanent status unless there is an Overlord (OL) leader change. So, if we don't terminate them, they will just sit around not getting any work assigned to them.
Hmm, I see. I guess it makes sense to kill it then.
There is however HttpRemoteTaskRunner.syncMonitoring where we reset workers which seem to be acting weird. That flow does remove items from the lazy list, to do another fresh retry of syncing to the worker. In the absence of the reset logic, the lazy flow seems a little self-fulfilling and would lead to eager terminations. If we mark a worker as lazy, and never assign it anything, it will always be lazy (going by the logic in ProvisioningUtil.createLazyWorkerPredicate(config) as an example).
Would it be better to have a timeout after which we retry submitting tasks to the worker, and only truly mark the worker for termination after a few repeated retries?
That way, lazy becomes more of a temporary state, and a repeatedly lazy server finally gets blacklisted.
(HttpRemoteTaskRunner also already has a list of blacklistedWorkers which seems to be doing its own thing. 😅).
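The retry-then-terminate idea from this comment could be sketched as a small strike counter. This is purely illustrative of the suggestion, not Druid's actual behavior; all names here are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the reviewer's suggestion: treat "lazy" as a temporary state,
// retry task submission a few times, and only mark the worker for
// termination after repeated strikes.
class LazyRetryTracker
{
  private final int maxStrikes;
  private final Map<String, Integer> lazyStrikes = new HashMap<>();

  LazyRetryTracker(int maxStrikes)
  {
    this.maxStrikes = maxStrikes;
  }

  // Record that a worker looked lazy again; return true once it has been
  // lazy often enough to be truly marked for termination.
  boolean recordLazyAndCheckTerminate(String worker)
  {
    int strikes = lazyStrikes.merge(worker, 1, Integer::sum);
    return strikes >= maxStrikes;
  }

  // A successful task assignment resets the worker's strikes.
  void recordTaskAssigned(String worker)
  {
    lazyStrikes.remove(worker);
  }
}
```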
Ah, yeah, okay, nothing except the reset on desync, of course 🙂. And also a worker going completely offline and then coming back: that would also get it removed from the lazy list. So, it can happen, but not during normal operation for a stable worker.
The original idea behind "lazy" workers is that they should be terminated ASAP once in the lazy state. The terminology is mostly a joke about how those workers are lazy and haven't done any work in a while. Before being marked "lazy", they would have been in the state "idle" (not running any tasks) for the workerIdleTimeout.
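The idle-timeout logic described here can be sketched as a predicate, in the spirit of `ProvisioningUtil.createLazyWorkerPredicate` mentioned above. `WorkerState` and the method signature are hypothetical stand-ins for Druid's actual types:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Predicate;

// Illustrative sketch of the idle-timeout check: a worker becomes a "lazy"
// candidate once it runs no tasks and its last activity is older than
// workerIdleTimeout. WorkerState is a hypothetical stand-in for Druid's
// actual worker holder type.
class IdleCheck
{
  record WorkerState(int runningTasks, Instant lastActivityTime) {}

  static Predicate<WorkerState> lazyWorkerPredicate(Duration workerIdleTimeout, Instant now)
  {
    return w -> w.runningTasks() == 0
                && w.lastActivityTime().plus(workerIdleTimeout).isBefore(now);
  }
}
```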
On blacklistedWorkers, that's a different thing, meant to catch workers that are having problems running tasks, so we don't schedule tasks on them and have them keep failing.
Makes sense. Thanks for the clarification and for indulging me 🙂.