-
Notifications
You must be signed in to change notification settings - Fork 3.8k
RTR, HRTR: Fix incorrect maxLazyWorkers check in markLazyWorkers. #14545
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should return a sub-list here of size
maxLazyWorkers. Otherwise, the strategy could potentially terminate all the lazy workers and not just those which are in excess.This would probably be okay if the value of
maxLazyWorkerspassed was always the same. In that case, we would only ever mark a worker as lazy if it was in excess ofmaxLazyWorkers. But if the value ofmaxLazyWorkersdecreases in subsequent invocations of this method, we would do some extra terminations.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we do want to terminate all the lazy workers, because nothing ever goes back and marks lazy workers as non-lazy. It is a permanent status unless there is an OL leader change. So, if we don't terminate them, they will just sit around not getting any work assigned to them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I see. I guess it makes sense to kill it then.
There is however
HttpRemoteTaskRunner.syncMonitoringwhere we reset workers which seem to be acting weird. That flow does remove items from thelazylist, to do another fresh retry of syncing to the worker. In the absence of the reset logic, the lazy flow seems a little self-fulfilling and would lead to eager terminations. If we mark a worker as lazy, and never assign it anything, it will always be lazy (going by the logic inProvisioningUtil.createLazyWorkerPredicate(config)as an example).Would it be better to have a time out after which we retry submitting tasks to the worker, and only after a few repeated retries, we truly mark the worker for termination?
So
lazybecomes more of a temporary state and a repeatedly lazy servers finally gets black-listed.(
HttpRemoteTaskRunneralso already has a list ofblacklistedWorkerswhich seems to be doing its own thing. 😅).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yeah, okay, nothing except the reset on desync, of course 🙂. And also a worker going completely offline and then coming back— that would also get it removed from the lazy list. So, it can happen, but not during normal operation for a stable worker.
The original idea behind "lazy" workers is that they should be terminated ASAP once in the lazy state. The terminology is mostly a joke about how those workers are lazy and haven't done any work in a while. Before being marked "lazy", they would have been in the state "idle" (not running any tasks) for the
workerIdleTimeout.On
blacklistedWorkers, that's a different thing, meant to catch workers that are having problems running tasks, so we don't schedule tasks on them and have them keep failing.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Thanks for the clarification and for indulging me, 🙂.