Overlord to support autoscalers per indexer/middlemanager category#9350
Overlord to support autoscalers per indexer/middlemanager category#9350sascha-coenen wants to merge 38 commits intoapache:masterfrom
Conversation
Initial implementation.
- Changed behavior config structure. - Small refactoring - Extended unit tests
- Extended unit tests
- Extended and refactored unit tests
- Extended and refactored unit tests
- Extended and refactored unit tests
- Fixed formatting
- Fixed unit tests
- Added "category" field into Autoscaler - Changed config format to support the "category" field - Fixed parameters of unit tests
- Fixed formatting
- Put category information into Autoscaler - Changed structure of behavior config - Changed CategoriedProvisioningStrategy accordingly - Fixed bug in PendingTaskBasedWorkerProvisioningStrategy - Extended unit tests - Code cleanup and refactoring
- Code refactoring
- Code refactoring
- Introduced new logic into the existing strategies
- Fixes and refactoring
- Added more test cases - Refactored unit tests
- Fixed code styles
- Used Default Worker Category as a fallback for null worker categories
- Refactored legacy code to avoid NPE - Removed duplicates
- Added more debug info - Removed duplicated code
- Removed debug logging
Overlord can support many autoscalers of different categories apache#8989
|
Just FYI: we had to move this code contribution from one github repo to another one today. |
… Add documentation and javadoc comments. Fixed naming of one class.
… fix spelling in documentation and code to pass spellchecker tests
|
the spellchecker was failing on a naming issue, so I fixed the spelling in the documentation and also changed wrongly spelled type names in the code accordingly. |
|
|
||
| import org.joda.time.Period; | ||
|
|
||
| public class CategorizedProvisioningConfig extends PendingTaskBasedWorkerProvisioningConfig |
There was a problem hiding this comment.
note-to-self: Couldn't understand the point of this class as this is overriding all methods and just calling super.xx(..) , maybe I will understand more as I read rest of the PR.
There was a problem hiding this comment.
It can be removed. Please ignore.
There was a problem hiding this comment.
Thanks Vladi.
Just to update the status: I removed all orhpaned classes.
Changes to the DefaultWorkerBehaviorConfig class are still pending
| private final List<AutoScaler> autoScalers; | ||
|
|
||
| @JsonCreator | ||
| public CategorizedWorkerBehaviorConfig( |
There was a problem hiding this comment.
I think things will be simplified if we removed this class and instead made DefaultWorkerBehaviorConfig have List<AutoScaler> autoScalers with constructor
@JsonCreator
public DefaultWorkerBehaviorConfig(
@JsonProperty("selectStrategy") WorkerSelectStrategy selectStrategy,
@JsonProperty("autoScaler") AutoScaler autoScaler,
@JsonProperty("autoScalers") List<AutoScaler> autoScalers
)
{
this.selectStrategy = selectStrategy;
// fail if both "autoscalers" and "autoscaler" fields are non-null
this.autoScalers = autoScalers or Collections.singletonList(autoscaler);
}
There was a problem hiding this comment.
Do you thing it will not confuse users? I'm about having two sections related to autoscalers in the same config.
I'm ok for that. Initially I also had such decision to just extend DefaultWorkerBehaviorConfig. But later I thought that having two sections for autoscalers can be not good for user experience.
There was a problem hiding this comment.
Actually, We are keeping autoScaler field only for backward compatibility .. we should really deprecate it and remove its mention from the docs. Eventually it should be removed and we should just have autoScalers .
There was a problem hiding this comment.
Ok. No objections. Let's move the autoScalers section to DefaultWorkerBehaviorConfig and remove CategorizedWorkerBehaviorConfig.
There was a problem hiding this comment.
I tried to refactor the code based on the above review comments.
The class CategorizedWorkerBehaviorConfig has been removed, its logic been merged into the DefaultworkerBehaviorConfig.
The constructor has been modified according to himanshug's post above.
One thing I noticed recently, is that the new web-console is not treating the behaviour config as a single json object but has separate text boxes for select strategy and autoscaler.
What shall we do about this?
There was a problem hiding this comment.
That means, users who want to use autoScalers can't use the console UI to set the spec and would instead have to manually send the http request to coordinator using curl or whatever.
after this PR is merged, an issue should be created for console UI updates to make it possible to add autoScalers .
In this PR, we need to make sure that we stay totally backwards compatible.
There was a problem hiding this comment.
I am totally down to update the web console to reflect the new functionality.
| * a worker's category. | ||
| */ | ||
| @JsonTypeName("categorizedTaskBased") | ||
| public class CategorizedWorkerProvisioningStrategy extends AbstractWorkerProvisioningStrategy |
There was a problem hiding this comment.
Can you explain how this is different from PendingTaskBasedWorkerProvisioningStrategy ?
There was a problem hiding this comment.
After reviewing the difference I recalled that it is leftover which can be removed along with CategorizedProvisioningConfig. Originally I separately created the new strategy as an intermediate step to check that I didn't affect the original strategy. So I just forgot to remove it.
There was a problem hiding this comment.
yeah, that is what I guessed :)
categorization really cuts across all implementations of ProvisioningStrategy and hence shouldn't be a separate impl on its own.
There was a problem hiding this comment.
I tried to refactor the code based on the above review comments. The class CategorizedWorkerProvisioningStrategy has been removed.
Co-Authored-By: Himanshu <g.himanshu@gmail.com>
… remove orphaned classes according to findings duriing code review.
…e CategorizedWorkerBehaviourConfig (apache#9350)
…CategorizedWorkerBehaviourConfig class (apache#9350)
himanshug
left a comment
There was a problem hiding this comment.
reviewed some more, will continue.
| |`selectStrategy`|How to assign tasks to MiddleManagers. Choices are `fillCapacity`, `equalDistribution`, and `javascript`.|equalDistribution| | ||
| |`autoScaler`|Only used if autoscaling is enabled. See below.|null| | ||
|
|
||
| ##### Categorized Worker Config |
There was a problem hiding this comment.
this section should be removed and autoScaler be replaced with autoScalers above.
| this.autoScaler = autoScaler; | ||
| this.autoScalers = (autoScaler != null) ? Collections.singletonList(autoScaler) : autoScalers; | ||
| if (this.autoScalers == null) { | ||
| throw new IllegalArgumentException("Either autoScaler or autoScalers property needs to be provided"); |
There was a problem hiding this comment.
can we remove mention of autoScaler property and deprecate it .. so that it gets removed eventually.
There was a problem hiding this comment.
can we add two more checks ?
- make sure both
autoScalerandautoScalersaren't provided - if
autoScalersis provided, make sure each autoScaler element has a different category.
| return null; | ||
| } | ||
| final DefaultWorkerBehaviorConfig workerConfig = (DefaultWorkerBehaviorConfig) workerBehaviorConfig; | ||
| if (workerConfig.getAutoScalers() == null) { |
There was a problem hiding this comment.
DefaultWorkerBehaviorConfig constructor shouldn't/doesn't allow instantiation with null autoScalers so this check is redundant.
| )); | ||
|
|
||
| if (result.size() != autoScalers.size()) { | ||
| log.warn( |
There was a problem hiding this comment.
Can we add the check in DefaultWorkerBehaviorConfig to ensure autoscalers with duplicate category aren't provided and fail? This is probably a case of user making an error in configuration.
There was a problem hiding this comment.
good point. I changed that.
…ord/setup/CategorizedWorkerSelectStrategy.java Co-Authored-By: Himanshu <g.himanshu@gmail.com>
…ourConfig, doc fixes, clean up in ProvisioningUtil (apache#9350)
himanshug
left a comment
There was a problem hiding this comment.
sorry for the delay, I have gone through whole PR now.
| if (autoScaler != null && autoScalers != null) { | ||
| throw new IllegalArgumentException("The autoScaler and autoScalers properties are mutually exclusive"); | ||
| } | ||
| if (autoScaler == null && autoScalers == null) { | ||
| throw new IllegalArgumentException("Either autoScaler or autoScalers property must be provided"); | ||
| } |
There was a problem hiding this comment.
nit: XOR operator is the usual way to handle this type of check.
| if (autoScaler != null && autoScalers != null) { | |
| throw new IllegalArgumentException("The autoScaler and autoScalers properties are mutually exclusive"); | |
| } | |
| if (autoScaler == null && autoScalers == null) { | |
| throw new IllegalArgumentException("Either autoScaler or autoScalers property must be provided"); | |
| } | |
| if (autoScaler == null ^ autoScalers == null) { | |
| throw new IllegalArgumentException("Either(and only one of) autoScaler or autoScalers property must be provided"); | |
| } |
There was a problem hiding this comment.
I made this change but now I get a LGTM alert. It complains that the autoScalers variable can be null in line 64 which would lead to an NPE, but in my opinion LGTM is wrong.
Can you advise?
There was a problem hiding this comment.
I agree, lgtm is just wrong in this case.
Also, I see that you are using == instead of ^, while they both produce same result due to interning of boolean values I would prefer ^ which has clearly defined XOR meaning and reader for the code would get immediately context. == is "tricky" way to do what ^ is doing.
| ); | ||
|
|
||
| // select worker according to worker category spec | ||
| if (workerCategorySpec != null) { |
There was a problem hiding this comment.
this check is again done in getTaskCategory(..) and is redundant.
| } | ||
|
|
||
| /** | ||
| * |
There was a problem hiding this comment.
nit: not sure why this file changed at all :)
There was a problem hiding this comment.
;)
right. I reverted the file.
| List<ImmutableWorkerInfo> categoryWorkers = workersByCategories.getOrDefault(category, Collections.emptyList()); | ||
| currentlyProvisioningMap.putIfAbsent(category, new HashSet<>()); | ||
| Set<String> currentlyProvisioning = this.currentlyProvisioningMap.get(category); | ||
| currentlyTerminatingMap.putIfAbsent(category, new HashSet<>()); | ||
| Set<String> currentlyTerminating = this.currentlyTerminatingMap.get(category); |
There was a problem hiding this comment.
nit: replacement similar to other class could be used to avoid instantiation of HashSet here too.
| currentlyProvisioningMap.putIfAbsent(category, new HashSet<>()); | ||
| Set<String> currentlyProvisioning = this.currentlyProvisioningMap.get(category); | ||
| currentlyTerminatingMap.putIfAbsent(category, new HashSet<>()); | ||
| Set<String> currentlyTerminating = this.currentlyTerminatingMap.get(category); |
There was a problem hiding this comment.
nit: same here regarding HashSet instantiation
| currentlyTerminatingMap.putIfAbsent(category, new HashSet<>()); | ||
| Set<String> currentlyTerminating = this.currentlyTerminatingMap.get(category); | ||
|
|
||
| didProvision = doProvision( |
There was a problem hiding this comment.
nit: code till here appears to be repeated in many places, if possible, then refactor it into ProvisioningUtil maybe
There was a problem hiding this comment.
you are perfectly right about the repeated code logic. However, I'm not able to refactor anything. For one thing, the way that responsibilities are spread across classes seems to make it impossible to refactor just the affected methods but would require a larger rewrite.
On a pragmatic level I'm also facing the issue that executing "mvn test" on just the druid-indexing module takes 15 minutes. If I try to execute the respective testsuite alone (SimpleProvisioningStrategyTest) in IntelliJ, I get an ANTL error:
/opt/repos/druid/core/src/main/java/org/apache/druid/math/expr/ExprListenerImpl.java
Error:(28, 40) java: package org.apache.druid.math.expr.antlr does not exist
Can you advise?
Also, one innocent question: I see these "resolve conversation" buttons and wonder whether I am supposed to mark a conversation as resolved if I believe to have addressed a review comment or whether it is rather meant to be the reviewer's privilege to do that.
There was a problem hiding this comment.
thanks for checking, I haven't looked that refactoring in details but I believe you.
for the antlr error to go away, do "mvn clean install -DskipTests" in core module whenever you switch branches.
I am supposed to mark a conversation as resolved if I believe to have addressed a review comment or whether it is rather meant to be the reviewer's privilege to do that.
it is a new feature in github and I don't think there is any formal agreement in Druid community over this. But, I would say, PR author should feel free to mark the conversation resolved when they think that they addressed it. If Reviewer doesn't agree then they can mark it "unresolved" and conversation can be further resumed.
There was a problem hiding this comment.
Just to randomly chime in, as a reviewer I definitely prefer that the author does not mark my comments as resolved, it makes it harder to come back to the PR and find all of my comments and check to make sure that they are indeed resolved. I think it would be better for whoever left the review comment to resolve it when they are satisfied, but maybe we should reach a consensus as a community on this.
There was a problem hiding this comment.
I see, I spent a little more time today to read about the effects of resolving in https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-request-reviews#resolving-conversations . Resolving basically folds the conversation so as reviewer, I can see, you might not want that to happen till you verify whether your comment is addressed.
However, In that case reviewers also should have the responsibility of actively marking their comments resolved as and when they are addressed or else PR author is in doubt whether the change has been accepted by reviewer. I haven't really been doing that while reviewing the PRs and that is probably what confused @sascha-coenen .
Co-Authored-By: Himanshu <g.himanshu@gmail.com>
…sioningStrategy (apache#9350) Co-Authored-By: Himanshu <g.himanshu@gmail.com>
…sioningStrategy (apache#9350) Co-Authored-By: Himanshu <g.himanshu@gmail.com>
…sioningStrategy (apache#9350) Co-Authored-By: Himanshu <g.himanshu@gmail.com>
|
This pull request introduces 1 alert when merging cd3c76d into 2ef5c17 - view on LGTM.com new alerts:
|
|
This pull request introduces 1 alert when merging eb3a001 into 09600db - view on LGTM.com new alerts:
|
|
+1 after #9350 (comment) is addressed. thanks for your patience. |
|
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions. |
|
This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |

Fixes #8695.
Description
Added support of many autoscalers of different categories to serve tasks of according category.
This PR has:
(https://github.com/apache/incubator-druid/blob/master/licenses.yaml)
Key changed/added classes in this PR
CategoriedWorkerBehaviorConfigCategoriedWorkerProvisioningStrategyCategoriedWorkerSelectStrategyPendingTaskBasedWorkerProvisioningStrategySimpleWorkerProvisioningStrategy