Skip to content

Overlord to support autoscalers per indexer/middlemanager category#9350

Closed
sascha-coenen wants to merge 38 commits intoapache:masterfrom
smaato:feature-8695
Closed

Overlord to support autoscalers per indexer/middlemanager category#9350
sascha-coenen wants to merge 38 commits intoapache:masterfrom
smaato:feature-8695

Conversation

@sascha-coenen
Copy link
Copy Markdown

@sascha-coenen sascha-coenen commented Feb 11, 2020

Fixes #8695.

Description

Added support of many autoscalers of different categories to serve tasks of according category.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
    (https://github.com/apache/incubator-druid/blob/master/licenses.yaml)
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths.
  • been tested in a test Druid cluster.

Key changed/added classes in this PR
  • CategoriedWorkerBehaviorConfig
  • CategoriedWorkerProvisioningStrategy
  • CategoriedWorkerSelectStrategy
  • PendingTaskBasedWorkerProvisioningStrategy
  • SimpleWorkerProvisioningStrategy

Vladimir Iordanov and others added 24 commits December 4, 2019 15:37
 - Changed behavior config structure.
 - Small refactoring
 - Extended unit tests
 - Added "category" field into Autoscaler
 - Changed config format to support the "category" field
 - Fixed parameters of unit tests
 - Put category information into Autoscaler
 - Changed structure of behavior config
 - Changed CategoriedProvisioningStrategy accordingly
 - Fixed bug in PendingTaskBasedWorkerProvisioningStrategy
 - Extended unit tests
 - Code cleanup and refactoring
 - Introduced new logic into the existing strategies
 - Added more test cases
 - Refactored unit tests
 - Used Default Worker Category as a fallback for null worker categories
 - Refactored legacy code to avoid NPE
 - Removed duplicates
 - Added more debug info
 - Removed duplicated code
Overlord can support many autoscalers of different categories apache#8989
@sascha-coenen sascha-coenen requested review from QiuMM and gianm February 11, 2020 15:57
@sascha-coenen
Copy link
Copy Markdown
Author

Just FYI: we had to move this code contribution from one github repo to another one today.
So we removed the pevious PR that had not been reviewed yet with this one. No code has changed, its just a new PR that originates from a different repo now.
Sorry for the inconvenience.

… Add documentation and javadoc comments. Fixed naming of one class.
@sascha-coenen sascha-coenen changed the title Overlord can support many autoscalers of different categories Overlord to support autoscalers per indexer/middlemanager category Feb 19, 2020
… fix spelling in documentation and code to pass spellchecker tests
@sascha-coenen
Copy link
Copy Markdown
Author

the spellchecker was failing on a naming issue, so I fixed the spelling in the documentation and also changed wrongly spelled type names in the code accordingly.
Now all checks have passed.

Copy link
Copy Markdown
Contributor

@himanshug himanshug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partially reviewed, will continue later. just trying to understand things at this point.

Comment thread docs/configuration/index.md Outdated

import org.joda.time.Period;

public class CategorizedProvisioningConfig extends PendingTaskBasedWorkerProvisioningConfig
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note-to-self: Couldn't understand the point of this class as this is overriding all methods and just calling super.xx(..) , maybe I will understand more as I read rest of the PR.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be removed. Please ignore.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Vladi.
Just to update the status: I removed all orhpaned classes.
Changes to the DefaultWorkerBehaviorConfig class are still pending

private final List<AutoScaler> autoScalers;

@JsonCreator
public CategorizedWorkerBehaviorConfig(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think things will be simplified if we removed this class and instead made DefaultWorkerBehaviorConfig have List<AutoScaler> autoScalers with constructor

  @JsonCreator
  public DefaultWorkerBehaviorConfig(
      @JsonProperty("selectStrategy") WorkerSelectStrategy selectStrategy,
      @JsonProperty("autoScaler") AutoScaler autoScaler,
      @JsonProperty("autoScalers") List<AutoScaler> autoScalers
  )
  {
    this.selectStrategy = selectStrategy;
    // fail if both "autoscalers" and "autoscaler"  fields are non-null
    this.autoScalers = autoScalers or Collections.singletonList(autoscaler);
  }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you thing it will not confuse users? I'm about having two sections related to autoscalers in the same config.
I'm ok for that. Initially I also had such decision to just extend DefaultWorkerBehaviorConfig. But later I thought that having two sections for autoscalers can be not good for user experience.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, We are keeping autoScaler field only for backward compatibility .. we should really deprecate it and remove its mention from the docs. Eventually it should be removed and we should just have autoScalers .

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. No objections. Let's move the autoScalers section to DefaultWorkerBehaviorConfig and remove CategorizedWorkerBehaviorConfig.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to refactor the code based on the above review comments.
The class CategorizedWorkerBehaviorConfig has been removed, its logic been merged into the DefaultworkerBehaviorConfig.
The constructor has been modified according to himanshug's post above.

One thing I noticed recently, is that the new web-console is not treating the behaviour config as a single json object but has separate text boxes for select strategy and autoscaler.

What shall we do about this?

overlord-dynamic-config

Copy link
Copy Markdown
Contributor

@himanshug himanshug Mar 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means, users who want to use autoScalers can't use the console UI to set the spec and would instead have to manually send the http request to coordinator using curl or whatever.
after this PR is merged, an issue should be created for console UI updates to make it possible to add autoScalers .
In this PR, we need to make sure that we stay totally backwards compatible.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it. thanks.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am totally down to update the web console to reflect the new functionality.

* a worker's category.
*/
@JsonTypeName("categorizedTaskBased")
public class CategorizedWorkerProvisioningStrategy extends AbstractWorkerProvisioningStrategy
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain how this is different from PendingTaskBasedWorkerProvisioningStrategy ?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reviewing the difference I recalled that it is leftover which can be removed along with CategorizedProvisioningConfig. Originally I separately created the new strategy as an intermediate step to check that I didn't affect the original strategy. So I just forgot to remove it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that is what I guessed :)

categorization really cuts across all implementations of ProvisioningStrategy and hence shouldn't be a separate impl on its own.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to refactor the code based on the above review comments. The class CategorizedWorkerProvisioningStrategy has been removed.

Copy link
Copy Markdown
Contributor

@himanshug himanshug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reviewed some more, will continue.

Comment thread docs/configuration/index.md Outdated
|`selectStrategy`|How to assign tasks to MiddleManagers. Choices are `fillCapacity`, `equalDistribution`, and `javascript`.|equalDistribution|
|`autoScaler`|Only used if autoscaling is enabled. See below.|null|

##### Categorized Worker Config
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this section should be removed and autoScaler be replaced with autoScalers above.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

this.autoScaler = autoScaler;
this.autoScalers = (autoScaler != null) ? Collections.singletonList(autoScaler) : autoScalers;
if (this.autoScalers == null) {
throw new IllegalArgumentException("Either autoScaler or autoScalers property needs to be provided");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we remove mention of autoScaler property and deprecate it .. so that it gets removed eventually.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add two more checks ?

  1. make sure both autoScaler and autoScalers aren't provided
  2. if autoScalers is provided, make sure each autoScaler element has a different category.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

return null;
}
final DefaultWorkerBehaviorConfig workerConfig = (DefaultWorkerBehaviorConfig) workerBehaviorConfig;
if (workerConfig.getAutoScalers() == null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DefaultWorkerBehaviorConfig constructor shouldn't/doesn't allow instantiation with null autoScalers so this check is redundant.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

));

if (result.size() != autoScalers.size()) {
log.warn(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add the check in DefaultWorkerBehaviorConfig to ensure autoscalers with duplicate category aren't provided and fail? This is probably a case of user making an error in configuration.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. I changed that.

sascha-coenen and others added 2 commits March 4, 2020 00:22
…ord/setup/CategorizedWorkerSelectStrategy.java

Co-Authored-By: Himanshu <g.himanshu@gmail.com>
…ourConfig, doc fixes, clean up in ProvisioningUtil (apache#9350)
Copy link
Copy Markdown
Contributor

@himanshug himanshug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the delay, I have gone through whole PR now.

Comment on lines +58 to +63
if (autoScaler != null && autoScalers != null) {
throw new IllegalArgumentException("The autoScaler and autoScalers properties are mutually exclusive");
}
if (autoScaler == null && autoScalers == null) {
throw new IllegalArgumentException("Either autoScaler or autoScalers property must be provided");
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: XOR operator is the usual way to handle this type of check.

Suggested change
if (autoScaler != null && autoScalers != null) {
throw new IllegalArgumentException("The autoScaler and autoScalers properties are mutually exclusive");
}
if (autoScaler == null && autoScalers == null) {
throw new IllegalArgumentException("Either autoScaler or autoScalers property must be provided");
}
if (autoScaler == null ^ autoScalers == null) {
throw new IllegalArgumentException("Either(and only one of) autoScaler or autoScalers property must be provided");
}

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change but now I get a LGTM alert. It complains that the autoScalers variable can be null in line 64 which would lead to an NPE, but in my opinion LGTM is wrong.
Can you advise?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, lgtm is just wrong in this case.

Also, I see that you are using == instead of ^, while they both produce same result due to interning of boolean values I would prefer ^ which has clearly defined XOR meaning and reader for the code would get immediately context. == is "tricky" way to do what ^ is doing.

);

// select worker according to worker category spec
if (workerCategorySpec != null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this check is again done in getTaskCategory(..) and is redundant.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

/**
*
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: not sure why this file changed at all :)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

;)
right. I reverted the file.

Comment on lines +204 to +208
List<ImmutableWorkerInfo> categoryWorkers = workersByCategories.getOrDefault(category, Collections.emptyList());
currentlyProvisioningMap.putIfAbsent(category, new HashSet<>());
Set<String> currentlyProvisioning = this.currentlyProvisioningMap.get(category);
currentlyTerminatingMap.putIfAbsent(category, new HashSet<>());
Set<String> currentlyTerminating = this.currentlyTerminatingMap.get(category);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: replacement similar to other class could be used to avoid instantiation of HashSet here too.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +332 to +335
currentlyProvisioningMap.putIfAbsent(category, new HashSet<>());
Set<String> currentlyProvisioning = this.currentlyProvisioningMap.get(category);
currentlyTerminatingMap.putIfAbsent(category, new HashSet<>());
Set<String> currentlyTerminating = this.currentlyTerminatingMap.get(category);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: same here regarding HashSet instantiation

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

currentlyTerminatingMap.putIfAbsent(category, new HashSet<>());
Set<String> currentlyTerminating = this.currentlyTerminatingMap.get(category);

didProvision = doProvision(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: code till here appears to be repeated in many places, if possible, then refactor it into ProvisioningUtil maybe

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are perfectly right about the repeated code logic. However, I'm not able to refactor anything. For one thing, the way that responsibilities are spread across classes seems to make it impossible to refactor just the affected methods but would require a larger rewrite.
On a pragmatic level I'm also facing the issue that executing "mvn test" on just the druid-indexing module takes 15 minutes. If I try to execute the respective testsuite alone (SimpleProvisioningStrategyTest) in IntelliJ, I get an ANTL error:

/opt/repos/druid/core/src/main/java/org/apache/druid/math/expr/ExprListenerImpl.java
Error:(28, 40) java: package org.apache.druid.math.expr.antlr does not exist

Can you advise?

Also, one innocent question: I see these "resolve conversation" buttons and wonder whether I am supposed to mark a conversation as resolved if I believe to have addressed a review comment or whether it is rather meant to be the reviewer's privilege to do that.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for checking, I haven't looked that refactoring in details but I believe you.

for the antlr error to go away, do "mvn clean install -DskipTests" in core module whenever you switch branches.

I am supposed to mark a conversation as resolved if I believe to have addressed a review comment or whether it is rather meant to be the reviewer's privilege to do that.

it is a new feature in github and I don't think there is any formal agreement in Druid community over this. But, I would say, PR author should feel free to mark the conversation resolved when they think that they addressed it. If Reviewer doesn't agree then they can mark it "unresolved" and conversation can be further resumed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to randomly chime in, as a reviewer I definitely prefer that the author does not mark my comments as resolved, it makes it harder to come back to the PR and find all of my comments and check to make sure that they are indeed resolved. I think it would be better for whoever left the review comment to resolve it when they are satisfied, but maybe we should reach a consensus as a community on this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I spent a little more time today to read about the effects of resolving in https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-request-reviews#resolving-conversations . Resolving basically folds the conversation so as reviewer, I can see, you might not want that to happen till you verify whether your comment is addressed.
However, In that case reviewers also should have the responsibility of actively marking their comments resolved as and when they are addressed or else PR author is in doubt whether the change has been accepted by reviewer. I haven't really been doing that while reviewing the PRs and that is probably what confused @sascha-coenen .

sascha-coenen and others added 4 commits March 11, 2020 22:13
Co-Authored-By: Himanshu <g.himanshu@gmail.com>
…sioningStrategy (apache#9350)

Co-Authored-By: Himanshu <g.himanshu@gmail.com>
…sioningStrategy (apache#9350)

Co-Authored-By: Himanshu <g.himanshu@gmail.com>
…sioningStrategy (apache#9350)

Co-Authored-By: Himanshu <g.himanshu@gmail.com>
@lgtm-com
Copy link
Copy Markdown

lgtm-com Bot commented Mar 11, 2020

This pull request introduces 1 alert when merging cd3c76d into 2ef5c17 - view on LGTM.com

new alerts:

  • 1 for Dereferenced variable may be null

@lgtm-com
Copy link
Copy Markdown

lgtm-com Bot commented Mar 16, 2020

This pull request introduces 1 alert when merging eb3a001 into 09600db - view on LGTM.com

new alerts:

  • 1 for Dereferenced variable may be null

@himanshug
Copy link
Copy Markdown
Contributor

+1 after #9350 (comment) is addressed. thanks for your patience.

@stale
Copy link
Copy Markdown

stale Bot commented May 20, 2020

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

@stale stale Bot added the stale label May 20, 2020
@stale
Copy link
Copy Markdown

stale Bot commented Jun 17, 2020

This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@stale stale Bot closed this Jun 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Overlord can support many autoscalers of different categories

5 participants