
Add dynamic coordinator config that allows control over how many segments are considered when picking a segment to move. #10284

Merged

suneet-s merged 31 commits into apache:master from capistrant:bounded-segment-balancer
Dec 22, 2020

Conversation

@capistrant
Contributor

@capistrant capistrant commented Aug 14, 2020

Release Notes

A new Coordinator Dynamic Config, percentOfSegmentsToConsiderPerMove, has been added. This configuration specifies the percent of served segments that will be considered when picking a segment to potentially move. The coordinator uses this value and the number of currently served segments to calculate the number of segments that will be candidates. It will not consider any segments beyond that calculated number when picking a segment to move. The default value is 100, meaning the coordinator will consider all segments every time it is picking a segment to move. This default leads to the same behavior that existed prior to this configuration being added.

Description

A large cluster with many segments results in a lot of work being done by the Coordinator in order to complete its duties. I believe that any optimization to coordinator duties can help in a large cluster. This patch gives an experienced admin a knob to turn in order to try and shave some time off of the balance segments duty. As of now, existing Balancer Strategies iterate over all of the segments in the cluster when choosing a segment to move. The first segment candidate is the most likely to be moved and the last segment candidate is the least likely to move. This patch gives an admin the ability to put a limit on the number of segments that will be candidates to be moved. For most cases, I don't think this knob will be needed, but in some large enterprise cases I feel that it could be beneficial.

I updated the BalancerStrategy Interface. The pickSegmentToMove method gained a 3rd parameter that specifies the number of segments that should be considered when picking a segment to move.

CostBalancerStrategy (and its inheriting classes) and RandomBalancerStrategy both leverage ReservoirSegmentSampler to choose a segment "at random" from a list of candidate servers. I updated the relevant method in ReservoirSegmentSampler to honor the limiting parameter described above. Once the limit is reached, the method picking a segment breaks out of its iteration and returns immediately.
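
For illustration, here is a minimal sketch of reservoir sampling with that kind of cap. The type and accessor names (ServerHolder, iterateAllSegments, returning a bare DataSegment) are assumptions chosen for readability, not the exact ReservoirSegmentSampler code:

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

import org.apache.druid.server.coordinator.ServerHolder;
import org.apache.druid.timeline.DataSegment;

class CappedReservoirSamplerSketch
{
  // Pick one segment uniformly at random from the candidate servers, but stop
  // looking once maxSegmentsToConsider segments have been examined.
  static DataSegment pickSegmentToMove(List<ServerHolder> servers, int maxSegmentsToConsider)
  {
    DataSegment chosen = null;
    int seen = 0;
    for (ServerHolder server : servers) {
      for (DataSegment segment : server.getServer().iterateAllSegments()) {
        seen++;
        // Reservoir sampling: the i-th segment replaces the current pick with
        // probability 1/i, so every segment seen so far is equally likely.
        if (ThreadLocalRandom.current().nextInt(seen) == 0) {
          chosen = segment;
        }
        // New behavior in this PR: bail out early once the cap is hit.
        if (seen >= maxSegmentsToConsider) {
          return chosen;
        }
      }
    }
    return chosen;
  }
}
```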

Currently, all code paths use a new dynamic coordinator config that an admin can tune if they'd like to put a limit on this segment-picking step. The default value for the config is such that all segments will be iterated over and be candidates to pick. I thought a dynamic config was a good fit because it is flexible: for example, you might temporarily boost the number of segments to move in order to rebalance onto new servers faster, and at the same time lower this new config so that each move spends less time considering candidate segments.

The new dynamic config is maxSegmentsToConsiderPerMove, with a default of Integer.MAX_VALUE. (As the conversation below explains, this was later reworked into the percentage-based percentOfSegmentsToConsiderPerMove described in the release notes above.)

I call out in the documentation that an admin should be experienced when considering altering this config. I say that because in many cases, the default is fine.

An alternative to this approach would be to restrict what is sent to pickSegmentToMove in the first place. I chose not to pursue that at this time because I didn't like the idea of either choosing the number of ServerHolders to send to pickSegmentToMove, or analyzing the ServerHolders before picking how many to send so that only a certain number of segments are sent. I'd be open to re-assessing whether this would be a better approach if someone suggests it may be the proper one.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

Key changed/added classes in this PR
  • BalancerStrategy Interface
  • CostBalancerStrategy
  • RandomBalancerStrategy
  • ReservoirSegmentSampler
  • CoordinatorDynamicConfig

add new dynamic coordinator config, maxSegmentsToConsiderPerMove. This
config caps the number of segments that are iterated over when selecting
a segment to move. The default value combined with current balancing
strategies will still iterate over all provided segments. However,
setting this value to something > 0 will cap the number of segments
visited. This could make sense in cases where a cluster has a very large
number of segments and the admins prefer less iterations vs a thorough
consideration of all segments provided.
@capistrant
Contributor Author

I recently became a committer, but haven't gotten repo access setup yet (If another committer reads this and is willing to guide me on this process, I'm all ears!) so I cannot complete all of the committer steps. For labels, I would add:

  • Design Review - It does add a config and change the balancing strategy interface so a design review seems necessary/smart
  • Feature
  • Performance (I did benchmark this in a dev cluster and we will see shaved time in the pickSegmentToMove code, but it will likely only be noticeable/worthwhile at very large scale)
  • Area - Segment Balancing/Coordination

@a2l007
Contributor

a2l007 commented Aug 14, 2020

Would it make sense to perform the segment limiting for each ServerHolder individually rather than short-circuiting the entire selection flow? Say we set maxSegmentsToConsiderPerMove to 50: it would consider only 50 segments per ServerHolder, after which it would run the selection process for the next ServerHolder. That way, we can ensure that ReservoirSegmentSampler considers all the given servers before making the selection.
Also, it might be more convenient for the user to express this parameter as a percentage of segments to select from rather than a raw number, but I guess that might take away some of the performance gain that we're trying to achieve here.

@capistrant
Contributor Author

Would it make sense to perform the segment limiting for each ServerHolder individually rather than short-circuiting the entire selection flow? Say we set maxSegmentsToConsiderPerMove to 50: it would consider only 50 segments per ServerHolder, after which it would run the selection process for the next ServerHolder. That way, we can ensure that ReservoirSegmentSampler considers all the given servers before making the selection.

If the ServerHolder objects were passed in an indeterminate order, I would agree. But since they are not, I think doing this could work against what I am trying to accomplish here. As an admin, I have determined that we are wasting our time considering segments once we have already considered a certain number of them, because beyond that point the probability of choosing one of the remaining segments is so low that I have deemed it not worth the resources spent considering it. My implementation says: if, under normal balancing conditions (no limit), the segments beyond some point would (probably) not be chosen anyway, let's just bail out early. Your suggestion somewhat changes the fundamentals of the balancing strategy: segments that, under normal balancing conditions, would have had a near-zero chance of being chosen now get a real chance of being chosen, even though I, as an admin, have determined they are not worth the resources to evaluate. And worse yet, the segments that may be chosen from ServerHolders at the end of the list come from servers that I would probably prefer to be receiving segments rather than giving them up (in terms of consumption, not locality/performance), unless my cluster is already very well balanced by consumption.

Perhaps what we really want here is a way to configure the balancer to use one of many(?) implementations of "pick a segment from a list of ServerHolder objects". Instead of always using ReservoirSegmentSampler#getRandomBalancerSegmentHolder, there could be multiple implementations of this action and an option for the admin to choose one, with the current one being the default. I did briefly consider this when I was thinking up the solution, but decided it may be too much of a heavy lift if the problem I am addressing is unique to certain large deployments whose tenants stress over the execution time of a coordination cycle. So instead of coming up with a way to configure a selector and then adding my slightly different selector via static coordinator configs instead of the dynamic config, I kind of wedged my desired behavior into the existing implementation.

Also, it might be more convenient for the user to express this parameter as a percentage of segments to select from rather than a raw number, but I guess that might take away some of the performance gain that we're trying to achieve here.

I had not considered this initially, but it is a good idea. I'd be okay with going this route if it is trivial to calculate the number of segments in the cluster including replication. I'll have to look at that.

@a2l007
Contributor

a2l007 commented Aug 14, 2020

If the ServerHolder objects were passed in an indeterminate order, I would agree. But since they are not, I think doing this could work against what I am trying to accomplish here. As an admin, I have determined that we are wasting our time considering segments once we have already considered a certain number of them, because beyond that point the probability of choosing one of the remaining segments is so low that I have deemed it not worth the resources spent considering it. My implementation says: if, under normal balancing conditions (no limit), the segments beyond some point would (probably) not be chosen anyway, let's just bail out early. Your suggestion somewhat changes the fundamentals of the balancing strategy: segments that, under normal balancing conditions, would have had a near-zero chance of being chosen now get a real chance of being chosen, even though I, as an admin, have determined they are not worth the resources to evaluate. And worse yet, the segments that may be chosen from ServerHolders at the end of the list come from servers that I would probably prefer to be receiving segments rather than giving them up (in terms of consumption, not locality/performance), unless my cluster is already very well balanced by consumption.

Thanks for clarifying. My suggestion was basically ignoring the fact that the serverholder list is sorted by available space. In case we don't proceed to go the percentage route in this PR, it may be helpful to provide some recommendation in the docs on what might be a good value to set this to percentage-wise.

@jihoonson
Contributor

I recently became a committer, but haven't gotten repo access setup yet (If another committer reads this and is willing to guide me on this process, I'm all ears!) so I cannot complete all of the committer steps. For labels, I would add:

* Design Review - It does add a config and change the balancing strategy interface so a design review seems necessary/smart

* Feature

* Performance (I did benchmark this in a dev cluster and we will see shaved time in the pickSegmentToMove code, but it will likely only be noticeable/worthwhile at very large scale)

* Area - Segment Balancing/Coordination

Hey @capistrant, have you finished the setup for GitBox? You can check it here: https://gitbox.apache.org/setup/.

@capistrant
Contributor Author

If the ServerHolder objects were passed in an indeterminate order, I would agree. But since they are not, I think doing this could work against what I am trying to accomplish here. As an admin, I have determined that we are wasting our time considering segments once we have already considered a certain number of them, because beyond that point the probability of choosing one of the remaining segments is so low that I have deemed it not worth the resources spent considering it. My implementation says: if, under normal balancing conditions (no limit), the segments beyond some point would (probably) not be chosen anyway, let's just bail out early. Your suggestion somewhat changes the fundamentals of the balancing strategy: segments that, under normal balancing conditions, would have had a near-zero chance of being chosen now get a real chance of being chosen, even though I, as an admin, have determined they are not worth the resources to evaluate. And worse yet, the segments that may be chosen from ServerHolders at the end of the list come from servers that I would probably prefer to be receiving segments rather than giving them up (in terms of consumption, not locality/performance), unless my cluster is already very well balanced by consumption.

Thanks for clarifying. My suggestion was basically ignoring the fact that the serverholder list is sorted by available space. In case we don't proceed to go the percentage route in this PR, it may be helpful to provide some recommendation in the docs on what might be a good value to set this to percentage-wise.

Sorry about accidentally editing your previous comment. Was not paying attention to what I was doing. I reverted the changes I made. On to what I was trying to say:

Finally circling back to this now that I have time... I decided to just update the doc for now. I was still trying to determine whether doing percentages would require compute overhead that defeats the purpose of this. Looking at the code as I type this update... ServerHolder contains an ImmutableDruidServer, which has an int instance variable numSegments. So I guess we could quickly get numSegments since it was essentially pre-computed for us; I had worried that we'd have to stream the segments map to get the number on the fly. So all in, we'd iterate the ServerHolders first to get the total number of segments, then calculate our %, and use the same code with this new limiter. I really like this as it is more flexible and durable long term; the raw-number approach may require TLC when segment and server counts change significantly. I think we can chalk up the overhead of calculating the number of segments to consider as negligible. Say a cluster has 200 servers with 10k segments per server. If we do 25%, we add a loop that iterates a 200-item list and does a little math, which reduces the number of segment considerations from 2MM to 500k. We'd also skip the calculation if the value is 100%, so there is still no added overhead for default-value users.

tl;dr I am going to implement the percentages approach instead of the raw number. The compute overhead to calculate the number of segments to iterate seems negligible compared to what we will save in the cases where this will be used (large clusters).
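
A rough sketch of that calculation (accessor names such as getNumSegments are assumptions for illustration, not necessarily the exact Druid API):

```java
import java.util.List;

import org.apache.druid.server.coordinator.ServerHolder;

class SegmentsToConsiderSketch
{
  // Derive the per-move consideration limit from the configured percentage.
  static int computeSegmentsToConsider(List<ServerHolder> servers, double percentOfSegmentsToConsiderPerMove)
  {
    if (percentOfSegmentsToConsiderPerMove >= 100) {
      // Default: skip the calculation entirely and consider every segment.
      return Integer.MAX_VALUE;
    }
    long totalSegments = 0;
    for (ServerHolder holder : servers) {
      // The per-server segment count is already tracked on the server object,
      // so this is one cheap pass over the server list, not over the segments.
      totalSegments += holder.getServer().getNumSegments();
    }
    return (int) Math.ceil(totalSegments * (percentOfSegmentsToConsiderPerMove / 100.0));
  }
}
```

With the numbers above (200 servers x 10k segments at 25%), this yields 500,000 segments to consider per move instead of 2,000,000.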

@lgtm-com

lgtm-com Bot commented Sep 4, 2020

This pull request introduces 1 alert when merging b3f680d into 3fc8bc0 - view on LGTM.com

new alerts:

  • 1 for Missing space in string literal

@lgtm-com

lgtm-com Bot commented Sep 8, 2020

This pull request introduces 1 alert when merging a8a7c42 into 176b715 - view on LGTM.com

new alerts:

  • 1 for Missing space in string literal

@capistrant
Contributor Author

@a2l007 checking in to see if you will have some time to review this PR now that I have changed the implementation to use a percentage approach vs raw number.

@a2l007
Contributor

a2l007 commented Oct 30, 2020

Yeah sure, should be able to review it soon.

Contributor

@a2l007 a2l007 left a comment

Overall changes LGTM
Can we please add this property to the Coordinator Dynamic Config dialog box in the web console as well?

Comment thread docs/configuration/index.md Outdated
|`mergeBytesLimit`|The maximum total uncompressed size in bytes of segments to merge.|524288000L|
|`mergeSegmentsLimit`|The maximum number of segments that can be in a single [append task](../ingestion/tasks.md).|100|
|`maxSegmentsToMove`|The maximum number of segments that can be moved at any given time.|5|
|`percentOfSegmentsToConsiderPerMove`|The percentage of the total number of segments in the cluster that are considered every time a segment needs to be selected for a move. Druid orders servers by available capacity ascending (the least available capacity first) and then iterates over the servers. For each server, Druid iterates over the segments on the server, considering them for moving. The default config of 100% means that every segment on every server is a candidate to be moved. This should make sense for most small to medium-sized clusters. However, an admin may find it preferable to drop this value lower if they don't think that it is worthwhile to consider every single segment in the cluster each time it is looking for a segment to move.|
Contributor

Could you please add the default value for this property into the Default column?

Contributor

Also it would be helpful to provide some sort of recommended percentage value that operators with large clusters can use as a starting point for this property. We could either add it here or in the Basic Cluster Tuning doc.

* @param percentOfSegmentsToConsider The % of total cluster segments to consider before short-circuiting and
* returning immediately.
* @return
*/
Contributor

Thanks for the doc 👍

@capistrant
Contributor Author

Overall changes LGTM
Can we please add this property to the Coordinator Dynamic Config dialog box in the web console as well?

Good call out. I will add that. Added some more docs for helping guide when and how to set this value.

@a2l007
Contributor

a2l007 commented Nov 3, 2020

Could you please fix the spellcheck and console test errors as well?

@suneet-s
Contributor

suneet-s commented Nov 3, 2020

@capistrant Can we write integration tests for this dynamic config? Since it's not the default and used infrequently, I would hate for a refactoring to break this functionality and us not know about it

@capistrant
Contributor Author

capistrant commented Nov 3, 2020

@capistrant Can we write integration tests for this dynamic config? Since it's not the default and used infrequently, I would hate for a refactoring to break this functionality and us not know about it

Good point, @suneet-s. I like the idea of protecting this against regressions, but I'm wondering if there is the proper plumbing in place to make this easy to test as is. Wondering if you have any inputs/ideas on how it may be accomplished.

We want to test the integration between the Coordinator's dynamic config and the act of balancing the cluster. Specifically, we want to ensure that our new config is being honored by Coordinator, meaning that balancing considers the expected % of segments when looking for segments to move.

The easiest way I can think of is that we have our integration test cluster with a known number of segments in it. We set the config to some low % and monitor to make sure that no coordination cycle involves more moves than the number of segments in the cluster * (the config value / 100). But what exactly would we monitor? The number of moves and segments is currently only available in Druid's logs, and do we really want to scrape logs? We could emit balancing stats and monitor the emitter, but not all balancer strategy implementations emit the proper fields, so we would need to modify that. I guess EmitClusterStatsAndMetrics emits segment/moved/count. Could we perform an integration test that monitors emitter output?

The above struck-through section was me thinking of the wrong metric: we don't care about the number of segments moved, we care about the number of segments considered per move. Back to the drawing board on ideas for this.

Do you have any ideas that I may be overlooking?

@capistrant
Contributor Author

@suneet-s This ended up being a hard one to come up with an integration test for, since it is hard to observe from the outside whether the code is honoring this config. However, my latest commit adds a new test to BalanceSegmentsTest that validates that pickSegmentToMove is called with the non-default value the test uses for the new dynamic config. This provides some protection against regressions. What do you think?
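
For context, a rough sketch of that style of regression check, written with Mockito purely for illustration; the real BalanceSegmentsTest uses the project's own test harness, and the three-argument pickSegmentToMove signature shown here is an assumption based on this PR's description:

```java
import static org.mockito.ArgumentMatchers.anyList;
import static org.mockito.ArgumentMatchers.anySet;
import static org.mockito.ArgumentMatchers.eq;
import static org.mockito.Mockito.atLeastOnce;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;

import org.apache.druid.server.coordinator.BalancerStrategy;
import org.junit.Test;

public class PercentOfSegmentsToConsiderRegressionSketch
{
  @Test
  public void pickSegmentToMoveReceivesConfiguredPercent()
  {
    BalancerStrategy strategy = mock(BalancerStrategy.class);

    // ... wire the mocked strategy into a balancing run whose dynamic config
    // sets percentOfSegmentsToConsiderPerMove to a non-default value (40) ...

    // The regression check: the configured percentage must be the value that
    // is actually handed to the strategy when it picks a segment to move.
    verify(strategy, atLeastOnce())
        .pickSegmentToMove(anyList(), anySet(), eq(40.0));
  }
}
```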

@suneet-s
Contributor

@capistrant Sorry about not getting back to you on your last comment. If we set the percentage to a low enough value (1%), could we verify that Druid never moves a segment because it never has any to consider? Not sure if this is possible, but I figured I'd throw this out there.

The test you added to BalanceSegmentsTest seems like a good step to protecting us from regressions here.

Contributor

@suneet-s suneet-s left a comment

Just skimmed through the change and had some comments on the type of the parameter. Let me know what you think

this.maxSegmentsToMove = maxSegmentsToMove;
// This helps with ease of migration to the new config, but could confuse users. Docs explicitly state this value must be > 0
if (percentOfSegmentsToConsiderPerMove <= 0 || percentOfSegmentsToConsiderPerMove > 100) {
this.percentOfSegmentsToConsiderPerMove = 100;
Contributor

@suneet-s suneet-s Dec 19, 2020

Since this is a JSON-configured property, I think we should throw an exception here so the user knows they've done something incorrect.

EDIT: Something like
Preconditions.checkArgument(percentOfSegmentsToConsiderPerMove > 0 && percentOfSegmentsToConsiderPerMove <= 100, "percentOfSegmentsToConsiderPerMove should be between 0 and 100!");

Contributor Author

I agree, it's confusing to mask an error like I was doing. Fixed.
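
Presumably the fix ends up close to the suggestion above, replacing the silent reset in the CoordinatorDynamicConfig constructor with a fast failure (a fragment-style sketch using Guava's Preconditions, not the exact committed code):

```java
// Reject bad input instead of quietly resetting it to 100.
Preconditions.checkArgument(
    percentOfSegmentsToConsiderPerMove > 0 && percentOfSegmentsToConsiderPerMove <= 100,
    "percentOfSegmentsToConsiderPerMove should be between 0 and 100!"
);
this.percentOfSegmentsToConsiderPerMove = percentOfSegmentsToConsiderPerMove;
```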

@JsonProperty("mergeBytesLimit") long mergeBytesLimit,
@JsonProperty("mergeSegmentsLimit") int mergeSegmentsLimit,
@JsonProperty("maxSegmentsToMove") int maxSegmentsToMove,
@JsonProperty("percentOfSegmentsToConsiderPerMove") int percentOfSegmentsToConsiderPerMove,
Contributor

Should this be a double since we're dealing with percentages? I think it would make it less likely for someone in the future to make some math mistake by accidentally dividing by an integer.

Contributor Author

Ya, this is a good point. Using a double up front makes things a lot more straightforward.
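
To make the hazard concrete, a toy example of the integer-division mistake a double avoids (illustrative values only):

```java
public class PercentDivisionExample
{
  public static void main(String[] args)
  {
    long totalSegments = 1_000_000L;

    // Integer math first: 1 / 100 == 0, so the limit silently collapses to 0.
    int percentAsInt = 1;
    long wrongLimit = totalSegments * (percentAsInt / 100);

    // Keeping the percentage as a double avoids the truncation.
    double percentAsDouble = 1.0;
    long limit = (long) Math.ceil(totalSegments * (percentAsDouble / 100.0));

    System.out.println(wrongLimit); // 0
    System.out.println(limit);      // 10000
  }
}
```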

this.mergeBytesLimit = mergeBytesLimit;
this.mergeSegmentsLimit = mergeSegmentsLimit;
this.maxSegmentsToMove = maxSegmentsToMove;
this.percentOfSegmentsToConsiderPerMove = percentOfSegmentsToConsiderPerMove;
Contributor

Should this have a Preconditions check that it falls between 0 and 100?

Contributor Author

Hmm, since this is a Builder class that builds a CoordinatorDynamicConfig object, I think this precondition is covered by the actual constructor for the CoordinatorDynamicConfig class. IMO, having another precondition here is redundant.

Contributor

makes sense to me

// Reset a bad value of percentOfSegmentsToConsider to 100. We don't allow consideration less than or equal to
// 0% of segments or greater than 100% of segments.
if (percentOfSegmentsToConsider <= 0 || percentOfSegmentsToConsider > 100) {
log.debug("Resetting percentOfSegmentsToConsider to 100 because only values from 1 to 100 are allowed."
Contributor

IMO this should be at least a WARN, since it should be impossible to hit this branch; percentOfSegmentsToConsider should have already been checked earlier in the system.

Contributor Author

I'm cool with this being WARN.
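
So the handling in ReservoirSegmentSampler presumably ends up along these lines (a sketch of the agreed change, not the exact merged code):

```java
// Out-of-range values should have been rejected upstream in
// CoordinatorDynamicConfig, so reaching this branch is surprising enough to WARN.
if (percentOfSegmentsToConsider <= 0 || percentOfSegmentsToConsider > 100) {
  log.warn("Resetting percentOfSegmentsToConsider to 100 because only values from 1 to 100 are allowed.");
  percentOfSegmentsToConsider = 100;
}
```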

@suneet-s suneet-s merged commit 58ce2e5 into apache:master Dec 22, 2020
@jihoonson jihoonson added this to the 0.21.0 milestone Jan 4, 2021
JulianJaffePinterest pushed a commit to JulianJaffePinterest/druid that referenced this pull request Jan 22, 2021
…ents are considered when picking a segment to move. (apache#10284)

* dynamic coord config adding more balancing control

add new dynamic coordinator config, maxSegmentsToConsiderPerMove. This
config caps the number of segments that are iterated over when selecting
a segment to move. The default value combined with current balancing
strategies will still iterate over all provided segments. However,
setting this value to something > 0 will cap the number of segments
visited. This could make sense in cases where a cluster has a very large
number of segments and the admins prefer less iterations vs a thorough
consideration of all segments provided.

* fix checkstyle failure

* Make doc more detailed for admin to understand when/why to use new config

* refactor PR to use a % of segments instead of raw number

* update the docs

* remove bad doc line

* fix typo in name of new dynamic config

* update RservoirSegmentSampler to gracefully deal with values > 100%

* add handler for <= 0 in ReservoirSegmentSampler

* fixup CoordinatorDynamicConfigTest naming and argument ordering

* fix items in docs after spellcheck flags

* Fix lgtm flag on missing space in string literal

* improve documentation for new config

* Add default value to config docs and add advice in cluster tuning doc

* Add percentOfSegmentsToConsiderPerMove to web console coord config dialog

* update jest snapshot after console change

* fix spell checker errors

* Improve debug logging in getRandomSegmentBalancerHolder to cover all bad inputs for % of segments to consider

* add new config back to web console module after merge with master

* fix ReservoirSegmentSamplerTest

* fix line breaks in coordinator console dialog

* Add a test that helps ensure not regressions for percentOfSegmentsToConsiderPerMove

* Make improvements based off of feedback in review

* additional cleanup coming from review

* Add a warning log if limit on segments to consider for move can't be calcluated

* remove unused import

* fix tests for CoordinatorDynamicConfig

* remove precondition test that is redundant in CoordinatorDynamicConfig Builder class