Coordinator fix balancer stuck #5987
Merged
jon-wei merged 6 commits into apache:master on Jul 12, 2018
Conversation
Member (Author): test failure appears related
Member (Author): Current failures seem unrelated
jon-wei approved these changes Jul 12, 2018
clintropolis added a commit to implydata/druid-public that referenced this pull request Jul 18, 2018
* this will fix it
* filter destinations to not consider servers already serving segment
* fix it
* cleanup
* fix opposite day in ImmutableDruidServer.equals
* simplify
This appears to happen when the balancer runs before the metadata manager polls the metadata database, which causes the `if` statement inside the `for` loop to never be satisfied. Added a maximum iteration count so the loop gives up and logs about it. I was able to reproduce this in a local debug cluster, which is where I should've caught it in the first place, my bad. Fixes #5981
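To illustrate the shape of that fix, here is a minimal sketch of a bounded retry loop. The names (`BalancerSketch`, `pickSegmentToMove`, `findSegmentToMove`, `maxIterations`) are illustrative stand-ins, not the actual Druid identifiers:

```java
import java.util.List;
import java.util.Random;

public class BalancerSketch
{
  // Hypothetical stand-in for picking a candidate segment; returns null when
  // the metadata manager has not polled yet and there is nothing to pick from.
  static String pickSegmentToMove(List<String> polledSegments, Random rng)
  {
    if (polledSegments.isEmpty()) {
      return null;
    }
    return polledSegments.get(rng.nextInt(polledSegments.size()));
  }

  // Bounded retry: instead of looping until a candidate is found (which can
  // never happen before the first metadata poll), give up after maxIterations
  // and log about it.
  static String findSegmentToMove(List<String> polledSegments, int maxIterations)
  {
    Random rng = new Random(0);
    for (int i = 0; i < maxIterations; i++) {
      String candidate = pickSegmentToMove(polledSegments, rng);
      if (candidate != null) {
        return candidate;
      }
    }
    System.err.println("Unable to find a segment to move after " + maxIterations + " iterations");
    return null;
  }
}
```

With an empty (not-yet-polled) segment list the loop now terminates and logs, rather than spinning forever.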
It also fixes issues with correctly counting 'moved' and 'unmoved' segments, and optimizes the cost calculation by excluding servers that already have a replica of a segment from consideration as a move target. Previously, a server that already had the segment could be selected as the 'best' destination, but the move function would then bail out without doing anything, incorrectly counting the segment as 'moved' without any corresponding log.
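The destination-filtering idea can be sketched as below; the server representation (a map from server name to the set of segment ids it serves) and the name `eligibleDestinations` are assumptions for illustration, not Druid's actual types:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class DestinationFilterSketch
{
  // Servers already holding a replica can never be a valid move target, so
  // exclude them up front, before the cost calculation ranks destinations,
  // instead of letting the move function bail out afterwards.
  static List<String> eligibleDestinations(Map<String, Set<String>> servers, String segmentId)
  {
    return servers.entrySet()
                  .stream()
                  .filter(e -> !e.getValue().contains(segmentId))
                  .map(Map.Entry::getKey)
                  .collect(Collectors.toList());
  }
}
```

Filtering first means the 'best' destination chosen by the cost calculation is always one the move can actually use.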
Finally, `ImmutableDruidServer.equals` was broken, but it doesn't appear to be called anywhere other than in this change, so maybe not a big deal (most things deal with `ServerHolder`, whose `.equals` picks some properties of `ImmutableDruidServer` instead of calling its `.equals` directly).
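For reference, a correct `equals`/`hashCode` pair for a small value class looks like the sketch below. The class and field names are illustrative, not `ImmutableDruidServer`'s actual fields; the comment marks the inverted check that an "opposite day" bug of this kind typically involves:

```java
import java.util.Objects;

// Minimal value-class sketch with a correct equals contract.
final class ServerId
{
  private final String name;
  private final String host;

  ServerId(String name, String host)
  {
    this.name = name;
    this.host = host;
  }

  @Override
  public boolean equals(Object o)
  {
    if (this == o) {
      return true;
    }
    // An inverted condition here (e.g. returning false when classes DO match)
    // makes equals reject equal instances; the correct form is:
    if (o == null || getClass() != o.getClass()) {
      return false;
    }
    ServerId that = (ServerId) o;
    return Objects.equals(name, that.name) && Objects.equals(host, that.host);
  }

  @Override
  public int hashCode()
  {
    return Objects.hash(name, host);
  }
}
```

Keeping `hashCode` consistent with `equals` matters too, since these objects may end up as keys in hash-based collections.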