Priority on loading for primary replica #4757
Conversation
```java
final Optional<ServerHolder> primaryHolderToLoad;
if (totalReplicantsInCluster <= 0) {
  log.trace("No replicants for %s", segment.getIdentifier());
  primaryHolderToLoad = getPrimaryHolder(
```
how about getPriorityHolder instead?
```java
    return stats;
  }
} else {
  primaryHolderToLoad = Optional.empty();
```
can this logic flip around? I find the tiny cases easier to read if they are first. For example, change to:

```java
if (totalReplicantsInCluster > 0) {
  primaryHolderToLoad = Optional.empty();
} else {
  .....
}
```

```java
    serverHolderPredicate
);

if (primaryHolderToLoad.isPresent()) {
```
this feels strange to not use a functional workflow here, but it is not clear a functional flow would be easier to read.
On the contrary, I removed use of Optional and mapping, because IMO it only adds obscurity. "Too functional" for Java
```java
  ++totalReplicantsInCluster;
} else {
  log.trace("No primary holder found for %s", segment.getIdentifier());
  return stats;
```
this is hidden in a weird spot. Is this the same behavior as previously?
Also suggest moving this up to the top of the if statement and flipping the boolean for ease of readability.
Sure, I'll move it up.
And yeah, I think I have mistakenly skipped the "drop" part; I will add that in.
```java
return candidates
    .stream()
    .max((s1, s2) -> Ints.compare(s1.getServer().getPriority(), s2.getServer().getPriority()));
```
can this get resolved without materializing the whole candidates array?
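For illustration, a minimal sketch of what "resolving without materializing" could look like: track the max-priority candidate inline while candidates are produced, instead of collecting them into a list and streaming it. Plain ints stand in for `ServerHolder` priorities; all names here are hypothetical.

```java
import java.util.List;

public class TopPriority {
    // Keep a running max while iterating; no intermediate candidates list.
    // Returns -1 if there are no candidates (a stand-in for Optional.empty()).
    static int topPriority(List<Integer> candidatePriorities) {
        Integer top = null;
        for (Integer p : candidatePriorities) {
            if (top == null || p > top) {
                top = p;
            }
        }
        return top == null ? -1 : top;
    }
}
```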
```java
{
  final List<ServerHolder> candidates = Lists.newLinkedList();

  for (final Map.Entry<String, Integer> entry : tieredReplicants.entrySet()) {
```
This workflow could probably be a lot cleaner if rewritten functionally
```java
    continue;
  }

  final ServerHolder candidate = strategy.findNewSegmentHomeReplicator(
```
If I'm reading this correctly, this is doing a LOT more here compared to what it did previously.
I think it should not be that bad. If the findNewSegmentHomeReplicator method call is of concern, then this change will make N - 1 more calls where N is the number of tiers. If we have about 2 tiers per rule, this will make one more call.
```java
@@ -57,12 +59,58 @@ public CoordinatorStats run(DruidCoordinator coordinator, DruidCoordinatorRuntim

final Map<String, Integer> loadStatus = Maps.newHashMap();
```
@xanec could you please extract some functionality into dedicated methods from this run() method. It's too long.
Sure, do you mind if I do some significant refactoring?
```java
final Map<String, Integer> tieredReplicants = getTieredReplicants();
for (final String tier : tieredReplicants.keySet()) {
  stats.addToTieredStat(ASSIGNED_COUNT, tier, 0);
```
why does this need to be initialized now?
I don't think this needs to be initialized. All the components underneath lazily add entries as needed.
This is for the expected behavior in one of the tests: https://github.com/druid-io/druid/blob/d43687d578bf3dea98b01d0899bcfbb2125d142e/server/src/test/java/io/druid/server/coordinator/rules/LoadRuleTest.java#L631
Without the initialization, stat3 would not have the hot tier and would throw an NPE. Note that previously, a 0 would be added into the statistics when no assignment was done:
https://github.com/druid-io/druid/blob/d43687d578bf3dea98b01d0899bcfbb2125d142e/server/src/main/java/io/druid/server/coordinator/rules/LoadRule.java#L102
Personally, I also feel that no initialization is required, but I am not sure if receiving 0 is part of the expected usage.
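As a minimal sketch of the NPE concern discussed here, using a plain `HashMap` as a hypothetical stand-in for the tiered stats (names are illustrative, not the actual Druid API): a consumer that unboxes a lookup for a tier that was never assigned gets an NPE unless the entry is pre-initialized or the lookup defaults to 0.

```java
import java.util.HashMap;
import java.util.Map;

public class TieredStats {
    // Lazily-populated stats map: tiers appear only when something is assigned.
    static final Map<String, Integer> assignedCount = new HashMap<>();

    // Unboxing a missing entry throws NullPointerException.
    static int countUnsafe(String tier) {
        return assignedCount.get(tier);
    }

    // Defaulting to 0 avoids the need to pre-initialize every tier.
    static int countSafe(String tier) {
        return assignedCount.getOrDefault(tier, 0);
    }
}
```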
@leventov @drcrallen I have implemented the previously requested changes and did some refactoring. Could you kindly review the code again? Thanks.
```java
}

final MinMaxPriorityQueue<ServerHolder> serverQueue = params.getDruidCluster().getHistoricalsByTier(tier);
```

```java
private static Predicate<ServerHolder> createPredicate(final DruidCoordinatorRuntimeParams params)
```
Could this method be given a more descriptive name? E.g. "createLoadQueueSizeLimitingPredicate()"
```java
  final String tier,
  final DruidCluster druidCluster,
  final Predicate<ServerHolder> firstPredicate,
  final Predicate<ServerHolder>... otherPredicates
```
Suggested that this method just accept one predicate; callers can call and() on predicates, if needed.
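A small sketch of the suggested composition with `java.util.function.Predicate.and()`. The predicates below are hypothetical stand-ins (integer queue sizes instead of `ServerHolder` checks), not the actual Druid predicates.

```java
import java.util.function.Predicate;

public class LoadQueuePredicates {
    // Hypothetical filters a caller might want to combine.
    static final Predicate<Integer> underQueueLimit = queueSize -> queueSize < 100;
    static final Predicate<Integer> nonNegative = queueSize -> queueSize >= 0;

    // The callee takes ONE predicate; the caller composes with and() as needed.
    static boolean accepts(int queueSize) {
        Predicate<Integer> combined = underQueueLimit.and(nonNegative);
        return combined.test(queueSize);
    }
}
```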
```java
  );
  return numAssigned;
}
```

```java
holders.remove(holder);
```
This is to prevent the holder from serving more than one replica of the same segment. During the current run, I believe the assignment will not be immediately reflected in the ServerHolder. Am I mistaken about the expected behavior?
```java
  log.makeAlert("No holders found for tier[%s]", tier).emit();
  numDropped = 0;
} else {
  final int numToDrop = entry.getIntValue() - targetReplicants.getOrDefault(tier, 0);
```
Suggested to extract entry.getIntValue() as "currentReplicants" for readability
```java
for (final Object2IntMap.Entry<String> entry : targetReplicants.object2IntEntrySet()) {
  final String tier = entry.getKey();
  // if there are replicants loading in cluster
  if (druidCluster.hasTier(tier) && entry.getIntValue() > currentReplicants.getOrDefault(tier, 0)) {
```
Why, if this is true in any tier, do we exit the method? Maybe still drop segments in other tiers?
Yeah, I also do not understand the complete rationale behind this decision, but the test cases do enforce such behavior. Maybe it is to prevent some form of "thrashing" whereby segments get loaded and dropped excessively?
```java
private static int dropForTier(
    final int numToDrop,
    final MinMaxPriorityQueue<ServerHolder> holders,
```
Maybe "tierHolders" or "holdersInTier"
```java
{
  int numDropped = 0;

  final List<ServerHolder> droppedHolders = new LinkedList<>();
```
There is no point to use LinkedList here, please use ArrayList
Sure, but won't a LinkedList be more suitable here, since the size is unknown and we don't need random access? Using a LinkedList also prevents the problem of array resizing.
No, if created as ArrayList<>(1) it is strictly always more memory efficient than LinkedList, and amortized cost of adding an element is smaller. LinkedList is almost never a good data structure as is (only intrusive linked lists are useful in some forms, sometimes). Exceptions, when LinkedList is useful itself, are so rare that I've never seen them in my practice, and there are no such in the Druid codebase.
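A sketch of the pattern under discussion with `ArrayList`, assuming a small initial capacity is acceptable (strings stand in for `ServerHolder`s; method and class names are hypothetical). Resizing is cheap in practice, and an `ArrayList` avoids LinkedList's per-node object overhead.

```java
import java.util.ArrayList;
import java.util.List;

public class DroppedHolders {
    // Collect up to numToDrop holders; final size is unknown up front, but a
    // small-capacity ArrayList still beats LinkedList on memory and add cost.
    static List<String> collectUpTo(List<String> holders, int numToDrop) {
        final List<String> dropped = new ArrayList<>(Math.max(1, numToDrop));
        for (String holder : holders) {
            if (dropped.size() >= numToDrop) {
                break;
            }
            dropped.add(holder);
        }
        return dropped;
    }
}
```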
```java
final List<ServerHolder> droppedHolders = new LinkedList<>();
while (numDropped < numToDrop) {
  final ServerHolder holder = holders.pollLast();
```
Why need to remove holders, and then add back? Couldn't the same logic be expressed merely with iteration?
This is because we are using pollLast, an iterator will not give us the same order (i.e., in descending available size).
I inspected all usages of MinMaxPriorityQueue<ServerHolder> in the coordinator code, and it seems to me that poll() or peek() or pollFirst() or peekFirst() is never called on those min-max queues. They are only iterated "directly", have elements added to them, and here, the min-max queue is "iterated" in reverse order, via calling pollLast() and then adding elements back.

It means that either:
- the min-max queue is not needed, and it could be a simple `PriorityQueue` in the reverse order from what is used now to create min-max queues, or
- if some particular order is expected when those queues are iterated, usage of min-max queues is a mistake, because it doesn't guarantee any particular iteration order. In this case, it should be replaced with `TreeSet`.
Also @fjy could you please comment here
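To illustrate the `TreeSet` option: a `TreeSet` iterates in comparator order, so "visit servers from largest to smallest available size" is just `descendingIterator()`, with no pollLast()-then-add-back dance. Integers stand in for holders ordered by available size; this is a sketch, not the Druid code. (Guava's `MinMaxPriorityQueue` orders only poll/peek; its iterator order is unspecified.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class TierIteration {
    // Deterministic descending traversal without mutating the collection.
    static List<Integer> inDescendingOrder(List<Integer> availableSizes) {
        final TreeSet<Integer> holders = new TreeSet<>(availableSizes);
        final List<Integer> result = new ArrayList<>(holders.size());
        holders.descendingIterator().forEachRemaining(result::add);
        return result;
    }
}
```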
```java
if (serverQueue == null) {
  log.makeAlert("Tier[%s] has no servers! Check your cluster configuration!", tier).emit();
```

```java
@SafeVarargs
private static List<ServerHolder> getHolderList(
```
```java
stats.addToTieredStat(ASSIGNED_COUNT, tier, numAssigned);

// tier with primary replica
final int targetReplicantsInTier = targetReplicants.removeInt(tier);
```
Maybe add @Nullable String primaryTier parameter to assignReplicas(), to avoid this error-prone remove tier - add tier code.
```java
  final DruidCoordinatorRuntimeParams params,
  final DataSegment segment,
  final CoordinatorStats stats,
  @Nullable final String primaryTier
```
Also maybe call it "tierToSkip", to make the intention more explicit.
```diff
 {
   MinMaxPriorityQueue<ServerHolder> servers = historicals.get(tier);
-  return (servers == null) || servers.isEmpty();
+  return (servers != null) && !servers.isEmpty();
```
is it covered in unit tests?
```java
 * Iterates through each tier and find the respective segment homes; with the found segment homes, selects the one
 * with the highest priority to be the holder for the primary replica.
 *
 * @param params
```
Please remove empty javadoc stubs
```java
} else {
  // cache the result for later use.
  strategyCache.put(tier, candidate);
  if (
      topCandidate == null ||
      candidate.getServer().getPriority() > topCandidate.getServer().getPriority()
  ) {
```
```java
}

return stats;

/***
 *
 * @param params
```
```java
if (leftToLoad > 0) {
  return stats;
```

```java
// This enforces that loading is completed before we attempt to drop stuffs as a safety measure
for (final Object2IntMap.Entry<String> entry : targetReplicants.object2IntEntrySet()) {
```
Please extract this block as a method with boolean result
```java
@@ -293,12 +293,8 @@ private void drop(

// Make sure we have enough loaded replicants in the correct tiers in the cluster before doing anything
```
With method extracted, this comment line doesn't make much sense to me
@xanec could you please remove "Enforce Indentation with Checkstyle" commit from the history?
0d181a0 to d7b7c55
@drcrallen do you have more comments here?
```diff
 DruidCoordinatorRuntimeParams params,
 String tier,
-MinMaxPriorityQueue<ServerHolder> servers,
+NavigableSet<ServerHolder> servers,
```
(Optional) Does this need to be anything other than Collection<ServerHolder>?
For some of the uses of the NavigableSet, the ordering and uniqueness seem implicit in the logic (e.g., in loops). From what I can see, while very unlikely, changes in these two qualities may alter the behavior. Hence, I have changed the variable to SortedSet instead, as a safeguard against future changes to DruidCluster.getSortedHistoricalByTier.
For the other uses, I have changed it to Iterable.
```java
Map<String, VersionedIntervalTimeline<String, DataSegment>> timelines = Maps.newHashMap();
```

```diff
-for (MinMaxPriorityQueue<ServerHolder> serverHolders : cluster.getSortedHistoricalsByTier()) {
+for (NavigableSet<ServerHolder> serverHolders : cluster.getSortedHistoricalsByTier()) {
```
(Optional) Does this need to be anything other than Collection<ServerHolder>?
```java
// cleanup before it finished polling the metadata storage for available segments for the first time.
if (!availableSegments.isEmpty()) {
```

```diff
-  for (MinMaxPriorityQueue<ServerHolder> serverHolders : cluster.getSortedHistoricalsByTier()) {
+  for (NavigableSet<ServerHolder> serverHolders : cluster.getSortedHistoricalsByTier()) {
```
(Optional) Does this need to be anything other than Collection<ServerHolder>?
```java
log.info("Load Queues:");
```

```diff
-for (MinMaxPriorityQueue<ServerHolder> serverHolders : cluster.getSortedHistoricalsByTier()) {
+for (NavigableSet<ServerHolder> serverHolders : cluster.getSortedHistoricalsByTier()) {
```
(Optional) Does this need to be anything other than Collection<ServerHolder>?
```java
    .getTotalReplicants(segment.getIdentifier(), tier);
final int loadedReplicantsInTier = params.getSegmentReplicantLookup()
    .getLoadedReplicants(segment.getIdentifier(), tier);
```

```java
// performs
tier,
targetReplicants.getOrDefault(tier, 0),
// note: adding 1 to currentReplicantsInTier to account for the one assigned as primary replica
currentReplicants.getOrDefault(tier, 0) + 1,
```
currentReplicants.getOrDefault(tier, 0) should always be 0 here, right? Can this just be a hardcoded 1? Actually, I suggest changing the logic here a bit: make `int numAssigned = 1;` immediately after your assignPrimary call, then make this statement `numAssigned += ...`, with numAssigned passed as a parameter, maybe with a code comment that it will always be 1. IMHO that makes what is going on easier to follow.
```java
  final DruidCoordinatorRuntimeParams params,
  final DataSegment segment,
  final DruidCoordinatorRuntimeParams params
  final CoordinatorStats stats
```
this is deceiving, it is a final object reference, but the contents are modified. Suggest adding a method comment to such effect.
```java
  final int targetReplicantsInTier,
  final int currentReplicantsInTier,
  final DruidCoordinatorRuntimeParams params,
  final List<ServerHolder> holders,
```
This is modified in the method call, suggest calling that out in the method docs
I have moved the retrieval of ServerHolders into the method to maintain "immutability" of parameters.
```java
  final DataSegment segment
)
{
  int numDropped = 0;
```
(Optional) This can be replaced with java 8 awesomeness:

```java
return StreamSupport
    .stream(Spliterators.spliteratorUnknownSize(holdersInTier.descendingIterator(), Spliterator.ORDERED), false)
    .limit(numToDrop)
    .filter(sh -> sh.isServingSegment(segment))
    .mapToInt(sh -> {
      sh.getPeon().dropSegment(segment, null);
      return 1;
    })
    .sum();
```

Wait, actually this doesn't work but passes tests, which is not good.
```java
return StreamSupport
    .stream(Spliterators.spliteratorUnknownSize(holdersInTier.descendingIterator(), Spliterator.ORDERED), false)
    .filter(sh -> sh.isServingSegment(segment))
    .limit(numToDrop)
    .mapToInt(sh -> {
      sh.getPeon().dropSegment(segment, null);
      return 1;
    })
    .sum();
```

is the correct one, I think, but it highlights a blind spot in the testing.
I think I am going to skip using monad-chaining on this one, because we still need to work the log.warn() into it and it does not seem quite worth it anymore 🤷♂️
I've included an additional auxiliary fixture to fail the first implementation.
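The blind spot discussed above comes down to stream operation ordering: `limit` before `filter` truncates the stream before the predicate runs, so it can miss matches, while `filter` before `limit` takes up to n elements that actually match. A minimal, self-contained illustration (even numbers stand in for "holders serving the segment"; names are hypothetical):

```java
import java.util.List;

public class LimitFilterOrder {
    // Truncates first, then filters: may return fewer matches than exist.
    static long limitThenFilter(List<Integer> xs, int n) {
        return xs.stream().limit(n).filter(x -> x % 2 == 0).count();
    }

    // Filters first, then truncates: takes up to n actual matches.
    static long filterThenLimit(List<Integer> xs, int n) {
        return xs.stream().filter(x -> x % 2 == 0).limit(n).count();
    }
}
```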
* Priority on loading for primary replica
* Simplicity fixes
* Fix on skipping drop for quick return.
* change to debug logging for no replicants.
* Fix on filter logic
* swapping if-else
* Fix on wrong "hasTier" logic
* Refactoring of LoadRule
* Rename createPredicate to createLoadQueueSizeLimitingPredicate
* Rename getHolderList to getFilteredHolders
* remove varargs
* extract out currentReplicantsInTier
* rename holders to holdersInTier
* don't do temporary removal of tier.
* rename primaryTier to tierToSkip
* change LinkedList to ArrayList
* Change MinMaxPriorityQueue in DruidCluster to TreeSet.
* Adding some comments.
* Modify log messages in light of predicates.
* Add in-method comments
* Don't create new Object2IntOpenHashMap for each run() call.
* Cache result from strategy call in the primary assignment to be reused during the same run.
* Spelling mistake
* Cleaning up javadoc.
* refactor out loading in progress check.
* Removed redundant comment.
* Removed forbidden API
* Correct non-forbidden API.
* Precision in variable type for NavigableSet.
* Obsolete comment.
* Clarity in method call and moving retrieval of ServerHolder into method call.
* Comment on mutability of CoordinatoorStats.
* Added auxiliary fixture for dropping.
This PR seeks to address a previously-encountered bug: if a `LoadRule` has multiple tiers, prolonged loading on a (lower) tier seems to "block" another tier from loading even when the latter is available for loading. Preliminary investigation suggests that, depending on the `BalancerStrategy` and/or configurations (e.g., `maxSegmentsInNodeLoadingQueue` and `replicationThrottleLimit`), the circumstances may result in missed opportunities for loading, which could be critical when prompt loading in bulk is required (e.g., recovery of historicals).

In the current implementation, the primary replica will be assigned to the tier that comes first in the `tieredReplicants` map when there is no replica in the cluster, and this replica will not be throttled: https://github.com/druid-io/druid/blob/d43687d578bf3dea98b01d0899bcfbb2125d142e/server/src/main/java/io/druid/server/coordinator/rules/LoadRule.java#L130-L134

The main mechanism in this PR uses the `priority` parameter set on the server to prioritize which server is selected as the holder of the primary replica. The change simply identifies when the primary replica needs to be loaded (i.e., when `totalReplicantsInCluster <= 0`) and prioritizes appropriately (i.e., selects among the candidate servers the one with the highest priority), instead of using the arbitrary ordering of the hash map.