DruidLeaderSelector interface for leader election and Curator based impl. #4699
drcrallen merged 9 commits into apache:master
Conversation
Force-pushed from b30e2b1 to ba16965
Force-pushed from ba16965 to 73e135d
…mpl. DruidCoordinator/TaskMaster are updated to use the new interface.
Force-pushed from 6259080 to 26600e4
👍
leader = true;
term++;
Doesn't ZooKeeper already have a term concept? Is there any benefit to using another term instead of inheriting the one in ZooKeeper?
Curator LeaderLatch does not expose any term concept; you are probably thinking of the internal leader election among ZooKeeper quorum members.
Also, this term is kept to preserve the behavior of the existing code in DruidCoordinator, which maintained a local term to keep things in check.
leader = false;
try {
  // Small delay before starting the latch so that others waiting are chosen to become leader.
  Thread.sleep(1000);
An unlucky GC could easily induce a pause of 1 second. Is there a way to do this without the mandatory pause?
This is probably the only place in the whole PR where I have intentionally introduced something that did not exist before, and it is done to fix the following scenario:
- Say this coordinator node has a problem becoming leader (e.g. not being able to reach the database). It then needs to tell Curator to give up its leadership so that someone else becomes the leader. However, if it is too quick in starting the next latch, Curator may end up choosing this node as leader again and keep repeating the cycle.

This simple artificial pause solves the problem without introducing any issues.
I can't think of a better way right now without introducing a lot of potentially unhelpful complexity.
The overlord was using Curator LeaderSelector and not LeaderLatch, which is a completely different implementation. Most likely #3428 would be solved by this.
For stuff like this, a random sleep works better than a fixed sleep, since a fixed sleep can cause the same sequence of leader changes to play out over and over again. And I think a random sleep is a totally valid mechanism of trying to jolt a leader-election system onto a better leader. It's pretty common in leader-election algorithms.
In this case, maybe a random sleep between 500ms and 5000ms would work.
Makes sense; changed to a random sleep between 1000ms and 5000ms.
The minimum is 1000ms rather than 500ms just to be safer, and also because that is the value I mostly tested with.
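As a concrete illustration of the randomized backoff agreed on above (the class and method names are hypothetical; only the 1000ms to 5000ms bounds come from the discussion):

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of the randomized pause before re-creating the leader
// latch. Only the 1000ms-5000ms bounds come from the discussion; the class
// and method names are illustrative, not the PR's actual code.
class LeaderLatchBackoff
{
  static final long MIN_SLEEP_MS = 1000;
  static final long MAX_SLEEP_MS = 5000;

  // A random delay, unlike a fixed one, makes it unlikely that the same
  // unhealthy node keeps winning re-election in a tight loop.
  static long nextBackoffMs()
  {
    return ThreadLocalRandom.current().nextLong(MIN_SLEEP_MS, MAX_SLEEP_MS);
  }

  public static void main(String[] args)
  {
    long sleepMs = nextBackoffMs();
    if (sleepMs < MIN_SLEEP_MS || sleepMs >= MAX_SLEEP_MS) {
      throw new AssertionError("backoff out of range: " + sleepMs);
    }
    System.out.println("would sleep " + sleepMs + "ms before re-creating the latch");
  }
}
```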
drcrallen left a comment
@himanshug can you please fill out the interface method docs for the new interface? It is impossible to review the PR without actually knowing what the interface is supposed to be doing.
Execs.singleThreaded(StringUtils.format("LeaderSelector[%s]", latchPath))
This is not lifecycled at all. Is it possible to have the lifecycle controlled here somehow, to make sure that the executor is not running when this JVM is not the leader?
Also, is it possible to add the term to the string format?
This executor does not change at each term, so the term is not part of the name. Also, it consists of a daemon thread and does not need to be stopped explicitly, except when the JVM is shutting down.
Note that it needs to be running even when this node is not the leader, in order to get the notification that this node should become leader; it is the "watcher" passed to the Curator LeaderLatch.
Gotcha, I misunderstood the scope of this object.
@himanshug wait, this recurses via https://github.com/druid-io/druid/pull/4699/files#diff-2ed9a00e63ba80d914302e0f4b2a18b4R93 so it can have a lot of these, right?
Hmmm, given that this code existed before and has been working fine, I had earlier thought these would become garbage and be GC'd away, since no reference to the executors is maintained.
However, I ran some tests today to verify that, and it turns out these objects don't get GC'd; all of them linger in the JVM. They don't cause anything bad, but they waste some memory. I will update the code to fix this, thanks.
@drcrallen updated the code to create the executor just once instead of recreating it every time.
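A minimal sketch of that fix, under the assumption that only the latch needs to be recreated per term; the class, field, and method names here are illustrative stand-ins, not the actual PR code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative stand-in (not the actual PR code) for creating the listener
// executor once and reusing it across latch recreations, so each new term
// no longer leaks another executor.
class LatchRecreationSketch
{
  // Created once; a single daemon thread that watches leadership changes.
  final ExecutorService listenerExec = Executors.newSingleThreadExecutor(r -> {
    Thread t = new Thread(r, "LeaderSelector[sketch]");
    t.setDaemon(true);
    return t;
  });

  final AtomicInteger latchesCreated = new AtomicInteger();

  // Stand-in for createNewLeaderLatch(): only the latch object is new; the
  // executor handed to it is the shared field above.
  Object createNewLeaderLatch()
  {
    latchesCreated.incrementAndGet();
    return new Object(); // placeholder for a Curator LeaderLatch
  }

  public static void main(String[] args)
  {
    LatchRecreationSketch s = new LatchRecreationSketch();
    ExecutorService first = s.listenerExec;
    s.createNewLeaderLatch();
    s.createNewLeaderLatch();
    if (first != s.listenerExec || s.latchesCreated.get() != 2) {
      throw new AssertionError();
    }
    System.out.println("executor reused across " + s.latchesCreated.get() + " latch recreations");
  }
}
```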
@Override
public String getCurrentLeader()

@Override
public boolean isLeader()
{
  return leader;
If the leader status is being contested, should this block until the contest is settled?
While under contest at the Curator/ZooKeeper level, this would be false, which is in line with the current semantics.
/**
 */
public interface DruidLeaderSelector
This class needs a LOT more documentation on what the method contracts are.
private static class DruidLeaderSelectorProvider implements Provider<DruidLeaderSelector>
Usually, to ease testing, there is a constructor annotated with @Inject that populates private final fields.
This is a utility class only, and the constructor is called explicitly to provide the latchPath, which is not part of the Guice binding; the other dependencies are injected, which is why it looks a bit different.
return ScheduledExecutors.Signal.STOP;
}

@Override
public ScheduledExecutors.Signal call()
What is the intended behavior if leadership is lost during this call?
This scheduled callable would stop repeating itself, and will get rescheduled when this node becomes leader again. I think the current semantics are retained.
Yes. This also means that if lookup management is taking a long time (minutes), this is going to continue doing lookup management until it either completes or errors out; only then will it give up leadership.
In such a scenario it is possible for the new leader to also do lookup management at the same time.
Is there a way to mitigate such a scenario?
Well, it is mitigated in the sense that even if multiple coordinators are doing lookup management, nothing bad happens. In the worst case, downstream nodes would receive requests to load the same lookup multiple times, and those requests would be ignored.
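The behavior described here can be modeled with a plain-Java stand-in; Signal below mimics ScheduledExecutors.Signal, and the field and task names are hypothetical:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicBoolean;

// Plain-Java model of the pattern discussed above: a repeating callable that
// returns STOP once leadership is lost and gets rescheduled on re-election.
// Signal mimics ScheduledExecutors.Signal; all names are illustrative.
class LeadershipGatedTask
{
  enum Signal { REPEAT, STOP }

  final AtomicBoolean leader = new AtomicBoolean(true);

  final Callable<Signal> lookupManagement = () -> {
    if (!leader.get()) {
      // Leadership was lost, possibly while a pass was running: the current
      // pass still finishes (or errors out) before this STOP is returned, so
      // the new leader may briefly overlap. That overlap is harmless here
      // because duplicate lookup-load requests are ignored downstream.
      return Signal.STOP;
    }
    // ... one pass of lookup management would happen here ...
    return Signal.REPEAT;
  };

  void stopBeingLeader()
  {
    leader.set(false);
  }

  public static void main(String[] args) throws Exception
  {
    LeadershipGatedTask task = new LeadershipGatedTask();
    if (task.lookupManagement.call() != Signal.REPEAT) throw new AssertionError();
    task.stopBeingLeader();
    if (task.lookupManagement.call() != Signal.STOP) throw new AssertionError();
    System.out.println("callable repeats while leader, stops after leadership loss");
  }
}
```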
public String getOverlordPath()
{
  return defaultPath("overlord");
This is different from the other patterns, which have a settable path. Why does this one require only the default path?
I intentionally did not make the latch path configurable. IMO only the base ZooKeeper path should be configurable; all the other locations inside that base path should be dictated by Druid. I am not sure there is any value in making internal paths configurable.
My understanding is that the existing paths were made configurable only to support migration, not because there is any use case for them to really be user configurable.
I understand the reasoning, but this should be consistent with the other items in the class. If it is not needed to have settable paths, then this is not the right PR to make such a change.
Earlier we supported the druid.zk.paths.indexer.leaderLatchPath property, which was used by the overlord and obtained via IndexerZkConfig. (I'm going to note this in the release notes.)
Even if I honor that property in this PR, the overlord is still incompatible, given that the underlying leader election algorithm changes. Since compatibility is broken anyway, I don't see the value in making it configurable just because all the other properties in this class are, for historical reasons. In fact, I would say any property newly added to this class shouldn't be made configurable.
Do you still feel strongly about making it configurable? :)
@drcrallen thanks for checking the PR. Note that the intention of this PR is only to extract the leader election code out of the TaskMaster and DruidCoordinator classes, put it behind an interface, and retain existing behavior as much as possible.
@drcrallen added docs.
jihoonson left a comment
Looks good to me overall.
/**
 * Get ID of current Leader.
 */
@Nullable
Please document in the javadoc when the result is null.
/**
 * Must be called right after registerLeader(Listener).
 */
void start();
Maybe it would be better to combine the two methods above into a single method like registerListenerAndStart().
OK, on second thought, I removed the start/stop methods altogether and instead have registerListener(listener) and unregisterListener().
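Put together, the shape discussed in this thread might look like the sketch below; the javadoc wording and the Listener type are my reading of the review comments, not the merged code:

```java
// Sketch of the interface shape as discussed in this thread, with
// registerListener/unregisterListener replacing start/stop. Javadoc wording
// and the Listener type are illustrative, not the merged text.
interface DruidLeaderSelector
{
  /**
   * Returns the ID of the current leader, or null if no leader is currently
   * known. The result may be stale by the time the caller reads it.
   */
  String getCurrentLeader();

  /**
   * Returns whether this node believed itself to be the leader at some point
   * during the call; leadership may have changed since.
   */
  boolean isLeader();

  /**
   * Registers the listener and starts participating in leader election.
   */
  void registerListener(Listener listener);

  /**
   * Unregisters the listener and stops participating in leader election.
   */
  void unregisterListener();

  interface Listener
  {
    void becomeLeader();

    void stopBeingLeader();
  }
}
```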
…rSelector interface
/**
 * Interface for supporting Overlord and Coordinator Leader Elections in TaskMaster and DruidCoordinator
 * which expect appropriate implementation available in guice annotated with @IndexingService and @Coordinator
Suggest adding more comments here that explicitly state that the values returned were true at some point during the call, and may not still be true by the time the caller reads the values.
Added more docs to isLeader() and getLeader().
try {
  final LeaderLatch latch = leaderLatch.get();

  Participant participant = latch.getLeader();
The fix might be outside the scope of this PR, though.
Yeah, I think that would still exist.
log.makeAlert(ex, "listener becomeLeader() failed. Unable to become leader").emit();

// give others a chance to become leader.
final LeaderLatch oldLatch = createNewLeaderLatch();
This is a really interesting way of doing recursion, but seems to match what the code was doing previously.
Yeah, I pretty much took the same code.
PolyBind.optionBinder(binder, Key.get(DruidLeaderSelector.class, Coordinator.class))
    .addBinding(CURATOR_KEY)
    .toProvider(new DruidLeaderSelectorProvider(
        (zkPathsConfig) -> ZKPaths.makePath(zkPathsConfig.getCoordinatorPath(), "druid:coordinator"))
Why is this hard-coded here?
(minor) Since the path is explicitly for Druid, the druid: prefix is redundant. Can it retain the prior naming mechanism of _COORDINATOR?
Yeah, changed; using _COORDINATOR and _OVERLORD now.
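For illustration, the latch path composition being discussed; makePath below is a plain-string stand-in for Curator's ZKPaths.makePath, and the base path value is a made-up example:

```java
// Stand-in for Curator's ZKPaths.makePath, which joins ZooKeeper path
// segments with "/" while avoiding doubled separators. The base path value
// is a made-up example; only the "_COORDINATOR" child name comes from the
// discussion above.
class LatchPathSketch
{
  static String makePath(String parent, String child)
  {
    String trimmed = parent.endsWith("/") ? parent.substring(0, parent.length() - 1) : parent;
    return trimmed + "/" + child;
  }

  public static void main(String[] args)
  {
    // Hypothetical zkPathsConfig.getCoordinatorPath() value.
    String coordinatorPath = "/druid/coordinator";
    String latchPath = makePath(coordinatorPath, "_COORDINATOR");
    if (!latchPath.equals("/druid/coordinator/_COORDINATOR")) {
      throw new AssertionError(latchPath);
    }
    System.out.println(latchPath);
  }
}
```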
@gianm / @himanshug what does the upgrade path look like for Tranquility? How does this upgrade roll out without breaking it?
@drcrallen @gianm does Tranquility have enough retries to handle the scenario where all overlord nodes are briefly down?
@himanshug by default it retries basically any failure of any kind (overlord or task) for 1 minute and then gives up on that batch of events, reports it as dropped, and moves on to the next batch. So that is the behavior you would expect to see if all overlords are down for more than 1 minute.
The issue here is also the consistency of discovery: for Tranquility using the HTTP connection to the overlord, will it be able to discover the overlord properly without a config change?
I haven't looked at the code in this PR, but assuming that the overlord and also the running tasks still announce in Curator service discovery when you choose Curator-based leader election, it should all be the same to Tranquility, right @himanshug?
In general I thought one of the goals of this PR was to make it so there is no change in Curator service-discovery behavior if you stick with the Curator impl.
It is entirely possible my brain crossed some wires when looking through the PR. Trying to keep the different ways announcements are used straight in my head was challenging.
@gianm @drcrallen yes, this PR does not remove the announcement of the overlord/coordinator leader inside "external service discovery". That announcement is currently used by Tranquility, and also internally by peons and the coordinator to discover the overlord leader, and by the router to discover the coordinator leader. #4735 sets the stage for removing use of "external service discovery" from internal components. As for Tranquility or other things interacting with the overlord leader, they might get temporary errors when all overlords are brought down, until at least one of them comes back up. I have updated this information in the release notes section of the PR description. Also, I think all the comments have been addressed at this point and the PR should be merge ready.
drcrallen left a comment
@himanshug thanks a ton for your effort in getting this patch to a great state!
I see 3 👍's
setupServerAndCurator();
}

@Test(timeout = 5000)
I think we should increase this timeout; on my laptop I saw some runs take between 5 and 7 seconds, which led to some test failures.
Hmmm, earlier I did not expect it to take long, but changing the sleep (as per the discussion in #4699 (review)) from 1 second to a random value between 1 and 5 seconds can now make this test take longer, since it exercises that code path. I will update the timeout to 15 seconds, which should be good enough.
Follow-up to #4634.

Introduces the DruidLeaderSelector interface and a Curator based implementation, used at the coordinator (in DruidCoordinator) and the overlord (in TaskMaster).

Note for the 0.11.0 release upgrade:
Because the overlord leader election algorithm changes with this patch, all overlords must be shut down, upgraded, and then started; at no time during the upgrade should two overlords be running where one of them is not on 0.11.0.
Note that at least one overlord should be brought up as quickly as possible after shutting them all down, so that peons, Tranquility, etc. continue to work after some retries.
druid.zk.paths.indexer.leaderLatchPath is ignored now.