DruidLeaderSelector interface for leader election and Curator based impl. #4699
drcrallen merged 9 commits into apache:master
Conversation
Force-pushed from b30e2b1 to ba16965
Force-pushed from ba16965 to 73e135d
…mpl. DruidCoordinator/TaskMaster are updated to use the new interface.
Force-pushed from 6259080 to 26600e4
👍
leader = true;
term++;
Doesn't ZooKeeper already have a term concept? Is there any benefit to using another term instead of inheriting the one in ZooKeeper?
Curator LeaderLatch does not expose any term concept; you are probably thinking of the internal leader election among ZooKeeper quorum members.
Also, this term is kept to preserve the behavior of the existing code in DruidCoordinator, which maintained a local term to keep things in check.
leader = false;
try {
  // Small delay before starting the latch so that others waiting are chosen to become leader.
  Thread.sleep(1000);
An unlucky GC could easily induce a pause of 1 second. Is there a way to do this without the mandatory pause?
This is probably the only place in the whole PR where I have intentionally introduced something that did not exist before, and it is done to fix the following scenario:
- Say this coordinator node has a problem becoming leader (e.g. not being able to reach the database). It then needs to tell Curator to give up its leadership so that someone else becomes the leader. However, if it is too quick in starting the next latch, Curator may end up choosing this node as leader again and keep repeating the cycle.

This simple artificial pause solves the problem without introducing any issues.
I can't think of a better way right now without introducing a lot of potentially unhelpful complexity.
The overlord was using Curator LeaderSelector and not LeaderLatch, which is a completely different implementation. Most likely #3428 would be solved by this.
For stuff like this, a random sleep works better than a fixed sleep, since a fixed sleep can cause the same sequence of leader changes to play out over and over again. And I think a random sleep is a totally valid mechanism of trying to jolt a leader-election system onto a better leader. It's pretty common in leader-election algorithms.
In this case, maybe a random sleep between 500ms and 5000ms would work.
Makes sense; changed to a random sleep between 1000ms and 5000ms.
The minimum is 1000ms rather than 500ms just to be safer, and also because that is the value I mostly tested with.
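As a concrete illustration of the randomized backoff agreed on above (the class and method names are hypothetical; only the 1000ms to 5000ms bounds come from the discussion):

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of the randomized pause before re-creating the leader
// latch. Only the 1000ms-5000ms bounds come from the discussion; the class
// and method names are illustrative, not the PR's actual code.
class LeaderLatchBackoff
{
  static final long MIN_SLEEP_MS = 1000;
  static final long MAX_SLEEP_MS = 5000;

  // A random delay, unlike a fixed one, makes it unlikely that the same
  // unhealthy node keeps winning re-election in a tight loop.
  static long nextBackoffMs()
  {
    return ThreadLocalRandom.current().nextLong(MIN_SLEEP_MS, MAX_SLEEP_MS);
  }

  public static void main(String[] args)
  {
    long sleepMs = nextBackoffMs();
    if (sleepMs < MIN_SLEEP_MS || sleepMs >= MAX_SLEEP_MS) {
      throw new AssertionError("backoff out of range: " + sleepMs);
    }
    System.out.println("would sleep " + sleepMs + "ms before re-creating the latch");
  }
}
```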
drcrallen left a comment
@himanshug can you please fill out the interface method docs for the new interface? It is impossible to review the PR without actually knowing what the interface is supposed to be doing.
Execs.singleThreaded(StringUtils.format("LeaderSelector[%s]", latchPath))
This is not lifecycled at all. Is it possible to have the lifecycle controlled here somehow, to make sure that the executor is not running when this JVM is not the leader?
Also, is it possible to add the term to the string format?
This executor does not change at each term, so the term is not part of the name. Also, it consists of a daemon thread and does not need to be stopped explicitly, except when the JVM is shutting down.
Note that it needs to be running even when this node is not the leader, in order to get the notification that this node should become leader; it is the "watcher" passed to the Curator LeaderLatch.
Gotcha, I misunderstood the scope of this object.
@himanshug wait, this recurses via https://github.com/druid-io/druid/pull/4699/files#diff-2ed9a00e63ba80d914302e0f4b2a18b4R93 so it can have a lot of these, right?
Hmmm, given that this code existed before and has been working fine, I had earlier thought these would become garbage and be GC'd away, since no reference to the executors is maintained.
However, I ran some tests today to verify that, and it turns out these objects don't get GC'd; all of them linger in the JVM. They don't cause anything bad, but they waste some memory. I will update the code to fix this, thanks.
@drcrallen updated the code to create the executor just once instead of recreating it every time.
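A minimal sketch of that fix, under the assumption that only the latch needs to be recreated per term; the class, field, and method names here are illustrative stand-ins, not the actual PR code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative stand-in (not the actual PR code) for creating the listener
// executor once and reusing it across latch recreations, so each new term
// no longer leaks another executor.
class LatchRecreationSketch
{
  // Created once; a single daemon thread that watches leadership changes.
  final ExecutorService listenerExec = Executors.newSingleThreadExecutor(r -> {
    Thread t = new Thread(r, "LeaderSelector[sketch]");
    t.setDaemon(true);
    return t;
  });

  final AtomicInteger latchesCreated = new AtomicInteger();

  // Stand-in for createNewLeaderLatch(): only the latch object is new; the
  // executor handed to it is the shared field above.
  Object createNewLeaderLatch()
  {
    latchesCreated.incrementAndGet();
    return new Object(); // placeholder for a Curator LeaderLatch
  }

  public static void main(String[] args)
  {
    LatchRecreationSketch s = new LatchRecreationSketch();
    ExecutorService first = s.listenerExec;
    s.createNewLeaderLatch();
    s.createNewLeaderLatch();
    if (first != s.listenerExec || s.latchesCreated.get() != 2) {
      throw new AssertionError();
    }
    System.out.println("executor reused across " + s.latchesCreated.get() + " latch recreations");
  }
}
```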
@Override
public String getCurrentLeader()

@Override
public boolean isLeader()
{
  return leader;
If the leader status is being contested, should this block until the contest is settled?
While under contest at the Curator/ZooKeeper level, this would be false, which is in line with the current semantics.
/**
 */
public interface DruidLeaderSelector
This class needs a LOT more documentation on what the method contracts are.
private static class DruidLeaderSelectorProvider implements Provider<DruidLeaderSelector>
Usually, to ease testing, there is a constructor annotated with @Inject that populates private final fields.
This is a utility class only, and the constructor is called explicitly to provide the latchPath, which is not part of the Guice binding; the other dependencies are injected, which is why it looks a bit different.
return ScheduledExecutors.Signal.STOP;
}

@Override
public ScheduledExecutors.Signal call()
What is the intended behavior if leadership is lost during this call?
This scheduled callable would stop repeating itself, and will get rescheduled when this node becomes leader again. I think the current semantics are retained.
Yes. This also means that if lookup management is taking a long time (minutes), this is going to continue doing lookup management until it either completes or errors out; only then will it give up leadership.
In such a scenario it is possible for the new leader to also do lookup management at the same time.
Is there a way to mitigate such a scenario?
Well, it is mitigated in the sense that even if multiple coordinators are doing lookup management, nothing bad happens. In the worst case, downstream nodes would receive requests to load the same lookup multiple times, and those requests would be ignored.
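The behavior described here can be modeled with a plain-Java stand-in; Signal below mimics ScheduledExecutors.Signal, and the field and task names are hypothetical:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicBoolean;

// Plain-Java model of the pattern discussed above: a repeating callable that
// returns STOP once leadership is lost and gets rescheduled on re-election.
// Signal mimics ScheduledExecutors.Signal; all names are illustrative.
class LeadershipGatedTask
{
  enum Signal { REPEAT, STOP }

  final AtomicBoolean leader = new AtomicBoolean(true);

  final Callable<Signal> lookupManagement = () -> {
    if (!leader.get()) {
      // Leadership was lost, possibly while a pass was running: the current
      // pass still finishes (or errors out) before this STOP is returned, so
      // the new leader may briefly overlap. That overlap is harmless here
      // because duplicate lookup-load requests are ignored downstream.
      return Signal.STOP;
    }
    // ... one pass of lookup management would happen here ...
    return Signal.REPEAT;
  };

  void stopBeingLeader()
  {
    leader.set(false);
  }

  public static void main(String[] args) throws Exception
  {
    LeadershipGatedTask task = new LeadershipGatedTask();
    if (task.lookupManagement.call() != Signal.REPEAT) throw new AssertionError();
    task.stopBeingLeader();
    if (task.lookupManagement.call() != Signal.STOP) throw new AssertionError();
    System.out.println("callable repeats while leader, stops after leadership loss");
  }
}
```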
public String getOverlordPath()
{
  return defaultPath("overlord");
This is different from the other patterns, which have a settable path. Why does this one require only the default path?
I intentionally did not make the latch path configurable. IMO only the base ZooKeeper path should be configurable; all the other locations inside that base path should be dictated by Druid. I am not sure there is any value in making internal paths configurable.
My understanding is that the existing paths were made configurable only to support migration, not because there is any use case for them to really be user configurable.
I understand the reasoning, but this should be consistent with the other items in the class. If it is not needed to have settable paths, then this is not the right PR to make such a change.
Earlier we supported the druid.zk.paths.indexer.leaderLatchPath property, which was used by the overlord and obtained via IndexerZkConfig. (I'm going to note this in the release notes.)
Even if I honor that property in this PR, the overlord is still incompatible, given that the underlying leader election algorithm changes. Since compatibility is broken anyway, I don't see the value in making it configurable just because all the other properties in this class are, for historical reasons. In fact, I would say any property newly added to this class shouldn't be made configurable.
Do you still feel strongly about making it configurable? :)
@drcrallen thanks for checking the PR. Note that the intention of this PR is only to extract the leader election code out of the TaskMaster and DruidCoordinator classes, put it behind an interface, and retain existing behavior as much as possible.
@drcrallen added docs.
jihoonson left a comment
Looks good to me overall.
/**
 * Get ID of current Leader.
 */
@Nullable
Please document in the javadoc when the result is null.
/**
 * Must be called right after registerLeader(Listener).
 */
void start();
Maybe it would be better to combine the two methods above into a single method like registerListenerAndStart().
OK, on second thought, I removed the start/stop methods altogether and instead have registerListener(listener) and unregisterListener().
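Put together, the shape discussed in this thread might look like the sketch below; the javadoc wording and the Listener type are my reading of the review comments, not the merged code:

```java
// Sketch of the interface shape as discussed in this thread, with
// registerListener/unregisterListener replacing start/stop. Javadoc wording
// and the Listener type are illustrative, not the merged text.
interface DruidLeaderSelector
{
  /**
   * Returns the ID of the current leader, or null if no leader is currently
   * known. The result may be stale by the time the caller reads it.
   */
  String getCurrentLeader();

  /**
   * Returns whether this node believed itself to be the leader at some point
   * during the call; leadership may have changed since.
   */
  boolean isLeader();

  /**
   * Registers the listener and starts participating in leader election.
   */
  void registerListener(Listener listener);

  /**
   * Unregisters the listener and stops participating in leader election.
   */
  void unregisterListener();

  interface Listener
  {
    void becomeLeader();

    void stopBeingLeader();
  }
}
```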
…rSelector interface
/**
 * Interface for supporting Overlord and Coordinator Leader Elections in TaskMaster and DruidCoordinator
 * which expect appropriate implementation available in guice annotated with @IndexingService and @Coordinator
Suggest adding more comments here that explicitly state that the values returned were true at some point during the call, and may not still be true by the time the caller reads the values.
Added more docs to isLeader() and getLeader().
try {
  final LeaderLatch latch = leaderLatch.get();

  Participant participant = latch.getLeader();
The fix might be outside the scope of this PR, though.
Yeah, I think that would still exist.
log.makeAlert(ex, "listener becomeLeader() failed. Unable to become leader").emit();

// give others a chance to become leader.
final LeaderLatch oldLatch = createNewLeaderLatch();
This is a really interesting way of doing recursion, but seems to match what the code was doing previously.
Yeah, I pretty much took the same code.
PolyBind.optionBinder(binder, Key.get(DruidLeaderSelector.class, Coordinator.class))
    .addBinding(CURATOR_KEY)
    .toProvider(new DruidLeaderSelectorProvider(
        (zkPathsConfig) -> ZKPaths.makePath(zkPathsConfig.getCoordinatorPath(), "druid:coordinator"))
Why is this hard-coded here?
(minor) Since the path is explicitly for Druid, the druid: prefix is redundant. Can it retain the prior naming mechanism of _COORDINATOR?
Yeah, changed; using _COORDINATOR and _OVERLORD now.
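For illustration, the latch path composition being discussed; makePath below is a plain-string stand-in for Curator's ZKPaths.makePath, and the base path value is a made-up example:

```java
// Stand-in for Curator's ZKPaths.makePath, which joins ZooKeeper path
// segments with "/" while avoiding doubled separators. The base path value
// is a made-up example; only the "_COORDINATOR" child name comes from the
// discussion above.
class LatchPathSketch
{
  static String makePath(String parent, String child)
  {
    String trimmed = parent.endsWith("/") ? parent.substring(0, parent.length() - 1) : parent;
    return trimmed + "/" + child;
  }

  public static void main(String[] args)
  {
    // Hypothetical zkPathsConfig.getCoordinatorPath() value.
    String coordinatorPath = "/druid/coordinator";
    String latchPath = makePath(coordinatorPath, "_COORDINATOR");
    if (!latchPath.equals("/druid/coordinator/_COORDINATOR")) {
      throw new AssertionError(latchPath);
    }
    System.out.println(latchPath);
  }
}
```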
@gianm / @himanshug what does the upgrade path look like for Tranquility? How does this upgrade roll out without breaking it?
@drcrallen @gianm does Tranquility have enough retries to handle the scenario where all overlord nodes are briefly down?
@himanshug by default it retries basically any failure of any kind (overlord or task) for 1 minute and then gives up on that batch of events, reports it as dropped, and moves on to the next batch. So that is the behavior you would expect to see if all overlords are down for more than 1 minute.
The issue here is also the consistency of discovery: for Tranquility using the HTTP connection to the overlord, will it be able to discover the overlord properly without a config change?
I haven't looked at the code in this PR, but assuming that the overlord and also the running tasks still announce in Curator service discovery when you choose Curator-based leader election, it should all be the same to Tranquility, right @himanshug?
In general I thought one of the goals of this PR was to make it so there is no change in Curator service-discovery behavior if you stick with the Curator impl.
It is entirely possible my brain crossed some wires when looking through the PR. Trying to keep the different ways announcements are used straight in my head was challenging.
@gianm @drcrallen yes, this PR does not remove the announcement of the overlord/coordinator leader inside "external service discovery". That announcement is currently used by Tranquility, and also internally by peons and the coordinator to discover the overlord leader, and by the router to discover the coordinator leader. #4735 sets the stage for removing use of "external service discovery" from internal components. As for Tranquility or other things interacting with the overlord leader, they might get temporary errors when all overlords are brought down, until at least one of them comes back up. I have updated this information in the release notes section of the PR description. Also, I think all the comments have been addressed at this point and the PR should be merge ready.
drcrallen left a comment
@himanshug thanks a ton for your effort in getting this patch to a great state!
I see 3 👍's
setupServerAndCurator();
}

@Test(timeout = 5000)
I think we should increase this timeout; on my laptop I saw some runs take between 5 and 7 seconds, which led to some test failures.
Hmmm, earlier I did not expect it to take long, but changing the sleep (as per the discussion in #4699 (review)) from 1 second to a random value between 1 and 5 seconds can now make this test take longer, since it exercises that code path. I will update the timeout to 15 seconds, which should be good enough.
Follow-up to #4634.

Introduces the DruidLeaderSelector interface and a Curator based implementation, used at the coordinator (in DruidCoordinator) and the overlord (in TaskMaster).

Note for the 0.11.0 release upgrade:
Because the overlord leader election algorithm changes with this patch, all overlords must be shut down, upgraded, and then started; at no time during the upgrade should two overlords be running where one of them is not on 0.11.0.
Note that at least one overlord should be brought up as quickly as possible after shutting them all down, so that peons, Tranquility, etc. continue to work after some retries.
druid.zk.paths.indexer.leaderLatchPath is ignored now.