Curator leader election breaks in the overlord when zookeeper has issues

When ZooKeeper has a blip, the overlord can get into a condition we call "split brain", where leader election is broken. The outcome of submitted requests then becomes non-deterministic: sometimes a request "succeeds", and sometimes the overlord registers it as having succeeded without actually returning a success.

We have found two indicators for such a scenario. One is the following error in the logs of the overlord:
```
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /DRUID_PATH/indexer/leaderLatchPath/_c_3c32cd34-4370-4e26-922d-a9b24afa4a91-lock-0000102217
        at com.google.common.base.Throwables.propagate(Throwables.java:160)
        at io.druid.indexing.overlord.TaskMaster.getLeader(TaskMaster.java:251)
        at io.druid.indexing.overlord.http.OverlordRedirectInfo.getRedirectURL(OverlordRedirectInfo.java:52)
        at io.druid.server.http.RedirectFilter.doFilter(RedirectFilter.java:73)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
        at org.eclipse.jetty.servlets.UserAgentFilter.doFilter(UserAgentFilter.java:83)
        at org.eclipse.jetty.servlets.GzipFilter.doFilter(GzipFilter.java:364)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:221)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
        at org.eclipse.jetty.server.Server.handle(Server.java:497)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
        at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:620)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:540)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /DRUID_PATH/indexer/leaderLatchPath/_c_3c32cd34-4370-4e26-922d-a9b24afa4a91-lock-0000102217
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1212)
        at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:304)
        at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:293)
        at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:108)
        at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:290)
        at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:281)
        at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:42)
        at org.apache.curator.framework.recipes.leader.LeaderSelector.participantForPath(LeaderSelector.java:375)
        at org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:346)
        at org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:339)
        at io.druid.indexing.overlord.TaskMaster.getLeader(TaskMaster.java:243)
        ... 22 more
```

(This looks like https://issues.apache.org/jira/browse/CURATOR-358.)


The other is increased CPU usage on both active overlords.
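
The log indicator can be checked mechanically. Below is a minimal sketch, assuming the overlord log is available as plain text. The function name and the regular expression are illustrative and not part of Druid. The pattern is derived only from the trace above and does not pin the path prefix, since that varies per deployment.

```python
import re

# Matches the NoNode error on a leader latch path, as seen in the trace above.
# The ZooKeeper base path is deployment-specific, so only the
# "/leaderLatchPath/" component is required (an assumption of this sketch).
NO_NODE_RE = re.compile(
    r"KeeperException\$NoNodeException: KeeperErrorCode = NoNode for "
    r"\S*/leaderLatchPath/\S+"
)


def count_leader_latch_errors(log_text: str) -> int:
    """Count NoNode leader-latch errors, skipping the 'Caused by:' repeats."""
    count = 0
    for line in log_text.splitlines():
        # Each exception is reported twice in a trace: once at the top
        # and once as its cause. Only count the top-level line.
        if line.lstrip().startswith("Caused by:"):
            continue
        if NO_NODE_RE.search(line):
            count += 1
    return count
```

A non-zero count on its own only shows that the error occurred. Treat it as a reason to also check CPU usage on the overlords, not as proof of split brain.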

The solution is to restart the overlords and let them clean up their state.

In rare scenarios, a middle manager can fail to submit its results to the overlord after a peon completes, and the task never completes. This happens when the peon has already given up its task lock but never properly finished its segment insertion at the middle manager level, so the middle manager retries indefinitely. The fix here is to restart the middle manager, but this loses any outstanding results not yet submitted to the overlord.

Curator leader election breaks in the overlord when zookeeper has issues #3837

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Curator leader election breaks in the overlord when zookeeper has issues #3837

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions