
Curator leader election breaks in the overlord when zookeeper has issues #3837

@drcrallen


When ZooKeeper has a blip, the overlord can get into a condition we call "split brain", where leader election breaks down. This makes the outcome of submitted requests nondeterministic: sometimes a request succeeds, and sometimes it is registered as having succeeded but no success response is actually returned.

We have found two indicators for such a scenario. One is the following error in the logs of the overlord:

org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /DRUID_PATH/indexer/leaderLatchPath/_c_3c32cd34-4370-4e26-922d-a9b24afa4a91-lock-0000102217
        at com.google.common.base.Throwables.propagate(Throwables.java:160)
        at io.druid.indexing.overlord.TaskMaster.getLeader(TaskMaster.java:251)
        at io.druid.indexing.overlord.http.OverlordRedirectInfo.getRedirectURL(OverlordRedirectInfo.java:52)
        at io.druid.server.http.RedirectFilter.doFilter(RedirectFilter.java:73)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
        at org.eclipse.jetty.servlets.UserAgentFilter.doFilter(UserAgentFilter.java:83)
        at org.eclipse.jetty.servlets.GzipFilter.doFilter(GzipFilter.java:364)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:221)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
        at org.eclipse.jetty.server.Server.handle(Server.java:497)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
        at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:620)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:540)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /DRUID_PATH/indexer/leaderLatchPath/_c_3c32cd34-4370-4e26-922d-a9b24afa4a91-lock-0000102217
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1212)
        at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:304)
        at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:293)
        at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:108)
        at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:290)
        at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:281)
        at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:42)
        at org.apache.curator.framework.recipes.leader.LeaderSelector.participantForPath(LeaderSelector.java:375)
        at org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:346)
        at org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:339)
        at io.druid.indexing.overlord.TaskMaster.getLeader(TaskMaster.java:243)
        ... 22 more

(This looks like https://issues.apache.org/jira/browse/CURATOR-358.)
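
The trace above shows the leader lookup in TaskMaster.getLeader failing because the lock node could no longer be read, and the exception then propagating out of the redirect filter. As a rough illustration only, here is a minimal sketch of a lookup that treats NoNode as "leadership is changing" and re-reads instead of propagating. Everything in it is an assumption for illustration: `LeaderLookup`, `LeaderSource`, `findLeader`, and the nested `NoNodeException` are hypothetical stand-ins, not Druid or Curator code, and this is not the actual fix for this issue.

```java
import java.util.Optional;

// Hypothetical sketch; none of these names exist in Druid or Curator.
final class LeaderLookup {

    /** Stand-in for org.apache.zookeeper.KeeperException.NoNodeException. */
    static final class NoNodeException extends Exception {
        NoNodeException(String path) {
            super("KeeperErrorCode = NoNode for " + path);
        }
    }

    /** Stand-in for whatever call reads the current leader's id. */
    interface LeaderSource {
        String currentLeader() throws NoNodeException;
    }

    /**
     * Asks the source for the leader up to maxAttempts times.
     * A NoNodeException is taken to mean the participant node vanished
     * while it was being read, so the lookup is simply repeated.
     * Returns Optional.empty() if every attempt hit NoNode or the
     * source reported no leader (null).
     */
    static Optional<String> findLeader(LeaderSource source, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return Optional.ofNullable(source.currentLeader());
            } catch (NoNodeException e) {
                // The node disappeared mid-read; fall through and try again.
            }
        }
        return Optional.empty();
    }

    private LeaderLookup() {
    }
}
```

In an overlord, the `LeaderSource` would presumably wrap the Curator call seen in the trace (LeaderSelector.getLeader), and an empty result could be reported to the client as "no leader currently known" rather than as a server error. This would only hide the symptom in the redirect path; it would not repair the underlying election state or the increased CPU described below.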

The other is increased CPU usage on both active overlords.

The solution is to restart the overlords and let them clean up their state.

In rare scenarios, a middle manager can fail to submit its results to the overlord after a peon completes, and the task never completes. This happens when the peon has already given up its task lock but never properly finished its segment insertion at the middle manager level, so the middle manager retries indefinitely. The fix here is to restart the middle manager, but this loses any outstanding work not yet submitted to the overlord.
