Overlord fails to lead and all get stuck in nasty failure loop requiring restart. #4246

@drcrallen

We had the following exception occur in all overlords at the same time.

Failed to lead: {class=io.druid.indexing.overlord.TaskMaster, exceptionType=class java.lang.reflect.InvocationTargetException, exceptionMessage=null}
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:350)
	at com.metamx.common.lifecycle.Lifecycle.start(Lifecycle.java:259)
	at io.druid.indexing.overlord.TaskMaster$1.takeLeadership(TaskMaster.java:141)
	at org.apache.curator.framework.recipes.leader.LeaderSelector$WrappedListener.takeLeadership(LeaderSelector.java:534)
	at org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:399)
	at org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:441)
	at org.apache.curator.framework.recipes.leader.LeaderSelector.access$100(LeaderSelector.java:64)
	at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:245)
	at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:239)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@1925ca4 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@6928fd46[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 151]
	at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
	at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:326)
	at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533)
	at com.metamx.common.concurrent.ScheduledExecutors.scheduleAtFixedRate(ScheduledExecutors.java:159)
	at com.metamx.common.concurrent.ScheduledExecutors.scheduleAtFixedRate(ScheduledExecutors.java:135)
	at com.metamx.common.concurrent.ScheduledExecutors.scheduleAtFixedRate(ScheduledExecutors.java:121)
	at io.druid.indexing.overlord.autoscaling.AbstractWorkerResourceManagementStrategy.startManagement(AbstractWorkerResourceManagementStrategy.java:63)
	at io.druid.indexing.overlord.autoscaling.AbstractWorkerResourceManagementStrategy.startManagement(AbstractWorkerResourceManagementStrategy.java:34)
	at io.druid.indexing.overlord.RemoteTaskRunner.start(RemoteTaskRunner.java:312)
	... 19 more

and

Failed to lead: {class=io.druid.indexing.overlord.TaskMaster, exceptionType=class java.lang.reflect.InvocationTargetException, exceptionMessage=null}
sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:350)
        at com.metamx.common.lifecycle.Lifecycle.start(Lifecycle.java:259)
        at io.druid.indexing.overlord.TaskMaster$1.takeLeadership(TaskMaster.java:141)
        at org.apache.curator.framework.recipes.leader.LeaderSelector$WrappedListener.takeLeadership(LeaderSelector.java:534)
        at org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:399)
        at org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:441)
        at org.apache.curator.framework.recipes.leader.LeaderSelector.access$100(LeaderSelector.java:64)
        at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:245)
        at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:239)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@2928a0b1 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@6928fd46[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 151]
        at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
        at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:326)
        at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533)
        at com.metamx.common.concurrent.ScheduledExecutors.scheduleAtFixedRate(ScheduledExecutors.java:159)
        at com.metamx.common.concurrent.ScheduledExecutors.scheduleAtFixedRate(ScheduledExecutors.java:135)
        at com.metamx.common.concurrent.ScheduledExecutors.scheduleAtFixedRate(ScheduledExecutors.java:121)
        at io.druid.indexing.overlord.autoscaling.AbstractWorkerResourceManagementStrategy.startManagement(AbstractWorkerResourceManagementStrategy.java:63)
        at io.druid.indexing.overlord.autoscaling.AbstractWorkerResourceManagementStrategy.startManagement(AbstractWorkerResourceManagementStrategy.java:34)
        at io.druid.indexing.overlord.RemoteTaskRunner.start(RemoteTaskRunner.java:312)
        ... 18 more

Somehow the overlords ended up in this state and kept repeating the same failure. Recovering required restarting all of them.
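
Both traces show the task being rejected by the same executor instance (`ScheduledThreadPoolExecutor@6928fd46`), which is already in the `Terminated` state. The sketch below is not Druid code. It only reproduces the JDK behavior visible in the `Caused by` section: once a scheduled executor has been shut down, every later `scheduleAtFixedRate` call throws `RejectedExecutionException`, so anything that reuses that instance on each start will fail every time until the process is restarted. `Manager`, `start`, `stop`, and `startIsRejected` are hypothetical names for illustration, and whether the overlord's resource management strategy actually stops and reuses its executor this way is an assumption.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class TerminatedExecutorDemo {
    // Hypothetical stand-in for a component that is started on each
    // leadership attempt but holds a single long-lived executor.
    static class Manager {
        private final ScheduledExecutorService exec = Executors.newScheduledThreadPool(1);

        void start() {
            // Throws RejectedExecutionException if exec was already shut down.
            exec.scheduleAtFixedRate(() -> {}, 0, 1, TimeUnit.SECONDS);
        }

        void stop() {
            // After this call the executor can never accept tasks again.
            exec.shutdownNow();
        }
    }

    // Returns true when start() is rejected by the executor.
    static boolean startIsRejected(Manager m) {
        try {
            m.start();
            return false;
        } catch (RejectedExecutionException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Manager m = new Manager();
        System.out.println(startIsRejected(m)); // false: first start succeeds
        m.stop();                               // e.g. leadership lost
        System.out.println(startIsRejected(m)); // true: same executor reused
        System.out.println(startIsRejected(m)); // true: and again on every retry
    }
}
```

If that assumption holds, creating a fresh executor in `start()` instead of reusing the terminated one would let a later leadership attempt succeed without a restart.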
