When rolling out the 0.12.2 branch to our test clusters, we noticed symptoms suggesting that #5927 can hork up coordinators. Thread dumps show a lot of time spent in stack traces like the one below, and one coordinator has now spent hours without finishing a run:
"Coordinator-Exec--0" #120 daemon prio=5 os_prio=0 tid=0x00007f06802d4000 nid=0x20f7 runnable [0x00007f066a7d0000]
java.lang.Thread.State: RUNNABLE
at io.druid.server.coordinator.ReservoirSegmentSampler.getRandomBalancerSegmentHolder(ReservoirSegmentSampler.java:46)
at io.druid.server.coordinator.CostBalancerStrategy.pickSegmentToMove(CostBalancerStrategy.java:224)
at io.druid.server.coordinator.helper.DruidCoordinatorBalancer.balanceTier(DruidCoordinatorBalancer.java:128)
at io.druid.server.coordinator.helper.DruidCoordinatorBalancer.lambda$run$0(DruidCoordinatorBalancer.java:84)
at io.druid.server.coordinator.helper.DruidCoordinatorBalancer$$Lambda$52/955068914.accept(Unknown Source)
at java.util.HashMap.forEach(HashMap.java:1289)
at io.druid.server.coordinator.helper.DruidCoordinatorBalancer.run(DruidCoordinatorBalancer.java:83)
at io.druid.server.coordinator.DruidCoordinator$CoordinatorRunnable.run(DruidCoordinator.java:677)
at io.druid.server.coordinator.DruidCoordinator$2.call(DruidCoordinator.java:571)
at io.druid.server.coordinator.DruidCoordinator$2.call(DruidCoordinator.java:564)
at io.druid.java.util.common.concurrent.ScheduledExecutors$2.run(ScheduledExecutors.java:102)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
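For context on where that time is likely going: pickSegmentToMove draws a random segment through ReservoirSegmentSampler, and reservoir sampling by nature scans every candidate once per draw. The snippet below is a minimal, hypothetical sketch of single-item reservoir sampling (Algorithm R), not Druid's actual code; the Segment and pickRandomSegment names are purely illustrative. It shows why each pick is linear in the total segment count, so a balancer pass that makes many picks multiplies that cost.

import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of single-item reservoir sampling (Algorithm R), the
// general technique ReservoirSegmentSampler is named after. All names here
// are illustrative, not Druid's actual API.
public class ReservoirSampleSketch
{
  static class Segment
  {
    final String id;

    Segment(String id)
    {
      this.id = id;
    }
  }

  // Scans every candidate once, keeping the i-th element with probability 1/i,
  // so each call is O(total segments) no matter how many segments get moved.
  static Segment pickRandomSegment(List<Segment> segments)
  {
    Segment picked = null;
    int seen = 0;
    for (Segment segment : segments) {
      seen++;
      if (ThreadLocalRandom.current().nextInt(seen) == 0) {
        picked = segment;
      }
    }
    return picked;
  }

  public static void main(String[] args)
  {
    List<Segment> segments = List.of(new Segment("a"), new Segment("b"), new Segment("c"));
    // A balancer run that wants N segments to move would do N full scans like
    // this, which matches where the time in the stack trace above is spent.
    System.out.println(pickRandomSegment(segments).id);
  }
}

On a cluster with a large number of served segments, repeating that full scan for every candidate move would plausibly keep the Coordinator-Exec thread pinned in getRandomBalancerSegmentHolder, as seen above.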
So for 0.12.2 we should either revert this patch, or try to achieve the same thing in some other way.
/cc @clintropolis