Encountered an issue where the HTTP-based Druid coordinator appears to become stuck if a historical server disappears while a segment move is in progress. Relevant logs:
2018-03-26T20:01:19,976 INFO [Coordinator-Exec--0] io.druid.server.coordinator.HttpLoadQueuePeon - Server[https://disappearing.historical.internal:8283] to load segment[stuck_segment] queued.
2018-03-26T20:01:38,284 INFO [Master-PeonExec--0] io.druid.server.coordinator.HttpLoadQueuePeon - Server[https://disappearing.historical.internal:8283] Successfully processed segment[stuck_segment] request[SegmentChangeRequestLoad].
2018-03-26T20:03:50,453 INFO [Coordinator-Exec--0] io.druid.server.coordinator.HttpLoadQueuePeon - Server[https://another.historical.internal:8283] to load segment[stuck_segment_1] queued.
2018-03-26T20:04:09,521 INFO [Master-PeonExec--0] io.druid.server.coordinator.HttpLoadQueuePeon - Server[https://another.historical.internal:8283] Successfully processed segment[stuck_segment_1] request[SegmentChangeRequestLoad].
2018-03-26T20:19:22,437 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorBalancer - Moving [stuck_segment] from [disappearing.historical.internal:8283] to [different.historical.internal:8283]
2018-03-26T20:19:22,437 INFO [Coordinator-Exec--0] io.druid.server.coordinator.HttpLoadQueuePeon - Server[https://different.historical.internal:8283] to load segment[stuck_segment] queued.
2018-03-26T20:19:47,741 INFO [NodeTypeWatcher[historical]] io.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeTypeWatcher - Node[disappearing.historical.internal:8283:DiscoveryDruidNode{druidNode=DruidNode{serviceName='druid/historical', host='disappearing.historical.internal', port=-1, plaintextPort=8083, enablePlaintextPort=true, tlsPort=8283, enableTlsPort=true}, nodeType='historical', services={dataNodeService=DataNodeService{tier='_default_tier', maxSize=1400000000000, type=historical, priority=0}, lookupNodeService=LookupNodeService{lookupTier='__default'}}}] disappeared.
2018-03-26T20:19:47,741 INFO [CuratorDruidNodeDiscoveryProvider-ListenerExecutor] io.druid.client.HttpServerInventoryView - Server[disappearing.historical.internal:8283] disappeared.
2018-03-26T20:19:47,741 INFO [CuratorDruidNodeDiscoveryProvider-ListenerExecutor] io.druid.server.coordination.ChangeRequestHttpSyncer - Stopped ChangeRequestHttpSyncer[https://disappearing.historical.internal:8283/_1522088641953].
2018-03-26T20:19:47,756 WARN [HttpClient-Netty-Worker-31] io.druid.java.util.http.client.NettyHttpClient - [GET https://disappearing.historical.internal:8283/druid-internal/v1/segments?counter=1058&hash=1522095572371&timeout=240000] Channel disconnected before response complete
2018-03-26T20:19:47,757 INFO [HttpServerInventoryView-0] io.druid.server.coordination.ChangeRequestHttpSyncer - Skipping sync() failure for URL[https://disappearing.historical.internal:8283/_1522088641953].
2018-03-26T20:19:51,057 INFO [pool-25-thread-1] io.druid.server.coordinator.HttpLoadQueuePeon - Server[https://disappearing.historical.internal:8283] to drop segment[not_stuck_segment] queued.
2018-03-26T20:19:51,057 INFO [Master-PeonExec--0] io.druid.java.util.http.client.pool.ChannelResourceFactory - Generating: https://disappearing.historical.internal:8283
java.net.ConnectException: Connection refused: disappearing.historical.internal/1.2.3.4:8283
2018-03-26T20:19:51,081 ERROR [Master-PeonExec--0] io.druid.server.coordinator.HttpLoadQueuePeon - Request[https://disappearing.historical.internal:8283/druid-internal/v1/segments/changeRequests?timeout=300000] Failed with status[0]. Reason[null].
Caused by: org.jboss.netty.channel.ChannelException: Failed to connect to host[https://disappearing.historical.internal:8283]
Caused by: java.net.ConnectException: Connection refused: disappearing.historical.internal/1.2.3.4:8283
2018-03-26T20:19:52,467 INFO [Coordinator-Exec--0] io.druid.server.coordinator.DruidCoordinator - Removing listener for server[disappearing.historical.internal:8283] which is no longer there.
2018-03-26T20:19:52,675 INFO [Coordinator-Exec--0] io.druid.server.coordinator.HttpLoadQueuePeon - Server[https://other.historical.internal:8283] to load segment[stuck_segment] queued.
2018-03-26T20:26:53,269 ERROR [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorBalancer - [_default_tier]: Balancer move segments queue has a segment stuck: {class=io.druid.server.coordinator.helper.DruidCoordinatorBalancer, segment=stuck_segment, server=DruidServerMetadata{name='disappearing.historical.internal:8283', hostAndPort='disappearing.historical.internal:8083', hostAndTlsPort='disappearing.historical.internal:8283', maxSize=1400000000000, tier='_default_tier', type=historical, priority=0}}
2018-03-26T20:27:23,292 ERROR [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorBalancer - [_default_tier]: Balancer move segments queue has a segment stuck: {class=io.druid.server.coordinator.helper.DruidCoordinatorBalancer, segment=stuck_segment, server=DruidServerMetadata{name='disappearing.historical.internal:8283', hostAndPort='disappearing.historical.internal:8083', hostAndTlsPort='disappearing.historical.internal:8283', maxSize=1400000000000, tier='_default_tier', type=historical, priority=0}}
2018-03-26T20:27:53,313 ERROR [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorBalancer - [_default_tier]: Balancer move segments queue has a segment stuck: {class=io.druid.server.coordinator.helper.DruidCoordinatorBalancer, segment=stuck_segment, server=DruidServerMetadata{name='disappearing.historical.internal:8283', hostAndPort='disappearing.historical.internal:8083', hostAndTlsPort='disappearing.historical.internal:8283', maxSize=1400000000000, tier='_default_tier', type=historical, priority=0}}
2018-03-26T20:28:23,332 ERROR [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorBalancer - [_default_tier]: Balancer move segments queue has a segment stuck: {class=io.druid.server.coordinator.helper.DruidCoordinatorBalancer, segment=stuck_segment, server=DruidServerMetadata{name='disappearing.historical.internal:8283', hostAndPort='disappearing.historical.internal:8083', hostAndTlsPort='disappearing.historical.internal:8283', maxSize=1400000000000, tier='_default_tier', type=historical, priority=0}}
2018-03-26T20:28:40,837 INFO [Master-PeonExec--0] io.druid.server.coordinator.HttpLoadQueuePeon - Server[https://other.historical.internal:8283] Successfully processed segment[stuck_segment] request[SegmentChangeRequestLoad].
2018-03-26T20:28:53,347 INFO [Coordinator-Exec--0] io.druid.server.coordinator.HttpLoadQueuePeon - Server[https://different.historical.internal:8283] to drop segment[stuck_segment] queued.
2018-03-26T20:28:53,352 ERROR [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorBalancer - [_default_tier]: Balancer move segments queue has a segment stuck: {class=io.druid.server.coordinator.helper.DruidCoordinatorBalancer, segment=stuck_segment, server=DruidServerMetadata{name='disappearing.historical.internal:8283', hostAndPort='disappearing.historical.internal:8083', hostAndTlsPort='disappearing.historical.internal:8283', maxSize=1400000000000, tier='_default_tier', type=historical, priority=0}}
2018-03-26T20:28:54,157 INFO [Master-PeonExec--0] io.druid.server.coordinator.HttpLoadQueuePeon - Server[https://different.historical.internal:8283] Successfully processed segment[stuck_segment] request[SegmentChangeRequestDrop].
2018-03-26T20:29:16,638 INFO [Master-PeonExec--0] io.druid.server.coordinator.HttpLoadQueuePeon - Server[https://different.historical.internal:8283] Successfully processed segment[stuck_segment] request[SegmentChangeRequestLoad].
2018-03-26T20:29:16,638 INFO [pool-25-thread-1] io.druid.server.coordinator.HttpLoadQueuePeon - Server[https://disappearing.historical.internal:8283] to drop segment[stuck_segment] queued.
2018-03-26T20:29:23,366 INFO [Coordinator-Exec--0] io.druid.server.coordinator.HttpLoadQueuePeon - Server[https://different.historical.internal:8283] to drop segment[stuck_segment] queued.
2018-03-26T20:29:23,370 ERROR [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorBalancer - [_default_tier]: Balancer move segments queue has a segment stuck: {class=io.druid.server.coordinator.helper.DruidCoordinatorBalancer, segment=stuck_segment, server=DruidServerMetadata{name='disappearing.historical.internal:8283', hostAndPort='disappearing.historical.internal:8083', hostAndTlsPort='disappearing.historical.internal:8283', maxSize=1400000000000, tier='_default_tier', type=historical, priority=0}}
2018-03-26T20:29:25,372 INFO [Master-PeonExec--0] io.druid.server.coordinator.HttpLoadQueuePeon - Server[https://different.historical.internal:8283] Successfully processed segment[stuck_segment] request[SegmentChangeRequestDrop].
2018-03-26T20:29:53,384 INFO [Coordinator-Exec--0] io.druid.server.coordinator.HttpLoadQueuePeon - Server[https://different.historical.internal:8283] to drop segment[stuck_segment] queued.
2018-03-26T20:29:53,393 ERROR [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorBalancer - [_default_tier]: Balancer move segments queue has a segment stuck: {class=io.druid.server.coordinator.helper.DruidCoordinatorBalancer, segment=stuck_segment, server=DruidServerMetadata{name='disappearing.historical.internal:8283', hostAndPort='disappearing.historical.internal:8083', hostAndTlsPort='disappearing.historical.internal:8283', maxSize=1400000000000, tier='_default_tier', type=historical, priority=0}}
What appears to be happening here is that the source server of a move operation disappears while the load to the destination is still in progress, and the HttpLoadQueuePeon then eats the callback and never executes it when dropSegment is called after the load to the destination completes. The outer coordinator layer correctly calls the stop method of the source server's peon, but HttpLoadQueuePeon's loadSegment and dropSegment methods do not check whether the peon has been stopped, so they can accumulate callbacks that will NEVER be executed. Then, because the balancer only removes segments from currentlyMovingSegments after the callback is executed, the segments stay stuck until the coordinator is restarted.
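The failure mode above can be sketched with a minimal, hypothetical model of the peon (the class, enum, and method names here are illustrative, not the actual Druid implementation): the fix would be for loadSegment/dropSegment to check the stopped flag and fail the callback immediately, and for stop() to fail any callbacks already queued, so the balancer can clear the segment from currentlyMovingSegments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical, simplified model of an HTTP load-queue peon. In the buggy
// behavior described above, loadSegment() would queue the callback even
// after stop() had been called, so it would never run and the balancer's
// currentlyMovingSegments entry would never be cleared.
class SimplifiedLoadQueuePeon {
    enum Status { SUCCESS, FAILED }

    private final List<Consumer<Status>> queuedCallbacks = new ArrayList<>();
    private boolean stopped = false;

    synchronized void stop() {
        stopped = true;
        // Fail any callbacks queued before the server disappeared, so the
        // caller (e.g. the balancer) is notified and can clean up its state.
        for (Consumer<Status> cb : queuedCallbacks) {
            cb.accept(Status.FAILED);
        }
        queuedCallbacks.clear();
    }

    synchronized void loadSegment(String segment, Consumer<Status> callback) {
        if (stopped) {
            // The key check: report failure immediately instead of
            // silently accumulating a callback that can never execute.
            callback.accept(Status.FAILED);
            return;
        }
        queuedCallbacks.add(callback);
    }
}
```

The same stopped-check would apply to dropSegment; the essential point is that every callback handed to the peon is eventually invoked, even when the target server is gone.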