Currently, if a Druid node loses connectivity to ZooKeeper, Curator retries the connection and eventually gives up. At that point the Druid node is left in an inconsistent state. If this happens to a broker, it continues to serve queries but may return incorrect results.
Historicals that lose ZK connectivity fail to show up on the coordinator console even though the process is still running, which can be tricky for cluster operators to identify.
The proposal that I'm working on is to shut down the Druid process once the connection retries to ZK are exhausted. Shutting down the process makes more sense than leaving the node in an unstable state: a shutdown can trigger configured process alerts, and if a supervisor process is configured, it can restart the Druid process.
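To make the intended behavior concrete, here is a minimal, library-free sketch of "retry, then shut down". It is only an illustration under stated assumptions: `ZkRetryShutdownSketch`, `connectOrShutdown`, and its parameters are hypothetical names, the connection attempt is simulated with a `BooleanSupplier`, and none of this is Curator's or Druid's actual API. In a real node the `onExhausted` callback would presumably stop the process (for example through the lifecycle or `System.exit(1)`) so that alerts or a supervisor can react.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; not Druid's or Curator's actual API. */
final class ZkRetryShutdownSketch {

    /**
     * Tries to connect once, then retries up to maxRetries more times.
     * If every attempt fails, runs onExhausted (e.g. a process shutdown)
     * and returns false. Returns true as soon as an attempt succeeds.
     */
    static boolean connectOrShutdown(BooleanSupplier attemptConnect,
                                     int maxRetries,
                                     long sleepMillis,
                                     Runnable onExhausted) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (attemptConnect.getAsBoolean()) {
                return true; // connected; the node keeps running
            }
            if (attempt < maxRetries) {
                try {
                    Thread.sleep(sleepMillis); // fixed pause between attempts
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break; // stop retrying if the thread is interrupted
                }
            }
        }
        // Retries exhausted: hand control to the shutdown hook instead of
        // leaving the node running in an inconsistent state.
        onExhausted.run();
        return false;
    }

    public static void main(String[] args) {
        // Simulated ZK that never comes back: 1 initial attempt + 3 retries.
        boolean connected = connectOrShutdown(
            () -> false,
            3,
            10L,
            () -> System.err.println(
                "ZK retries exhausted; a real node would exit here"));
        System.out.println("connected=" + connected);
    }
}
```

Passing the shutdown action as a `Runnable` keeps the sketch testable without killing the JVM; the actual change would need to hook into whatever signal Curator gives when its retry policy is exhausted.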