[Proposal] Shutdown druid processes upon complete loss of ZK connectivity #6518

@a2l007

Currently, if a Druid node loses connectivity to ZooKeeper, Curator retries the connection and eventually gives up. At that point the Druid node is left in an inconsistent state. If this happens to a broker, it keeps serving queries but may return incorrect results.
Historicals that lose ZK connectivity no longer show up on the coordinator console even though the process is still running, which can be tricky for cluster operators to identify.

The proposal I'm working on is to shut down the Druid process once the connection retries to ZK are exhausted. Shutting down makes more sense than leaving the node in an unstable state: a process exit can trigger any configured process alerts, and if a supervisor process is configured, it can restart the Druid process.
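
To make the intended behavior concrete, here is a minimal, self-contained sketch of the "retry, then shut down" idea. It is not Druid or Curator code: the class `ZkConnectionGuard`, its parameters, and the `connectAttempt` callback are hypothetical names made up for illustration, and the bounded retry loop only stands in for whatever retry policy Curator is configured with.

```java
import java.util.function.BooleanSupplier;

/**
 * Hypothetical sketch (not part of Druid or Curator): try to connect a
 * bounded number of times, then run a shutdown action instead of leaving
 * the process running without ZooKeeper connectivity.
 */
final class ZkConnectionGuard {
    private final int maxAttempts;
    private final long sleepMillis;
    private final Runnable onRetriesExhausted;

    ZkConnectionGuard(int maxAttempts, long sleepMillis, Runnable onRetriesExhausted) {
        if (maxAttempts < 1 || sleepMillis < 0) {
            throw new IllegalArgumentException("maxAttempts must be >= 1 and sleepMillis >= 0");
        }
        this.maxAttempts = maxAttempts;
        this.sleepMillis = sleepMillis;
        this.onRetriesExhausted = onRetriesExhausted;
    }

    /**
     * Calls connectAttempt until it returns true or maxAttempts is reached.
     * Returns true on success. Otherwise runs the shutdown action once and
     * returns false.
     */
    boolean connectOrShutdown(BooleanSupplier connectAttempt) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (connectAttempt.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis); // fixed pause between attempts
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and stop retrying early.
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        // Retries exhausted (or interrupted): hand control to the shutdown action.
        onRetriesExhausted.run();
        return false;
    }
}
```

In a real process the shutdown action might be something like `() -> System.exit(1)` or a call into the node's own lifecycle stop logic, so that a supervisor or alerting system sees the exit. Which hook in Curator should trigger it is left open here; that choice is part of what the proposal still has to settle.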
