Problems when ZK Leader shuts down #308

@sschepens

@merlimat We had to replace the current local ZK leader in a cluster, and this seems to have caused a LOT of issues in that cluster.
The brokers seem to have all shut down at the same time, leaving the cluster unable to handle traffic until they were restarted.
Also, a lot of consumers seem to have been reset to an earlier point in time, generating a huge backlog.

We see these logs before the broker apparently shut down:

March 22nd 2017, 15:31:43.156	2017-03-22 18:31:43,155 - INFO  - [main-SendThread(ip-10-64-102-223.ec2.internal:2181):ClientCnxn$SendThread@1158] - Unable to read additional data from server sessionid 0x35a384d476e093b, likely server has closed socket, closing socket connection and attempting reconnect
March 22nd 2017, 15:31:43.258	2017-03-22 18:31:43,258 - INFO  - [main-EventThread:ZooKeeperDataCache@131] - [State:CONNECTED Timeout:30000 sessionid:0x35a384d476e093b local:null remoteserver:null lastZxid:17254516425 xid:1622313 sent:1622313 recv:1791041 queuedpkts:1 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:Disconnected type:None path:null
March 22nd 2017, 15:31:43.258	2017-03-22 18:31:43,258 - WARN  - [main-EventThread:LeaderElectionService$1@111] - Got something wrong on watch: WatchedEvent state:Disconnected type:None path:null
March 22nd 2017, 15:31:43.258	2017-03-22 18:31:43,258 - WARN  - [main-EventThread:LeaderElectionService$1@92] - Type of the event is [None] and path is [null]
March 22nd 2017, 15:31:43.259	2017-03-22 18:31:43,258 - INFO  - [main-EventThread:ZooKeeperDataCache@131] - [State:CONNECTED Timeout:30000 sessionid:0x35a384d476e093b local:null remoteserver:null lastZxid:17254516425 xid:1622313 sent:1622313 recv:1791041 queuedpkts:2 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:Disconnected type:None path:null
March 22nd 2017, 15:31:43.259	2017-03-22 18:31:43,258 - INFO  - [main-EventThread:ZooKeeperDataCache@131] - [State:CONNECTED Timeout:30000 sessionid:0x35a384d476e093b local:null remoteserver:null lastZxid:17254516425 xid:1622313 sent:1622313 recv:1791041 queuedpkts:2 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:Disconnected type:None path:null
March 22nd 2017, 15:31:43.259	2017-03-22 18:31:43,258 - INFO  - [main-EventThread:ZooKeeperDataCache@131] - [State:CONNECTED Timeout:30000 sessionid:0x35a384d476e093b local:null remoteserver:null lastZxid:17254516425 xid:1622313 sent:1622313 recv:1791041 queuedpkts:2 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:Disconnected type:None path:null
March 22nd 2017, 15:31:43.259	2017-03-22 18:31:43,258 - INFO  - [main-EventThread:ZooKeeperDataCache@131] - [State:CONNECTED Timeout:30000 sessionid:0x35a384d476e093b local:null remoteserver:null lastZxid:17254516425 xid:1622313 sent:1622313 recv:1791041 queuedpkts:1 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:Disconnected type:None path:null
March 22nd 2017, 15:31:43.259	2017-03-22 18:31:43,259 - INFO  - [main-EventThread:ZooKeeperDataCache@131] - [State:CONNECTED Timeout:30000 sessionid:0x35a384d476e093b local:null remoteserver:null lastZxid:17254516425 xid:1622313 sent:1622313 recv:1791041 queuedpkts:2 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:Disconnected type:None path:null
March 22nd 2017, 15:31:43.259	2017-03-22 18:31:43,259 - INFO  - [main-EventThread:ZooKeeperSessionWatcher@87] - Received zookeeper notification, eventType=None, eventState=Disconnected
March 22nd 2017, 15:31:43.259	2017-03-22 18:31:43,259 - INFO  - [main-EventThread:ZooKeeperCache@346] - [State:CONNECTED Timeout:30000 sessionid:0x35a384d476e093b local:null remoteserver:null lastZxid:17254516425 xid:1622313 sent:1622313 recv:1791041 queuedpkts:2 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:Disconnected type:None path:null
March 22nd 2017, 15:31:43.587	2017-03-22 18:31:43,586 - INFO  - [main-SendThread(ip-10-64-102-117.ec2.internal:2181):ClientCnxn$SendThread@1032] - Opening socket connection to server ip-10-64-102-117.ec2.internal/10.64.102.117:2181. Will not attempt to authenticate using SASL (unknown error)
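
For anyone triaging logs like the ones above, here is a minimal Python sketch that pulls the ZooKeeper session state out of each line. It assumes the line layout seen in this excerpt: a leading log-viewer timestamp, a tab, then the broker's own `timestamp - LEVEL - [thread:class@line] - message`. The names `ZkEvent`, `parse_line`, and `count_states` are hypothetical helpers made up for this illustration, not part of Pulsar or ZooKeeper.

```python
import re
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class ZkEvent(NamedTuple):
    timestamp: str          # the broker-side timestamp (second one on the line)
    level: str              # INFO, WARN, ...
    source: str             # thread:class@line, as printed between the brackets
    state: Optional[str]    # ZooKeeper state named in the message, if any


# Matches the broker part of the line; any viewer timestamp before it is skipped
# because we use search() rather than match().
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - "
    r"(?P<level>[A-Z]+)\s+- "
    r"\[(?P<source>.+?)\] - "
    r"(?P<msg>.*)$"
)

# The excerpt reports the state in two spellings:
# "WatchedEvent state:Disconnected" and "eventState=Disconnected".
STATE_RE = re.compile(r"(?:WatchedEvent state:|eventState=)(\w+)")


def parse_line(line: str) -> Optional[ZkEvent]:
    """Return a ZkEvent for a broker log line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    state = STATE_RE.search(m.group("msg"))
    return ZkEvent(
        timestamp=m.group("ts"),
        level=m.group("level"),
        source=m.group("source"),
        state=state.group(1) if state else None,
    )


def count_states(lines: Iterable[str]) -> Counter:
    """Count how many log lines mention each ZooKeeper state."""
    counts: Counter = Counter()
    for line in lines:
        event = parse_line(line)
        if event is not None and event.state is not None:
            counts[event.state] += 1
    return counts
```

Calling `count_states(open("broker.log"))` (the file name is a placeholder) gives a quick tally of which states the broker reported, which helps tell apart lines that only mention `Disconnected` from any that mention a different state such as `Expired`.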

A couple of questions:
1 - Why do brokers shut down when the ZK leader disconnects?
2 - Why could this affect the consumers' backlog? We're running a branch that persists individualDeletedMessages.
