Netty server does not respond to ruok while initializing cluster #1770
andrekramer1 wants to merge 3 commits into apache:master
Conversation
eolivelli
left a comment
@lvfangmin @breed
Have you ever seen this kind of problem?
Do you have any fix in your fork?
One problem is that the 4-letter-word API is deprecated in favour of the HTTP Admin API. My understanding is that in this case we want to answer only "ruok" and not other commands. I remember this other PR that goes in the same direction.
@eolivelli Hi, I changed the log level to cut down on noise. On the other PR, I also have some checks for zkServer == null to avoid NPEs I saw, but I think not for the NPE that the other PR reported. What are the next steps for this, especially with regard to the wider potential impact on ZooKeeper?
The ZooKeeper issue seems to be https://issues.apache.org/jira/browse/ZOOKEEPER-3988
lhotari
left a comment
Thanks for the contribution @andrekramer1. LGTM
    // close it before starting the heavy TLS handshake
    if (!cnxn.isZKServerRunning()) {
        LOG.warn("Zookeeper server is not running, close the connection before starting the TLS handshake");
        ServerMetrics.getMetrics().CNXN_CLOSED_WITHOUT_ZK_SERVER_RUNNING.add(1);
I am re-reading the patch again.
It looks like we are no longer updating this metric, and we are also dropping this case.
My understanding is that this check is here to prevent a flood of useless (but heavyweight) TLS handshakes after restarting the ZK node.
I am not sure removing it is a good move.
This fix may work on a small cluster (one with very few ZK clients, I mean).
@lvfangmin If I read correctly (from git blame), this improvement was part of ZOOKEEPER-3682 and the set of patches ported from the Facebook ZooKeeper fork.
@andrekramer1 @anmolnar @lvfangmin I created an alternative, simpler patch
@eolivelli I think that would not allow it to progress to reporting that it's up if SSL is used. I don't know if that is important, but it was my reason for removing the "flood prevention". Have you tested that it solves the 3-node cluster initialization issue for Pulsar?
@andrekramer1 If we close the connection when the server is not ready, it should be fine (the probe will retry), because to the question "are you okay?" we cannot answer "I am okay" if the server is not ready to process requests. So:
in NIOServerCnxn, if the server is not running we still throw an error and close the connection.
There was a change to fix the NPE, but it may not have fixed this issue: #1798. It probably needs testing on the latest release; meanwhile, switching to NIOServerCnxn should be a workaround.
I am following up.
@andrekramer1 I verified it works.
Apache Pulsar (and, I suspect, others) uses ZooKeeper in a Kubernetes StatefulSet with liveness and readiness probes polling the "ruok" ZooKeeper command. With the Netty server configured, on later versions of ZooKeeper, the first replica would start but never become ready, so the StatefulSet could not scale up from 1 to the desired replica count. This is because the first replica never replies to "ruok"; it just closes the connection.
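Such a probe typically looks something like the following. This is an illustrative Kubernetes snippet, not taken from the Pulsar chart; the port, timings, and use of nc are assumptions. Note that since ZooKeeper 3.5.3 the four-letter commands must be whitelisted (e.g. 4lw.commands.whitelist=ruok) for "ruok" to be answered at all.

```yaml
# Hypothetical readiness probe polling the "ruok" four-letter command.
readinessProbe:
  exec:
    command:
      - sh
      - -c
      - '[ "$(echo ruok | nc -q 1 localhost 2181)" = "imok" ]'
  initialDelaySeconds: 10
  periodSeconds: 30
```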
Apache Pulsar issue apache/pulsar#11070 reported this failure, and this change set was created to get the server to respond to "ruok" while initializing. With these changes the set scales up to the desired 3 replicas.
The issue does not occur with the NIO server context (which is the default), but I've not compared the two to work out the exact differences; I just modified the Netty one to respond in more cases. There is also the tricky issue of disallowing exceedingly large requests (in the code below), as well as the general question of whether it is OK to proceed past these checks that were closing the connection. In a multi-threaded server, checking a variable like isRunning() could be a race in any case, so hopefully the code is still robust with these changes, but they should probably be taken over by an expert and used only as a starting point for a fix.
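The approach under discussion can be sketched as follows. This is a hypothetical, self-contained illustration, not the actual NettyServerCnxn code (RuokSketch and handleFirstFourBytes are invented names): check for the "ruok" command before the isZKServerRunning() check, and keep closing the connection for everything else while the server is initializing, which preserves the TLS-handshake flood protection for non-probe traffic.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class RuokSketch {
    // First four bytes of a connection interpreted as an int, the way the
    // real server dispatches four-letter-word commands.
    static final int RUOK =
            ByteBuffer.wrap("ruok".getBytes(StandardCharsets.US_ASCII)).getInt();

    /**
     * Decide what to do with the first four bytes of an incoming request.
     * Returns the reply to write, or null to close the connection.
     */
    static String handleFirstFourBytes(int cmd, boolean zkServerRunning) {
        if (cmd == RUOK) {
            // "ruok" only asks whether the process is alive, so it can be
            // answered even while the quorum is still forming.
            return "imok";
        }
        if (!zkServerRunning) {
            // Reject all other traffic until the server is ready; this keeps
            // the protection against a flood of heavyweight TLS handshakes.
            return null;
        }
        return ""; // hand off to normal request processing (elided here)
    }

    public static void main(String[] args) {
        System.out.println(handleFirstFourBytes(RUOK, false));       // imok
        System.out.println(handleFirstFourBytes(0x12345678, false)); // null
    }
}
```

With this shape, a readiness probe gets "imok" during initialization while every other connection is still dropped early, which is the trade-off debated above.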