Netty server does not respond to ruok while initializing cluster #1770

Closed
andrekramer1 wants to merge 3 commits into apache:master from andrekramer1:early-ruok

Conversation

@andrekramer1

Apache Pulsar (and others, I suspect) uses ZooKeeper in a Kubernetes StatefulSet with liveness and readiness probes polling the "ruok" ZooKeeper command. With the Netty server configured, on later versions of ZooKeeper, the first replica would start but never become ready, so the StatefulSet could not scale up from 1 to the desired replica count. This is because the first replica never replies to "ruok" - it just closes the connection.
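The probe setup described above looks roughly like this (an illustrative fragment, not the actual Pulsar Helm chart; the port and probe timings are assumptions):

```yaml
# Readiness probe for a ZooKeeper pod in a StatefulSet: the pod is only
# marked Ready once "ruok" is answered with "imok" on the client port.
readinessProbe:
  exec:
    command:
      - sh
      - -c
      - '[ "$(echo ruok | nc 127.0.0.1 2181)" = "imok" ]'
  initialDelaySeconds: 10
  periodSeconds: 30
```

If the server closes the connection instead of answering, the probe never succeeds, the pod stays NotReady, and the StatefulSet never starts replica 2.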

Apache Pulsar issue apache/pulsar#11070 reported this failure, and this change set was created to get the server to respond to "ruok" while initializing. With these changes the StatefulSet scales up to the desired 3 replicas.

The issue does not occur with the NIO server context (which is the default), but I've not compared the two to work out the exact differences - I've just modified the Netty one to respond in more cases. There's also a tricky issue of disallowing exceedingly large requests (in the code below), as well as the general question of whether it is OK to proceed past these checks that were closing the connection. In a multi-threaded server, checking a variable like isRunning() could be a race in any case, so hopefully it's still robust with these changes, but they probably need to be taken over by an expert and used just as a starting point for a fix.
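The idea of the change can be sketched as follows (a hypothetical helper with illustrative names, not the actual NettyServerCnxn code): answer "ruok" unconditionally, since it is only a liveness check, but keep closing the connection for anything that needs a running server.

```java
// Hypothetical sketch of the "early ruok" idea; names are illustrative and
// not the actual ZooKeeper NettyServerCnxn API.
public class FourLetterWordGate {

    /**
     * Decide the reply for a four-letter-word command.
     * Returns the reply to send, or null if the connection should be closed.
     */
    public static String handle(String cmd, boolean zkServerRunning) {
        if ("ruok".equals(cmd)) {
            // "ruok" is a liveness check: the process is up and able to
            // answer, even if the quorum is still forming, so reply "imok".
            return "imok";
        }
        if (!zkServerRunning) {
            // Every other command needs a running server; close rather than
            // letting the probe's connection hang.
            return null;
        }
        // Delegate to the normal four-letter-word processing (not shown).
        return "";
    }
}
```

The race the author mentions remains: zkServerRunning can flip between the check and the reply, but for "ruok" that is harmless since the reply is state-independent.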

Contributor

@eolivelli eolivelli left a comment


@lvfangmin @breed
Have you ever seen this kind of problem?
Do you have any fix in your fork?

@eolivelli
Contributor

One problem is that the 4-letter-words API is deprecated in favour of the HTTP Admin API.
I am not saying that we should not fix this, just that we would be fixing something that we want to drop one day.

My understanding is that we want to answer something only to "ruok" and not to other commands in this case.

I remember this other PR that is in the same direction
#1520

@andrekramer1
Author

@eolivelli Hi, I changed the log level to cut down on noise. On the other PR: I also have some checks for zkServer == null to avoid NPEs I saw, but I think not for the NPE that the other PR reported. What are the next steps for this - especially with regard to wider potential impact on ZooKeeper?

@lhotari
Member

lhotari commented Nov 25, 2021

The Zookeeper issue seems to be https://issues.apache.org/jira/browse/ZOOKEEPER-3988

Contributor

@eolivelli eolivelli left a comment


Lgtm

We have to include this in the upcoming 3.7.1 and 3.8.0 releases.

@anmolnar @maoling

Comment thread zookeeper-server/src/main/java/org/apache/zookeeper/server/NettyServerCnxn.java Outdated
Member

@lhotari lhotari left a comment


Thanks for the contribution @andrekramer1. LGTM

@lhotari
Member

lhotari commented Jan 5, 2022

@anmolnar @maoling please review

```java
// close it before starting the heavy TLS handshake
if (!cnxn.isZKServerRunning()) {
    LOG.warn("Zookeeper server is not running, close the connection before starting the TLS handshake");
    ServerMetrics.getMetrics().CNXN_CLOSED_WITHOUT_ZK_SERVER_RUNNING.add(1);
```
Contributor


I am re-reading the patch again.
It looks like we are no longer updating this metric, and we are also dropping this case.

My understanding is that this check is here to prevent a flood of useless (but heavyweight) TLS handshakes after restarting the ZK node.

I am not sure removing this is a good move.
This fix may work on a small cluster (one with very few ZK clients, I mean).

@lvfangmin If I read correctly (from git blame) this improvement was part of ZOOKEEPER-3682 and the set of patches ported from Facebook ZooKeeper fork.

@eolivelli
Contributor

@andrekramer1 @anmolnar @lvfangmin I created an alternative, simpler patch,
#1798, which does not change the flood-prevention behaviour

@andrekramer1
Author

andrekramer1 commented Jan 20, 2022

@eolivelli I think that would not allow it to progress to reporting it's up if SSL is used. I don't know if that is important, but it was my reason for removing the "flood prevention". Have you tested that it solves the 3-node cluster initialization issue for Pulsar?

@eolivelli
Contributor

@andrekramer1
My understanding is that the problem is that, due to the NPE, the connection hangs.

If we close the connection when the server is not ready, then it should be fine (the probe will retry).

Because to the question "are you okay?" we cannot answer "I am okay" if the server is not ready to process requests.

so:

  • for TLS: no server running -> close the connection
  • for non TLS: no server running -> close the connection (do not throw NPE and let the connection hang)
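The two cases above amount to a single decision, which could be sketched like this (a hypothetical helper, not the real connection code):

```java
// Hypothetical sketch of the behaviour described above: whether TLS is in
// use or not, a connection arriving before the server is running is closed,
// so the probe fails fast and retries instead of hanging on an NPE.
public class EarlyConnectionPolicy {

    public enum Action { CLOSE, START_TLS_HANDSHAKE, READ_REQUEST }

    public static Action onNewConnection(boolean zkServerRunning, boolean tls) {
        if (!zkServerRunning) {
            return Action.CLOSE; // same outcome for TLS and plaintext
        }
        return tls ? Action.START_TLS_HANDSHAKE : Action.READ_REQUEST;
    }
}
```

Closing early also preserves the flood-prevention property: no TLS handshake work is spent on connections the server cannot serve yet.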

> Have you tested it solves the 3 node cluster initialisation issue for Pulsar?

No, I haven't.

In NIOServerCnxn, if the server is not running, we still throw an error and close the connection.

@andrekramer1
Author

There was a change to fix the NPE, but it may not have fixed this issue: #1798. It probably needs testing on the latest release; meanwhile, switching to NIOServerCnxn should be a workaround.

@eolivelli
Contributor

I am following up.
My goal is to have this fixed and to cut a release.

@eolivelli
Contributor

@andrekramer1
this is my fix #1799

I verified it works.
Please follow up there for new comments

@eolivelli eolivelli closed this Jan 26, 2022