Netty server does not respond to ruok while initializing cluster by andrekramer1 · Pull Request #1770 · apache/zookeeper

andrekramer1 · 2021-10-07T10:27:35Z

Apache Pulsar (and others I suspect) use Zookeeper in a Kubernetes statefull set with liveness and ready probes polling the "ruok" Zookeeper command. With the Netty server configured, on later versions of Zookeeper, the first replica would start but never become ready so the statefull set could not scale up from 1 to the desired replica count. This is due to the first replica never replying to "ruok" - it just closes the connection.

Apache Pulsar issue apache/pulsar#11070 reported this failure and this change set was created to get the server to respond to "ruok" while initializing. With these changes the set scales up to the desired 3 replicas.

The issue does not occur with the NIO server context (which is the default) but I've not compared the two to work out exact differences - just modified the Netty one to respond in more cases. There's a tricky issue of disallowing exceedingly large requests as well (in code below) as well as the general question of is it ok to proceed past these checks that were closing the connection. In a multi-threaded server checking a variable isRunning() could be a race in any case so hopefully it's still robust with these changes but they probably need to be taken over by an expert and just used as a starting point for a fix.

eolivelli

@lvfangmin @breed
have you ever seen this kind of problem ?
do you have any fix in your fork ?

eolivelli · 2021-10-07T14:48:00Z

One problem is the the 4 letters words API is deprecated in favour of the HTTP Admin API.
I am not saying that we should not fix this, but that we are going to fix something that we want to drop one day.

My understanding is that we want to answer something only to "ruok" and not to other commands in this case.

I remember this other PR that is in the same direction
#1520

andrekramer1 · 2021-10-15T14:51:37Z

@eolivelli Hi, I changed the log level to cut down on noise. On the other PR, I also have some checks for zkServer == null to avoid NPEs I saw but I think not for the NPE that the other PR reported. What are the next steps for this - especially with regards to wider potential impact on Zookkeeper?

lhotari · 2021-11-25T23:07:07Z

The Zookeeper issue seems to be https://issues.apache.org/jira/browse/ZOOKEEPER-3988

eolivelli

Lgtm

We have to include this in the upcoming 3.7.1 and 3.8.0 releases.

@anmolnar @maoling

lhotari

Thanks for the contribution@andrekramer1. LGTM

lhotari · 2022-01-05T15:35:24Z

@anmolnar @maoling please review

eolivelli · 2022-01-17T07:57:08Z

-            // close it before starting the heavy TLS handshake
-            if (!cnxn.isZKServerRunning()) {
-                LOG.warn("Zookeeper server is not running, close the connection before starting the TLS handshake");
-                ServerMetrics.getMetrics().CNXN_CLOSED_WITHOUT_ZK_SERVER_RUNNING.add(1);


I am re-reading the patch again.
It looks like we are no more updating this metric.
and also we are dropping this case.

My understanding is that this check is here to prevent a flood of useless (but heavyweight) TLS handshakes after restarting the ZK node.

I am not sure this is a good move to remove this.
This fix may work on a small cluster (with very few ZK clients I mean)

@lvfangmin If I read correctly (from git blame) this improvement was part of ZOOKEEPER-3682 and the set of patches ported from Facebook ZooKeeper fork.

eolivelli · 2022-01-20T11:08:09Z

@andrekramer1 @anmolnar @lvfangmin I created an alternative, simpler, patch
#1798 that does not change the behaviour to prevent floods

andrekramer1 · 2022-01-20T11:18:04Z

@eolivelli I think that would not allow it progress to report it's up if SSL is used. Don't know if that is important but was my reason for removing the "flood prevention". Have you tested it solves the 3 node cluster initialization issue for Pulsar?

eolivelli · 2022-01-20T11:26:29Z

@andrekramer1
my understanding is that the problem is that due to the NPE the connection hangs.

if we close the connection when the server is not ready that it should be fine (the probe will retry)

Because to the question "are you okay?" we cannot answer "I am okay" if the server is not ready to process requests.

so:

for TLS: no server running -> close the connection
for non TLS: no server running -> close the connection (do not throw NPE and let the connection hang)

Have you tested it solves the 3 node cluster initialisation issue for Pulsar?
no I haven't

in NIOServerCnxn if the server is not running we still throw an error and close the connection

zookeeper/zookeeper-server/src/main/java/org/apache/zookeeper/server/NIOServerCnxn.java

Line 423 in 94d0c4d

if (!isZKServerRunning()) {

andrekramer1 · 2022-01-25T14:51:42Z

There was a change to fix the NPE but it may not have fixed this issue: #1798 Probably needs testing on latest release and meanwhile switching to NIOServerCNXNs should be a work around.

eolivelli · 2022-01-25T17:08:45Z

I am following up.
My goal is to have this fixed and to cut a release

eolivelli · 2022-01-26T08:51:07Z

@andrekramer1
this is my fix #1799

I verified it works.
Please follow up there for new comments

early ruok for kubernetes ready probe to succeed

b40e18f

andrekramer1 mentioned this pull request Oct 7, 2021

Zookeeper in 3 node Kubernetes cluster does not pass heath check in 2.8.0 apache/pulsar#11070

Closed

eolivelli approved these changes Oct 7, 2021

View reviewed changes

debug log level & remove commented out code

c926b2f

eolivelli approved these changes Jan 2, 2022

View reviewed changes

lhotari suggested changes Jan 5, 2022

View reviewed changes

Comment thread zookeeper-server/src/main/java/org/apache/zookeeper/server/NettyServerCnxn.java Outdated

lhotari reviewed Jan 5, 2022

View reviewed changes

Comment thread zookeeper-server/src/main/java/org/apache/zookeeper/server/NettyServerCnxn.java Outdated

review comments

06e65e5

lhotari approved these changes Jan 5, 2022

View reviewed changes

eolivelli requested changes Jan 17, 2022

View reviewed changes

eolivelli mentioned this pull request Jan 20, 2022

ZOOKEEPER-3988 rg.apache.zookeeper.server.NettyServerCnxn.receiveMessage throws NullPointerException #1798

Closed

eolivelli closed this Jan 26, 2022

Conversation

andrekramer1 commented Oct 7, 2021

Uh oh!

eolivelli left a comment

Choose a reason for hiding this comment

Uh oh!

eolivelli commented Oct 7, 2021

Uh oh!

andrekramer1 commented Oct 15, 2021

Uh oh!

lhotari commented Nov 25, 2021

Uh oh!

eolivelli left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

lhotari left a comment

Choose a reason for hiding this comment

Uh oh!

lhotari commented Jan 5, 2022

Uh oh!

eolivelli Jan 17, 2022

Choose a reason for hiding this comment

Uh oh!

eolivelli commented Jan 20, 2022

Uh oh!

andrekramer1 commented Jan 20, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eolivelli commented Jan 20, 2022

Uh oh!

andrekramer1 commented Jan 25, 2022

Uh oh!

eolivelli commented Jan 25, 2022

Uh oh!

eolivelli commented Jan 26, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

andrekramer1 commented Jan 20, 2022 •

edited

Loading