[Issue 11070][Zookeeper] Fix netcat returning early for probe by frederic-kneier · Pull Request #14088 · apache/pulsar

frederic-kneier · 2022-02-01T16:46:46Z

Netcat returns before Zookeeper is able to reply, leading to a failed check even if the reply would arrive shortly thereafter.

Fixes #11070

Motivation

Readiness and liveness probes in Kubernetes fail in some cases because the check does not wait for a response.

Modifications

The check script now waits 1 second for a response.

Verifying this change

Since this problem is caused by a race condition, testing is a bit complicated.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API: no
The schema: no
The default values of configurations: no
The wire protocol: no
The rest endpoints: no
The admin cli options: no
Anything that affects deployment: don't know

Documentation

Check the box below or label this PR directly (if you have committer privilege).

Need to update docs?

[ x ] no-need-doc

The actual intention of the script does not change.

michaeljmarshall

@frederic-kneier would you mind providing a little more explanation for this change and the problem it's solving? In looking at the ZK documentation (https://zookeeper.apache.org/doc/r3.7.0/zookeeperAdmin.html), I don't see any indication that users should need to use the -q argument. Based on my understanding, echo ruok is not generating an EOF, so the -q argument won't change the script's behavior.

frederic-kneier · 2022-02-01T19:21:22Z

If you pipe the command to nc, the input stream is closed instantly, which under certain conditions leads to nc terminating before the server is able to send the reply. This produces an empty output, which in turn fails the health check. The behavior seems to differ between versions of nc (OpenBSD, Linux). Since the cause of the problem is a race condition, "-q 1" makes nc wait one second before it terminates, so the server is able to send the reply. The behavior is reproducible on certain Kubernetes clusters with small nodes and seems to be fixed by this change.

for run in {1..10}; do echo ruok | nc localhost 2181; done => imokimokimokimokimok
for run in {1..10}; do echo ruok | nc -q 1 localhost 2181; done => imokimokimokimokimokimokimokimokimokimok
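
Counting the replies is easier than eyeballing the concatenated output. The following is only a sketch for reproducing the comparison: the helper names run_probes and count_imok are made up for illustration and are not part of this PR or of the probe script.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for reproducing the comparison; not part of the PR.

# Run the given command N times, feeding "ruok" on stdin each time.
# Usage: run_probes <attempts> <command> [args...]
run_probes() {
  local attempts="$1"
  shift
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    echo ruok | "$@"
    i=$((i + 1))
  done
}

# Count how many "imok" replies appear in the concatenated output.
count_imok() {
  # grep -o prints one line per match; wc -l counts those lines.
  printf '%s' "$1" | grep -o 'imok' | wc -l | tr -d ' '
}
```

Assuming a Zookeeper listening on localhost:2181, count_imok "$(run_probes 10 nc localhost 2181)" and count_imok "$(run_probes 10 nc -q 1 localhost 2181)" give the two numbers to compare. Applied to the two outputs quoted above, count_imok yields 5 and 10.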

lhotari · 2022-02-01T19:44:07Z

Thanks for the explanation @frederic-kneier .
Btw. I've been struggling with the Zookeeper probes and this has been causing some instability in https://github.com/apache/pulsar-helm-chart . Some attempts to improve the situation:
apache/pulsar-helm-chart#220
apache/pulsar-helm-chart#214
apache/pulsar-helm-chart#202

I just wonder if the value for -q should be more than 1?

lhotari · 2022-02-01T19:46:17Z

Since the cause of the problem is a race condition

@frederic-kneier could you share a reference to the race condition that you are referring to? It would be interesting to learn more about that. Thanks! Great work on this issue!

lhotari · 2022-02-01T19:58:12Z

I did some googling and there's -q -1 in this answer: https://unix.stackexchange.com/a/274603/ .

lhotari · 2022-02-01T20:05:29Z

I'm running this experiment in the Apache Pulsar Helm Chart: apache/pulsar-helm-chart@98dd029 . Let's see if the tests finally pass.

michaeljmarshall

@frederic-kneier - thank you for the explanation and the example! That definitely makes sense. I agree with @lhotari in wondering if the value should be more than 1? We could defer on choosing the value and consider making it configurable, too.
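
If the delay were made configurable, one possible shape is sketched below. This is an assumption-laden illustration, not the script in this PR: the environment variable names ZK_HOST, ZK_PORT and ZK_NC_QUIT_DELAY and the function names are hypothetical, and the defaults simply mirror the values used in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical knobs; none of these variable names exist in the PR.
: "${ZK_HOST:=localhost}"
: "${ZK_PORT:=2181}"
: "${ZK_NC_QUIT_DELAY:=1}"   # seconds nc waits after EOF on stdin (-q)

# Succeed only if the reply is exactly "imok".
zk_reply_ok() {
  [ "$1" = "imok" ]
}

# Send ruok with the configured delay and evaluate the reply.
zk_ruok() {
  local reply
  reply=$(echo ruok | nc -q "$ZK_NC_QUIT_DELAY" "$ZK_HOST" "$ZK_PORT")
  zk_reply_ok "$reply"
}
```

A probe could then call zk_ruok and use its exit status, for example ZK_NC_QUIT_DELAY=3 zk_ruok to try a longer wait without editing the script.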

frederic-kneier · 2022-02-01T21:33:57Z

@lhotari calling "echo ruok | nc -q -1 localhost 2181" does not solve the problem. It has to be "echo ruok | nc -q 1 localhost 2181"

lhotari · 2022-02-02T05:56:32Z

@frederic-kneier ok, good to hear about that. I'm trying to understand the reason why it fixes the problem.

If you pipe the command to nc the input stream is closed instantly which leads to nc terminating in certain conditions even before the server is able to send the reply.

In this case, this explanation doesn't seem to hold. The apache/pulsar:2.8.2 image contains netcat-openbsd and for it, -q -1 is the default and would prevent closing the socket after stdin EOF.

Here are some parts of the man page for netcat-openbsd nc:

    -N      shutdown(2) the network socket after EOF on the input.  Some servers require this to finish their work.

     -q seconds
             after EOF on stdin, wait the specified number of seconds and then quit. If seconds is negative, wait forever (default).  Specifying a non-negative seconds implies -N.

It says "Some servers require this to finish their work". I wonder why this is the case for Zookeeper. It seems to happen only when Zookeeper is configured using org.apache.zookeeper.server.NettyServerCnxnFactory.
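
One way to see which of these flags makes a difference on a given node is to send ruok once per variant and print what comes back. The helper below is a hypothetical sketch (compare_nc_variants is not an existing script), and it assumes the netcat-openbsd semantics quoted above; other nc implementations may reject -N or -q. Variants that wait for the server to close the connection may block if the server never does.

```shell
#!/usr/bin/env bash
# Hypothetical helper: send ruok once per flag variant and print the reply.
# Flag semantics are those of the netcat-openbsd man page quoted above.
compare_nc_variants() {
  local host="${1:-localhost}"
  local port="${2:-2181}"
  local flags reply
  for flags in "" "-N" "-q 1" "-q -1"; do
    # $flags is intentionally unquoted so "-q 1" splits into two arguments.
    reply=$(echo ruok | nc $flags "$host" "$port")
    printf 'nc %s => %s\n' "${flags:-default}" "${reply:-<empty>}"
  done
}
```

Running compare_nc_variants several times in a row is more telling than a single run, since the failure is intermittent.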

lhotari

This is an awesome finding, but let's hold back merging until we have an explanation of why -q 1 solves the problem.

lhotari · 2022-02-02T06:32:13Z

I found some explanation in https://stackoverflow.com/questions/4160347/close-vs-shutdown-socket/23483487 .
netcat will send a FIN and close the connection cleanly when using -q 1. I guess that the default for netcat-openbsd is that it will wait for the other end to close the connection unless -q 1 is specified.
It feels like a bug in Zookeeper, but I'm fine if this mitigates it. I just wonder what other consequences there could be.

lhotari · 2022-02-02T06:47:47Z

I made a similar change to the Apache Pulsar Helm chart: apache/pulsar-helm-chart#223 . Instead of relying on the pulsar-zookeeper-ruok.sh script in the container image, I replaced it with bash -c 'echo ruok | nc -q 1 localhost 2181 | grep imok'.

lhotari · 2022-02-02T07:58:26Z

The -q 1 didn't solve the issue in the Pulsar Helm Chart CI tests when TLS is enabled for Zookeeper.
I made a change to send the "ruok" to the TLS port when TLS is enabled. The command used is bash -c 'echo ruok | openssl s_client -quiet -crlf -connect localhost:2281 -cert /pulsar/certs/zookeeper/tls.crt -key /pulsar/certs/zookeeper/tls.key | grep imok'. A similar approach is used in the Bitnami Zookeeper Helm chart for Zookeeper probes. This doesn't resolve the issue and TLS tests still fail occasionally. This means that the problem is elsewhere. Perhaps the ZK fix apache/zookeeper#1800 resolves it.
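
For reference, the choice between the two probe commands can be sketched as a single function. This is only an illustration of the selection logic: ZK_TLS_ENABLED and zk_probe are hypothetical names, while the ports, certificate paths and commands are the ones quoted in this thread. As noted above, switching to the TLS port did not make the occasional failures go away.

```shell
#!/usr/bin/env bash
# ZK_TLS_ENABLED is a hypothetical switch, not a setting from the chart.
# Ports, cert paths and commands are the ones quoted in this thread.
zk_probe() {
  if [ "${ZK_TLS_ENABLED:-false}" = "true" ]; then
    # TLS port: let openssl s_client carry the four-letter command.
    echo ruok | openssl s_client -quiet -crlf -connect localhost:2281 \
      -cert /pulsar/certs/zookeeper/tls.crt \
      -key /pulsar/certs/zookeeper/tls.key 2>/dev/null | grep -q imok
  else
    # Plaintext port: wait 1 second after EOF on stdin before quitting.
    echo ruok | nc -q 1 localhost 2181 | grep -q imok
  fi
}
```

The function's exit status is that of grep, so it is 0 only when imok appears in the reply.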

…ecify "-q 1" for nc (#223)
- NOTICE: we are no more using "bin/pulsar-zookeeper-ruok.sh" from the apachepulsar/pulsar docker image. The probe script is part of the chart.
* Pass "-q 1" to netcat (nc) to fix issue with Zookeeper ruok probe
- see apache/pulsar#14088
* Send ruok to TLS port when TLS is enabled
* Bump chart version

Fix netcat returning early

4c0eaab

github-actions Bot assigned frederic-kneier Feb 1, 2022

github-actions Bot added the doc-not-needed Your PR changes do not impact docs label Feb 1, 2022

merlimat approved these changes Feb 1, 2022

merlimat added this to the 2.10.0 milestone Feb 1, 2022

merlimat added component/deploy type/bug The PR fixed a bug or issue reported a bug labels Feb 1, 2022

michaeljmarshall requested a review from eolivelli February 1, 2022 18:18

michaeljmarshall reviewed Feb 1, 2022

lhotari mentioned this pull request Feb 1, 2022

Bump to Pulsar 2.8.2 apache/pulsar-helm-chart#190

Closed

lhotari approved these changes Feb 1, 2022

lhotari added a commit to 315157973/pulsar-helm-chart that referenced this pull request Feb 1, 2022

Pass "-q -1" to fix race condition

98dd029

- see apache/pulsar#14088, https://unix.stackexchange.com/a/274603/

lhotari mentioned this pull request Feb 1, 2022

Zookeeper Pod Restarts frequently apache/pulsar-helm-chart#222

Closed

michaeljmarshall approved these changes Feb 1, 2022

lhotari requested changes Feb 2, 2022

lhotari approved these changes Feb 2, 2022

lhotari added a commit to 315157973/pulsar-helm-chart that referenced this pull request Feb 2, 2022

Pass "-q 1" to netcat (nc) to fix issue with Zookeeper ruok probe

133c9f5

- see apache/pulsar#14088

lhotari added a commit to lhotari/pulsar-helm-chart that referenced this pull request Feb 2, 2022

Pass "-q 1" to netcat (nc) to fix issue with Zookeeper ruok probe

8d72d6d

- see apache/pulsar#14088

lhotari mentioned this pull request Feb 2, 2022

Improve Zookeeper "ruok" probes: use TLS port when TLS is enabled, specify "-q 1" for nc apache/pulsar-helm-chart#223

Merged

michaeljmarshall added the component/zookeeper label Feb 2, 2022

codelipenghui approved these changes Feb 6, 2022

lhotari added a commit to lhotari/pulsar-helm-chart that referenced this pull request Feb 7, 2022

Pass "-q 1" to netcat (nc) to fix issue with Zookeeper ruok probe

0872aa0

- see apache/pulsar#14088

codelipenghui merged commit 9284d42 into apache:master Feb 8, 2022

lhotari mentioned this pull request Mar 8, 2022

Upgrade Zookeeper to 3.8.0 #14601

Merged

Nicklee007 pushed a commit to Nicklee007/pulsar that referenced this pull request Apr 20, 2022

Fix netcat returning early (apache#14088)

e8b8cd7

lhotari mentioned this pull request Dec 11, 2023

Fix netcat zookeeper connect command to make it work in linux machines apache/pulsar-helm-chart#383

Closed

lhotari mentioned this pull request Jun 7, 2024

[Bug] bin/pulsar-zookeeper-ruok.sh fails with apachepulsar/pulsar:3.3.0 image #22872

Closed
