Skip to content

[Issue 11070][Zookeeper] Fix netcat returning early for probe#14088

Merged
codelipenghui merged 1 commit intoapache:masterfrom
frederic-kneier:patch-1
Feb 8, 2022
Merged

[Issue 11070][Zookeeper] Fix netcat returning early for probe#14088
codelipenghui merged 1 commit intoapache:masterfrom
frederic-kneier:patch-1

Conversation

@frederic-kneier
Copy link
Copy Markdown
Contributor

Netcat returns before zookeeper is able to reply leading to a failed check even if the reply would arrive shortly thereafter.

Fixes #11070

Motivation

Readiness and liveness probes in Kubernetes fail in Kubernetes in some cases because the check does not wait for a response.

Modifications

The check script now waits for 1 seconds for a response.

Verifying this change

Since this problem is caused by a race condition, testing is a bit complicated.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API: no
  • The schema: no
  • The default values of configurations: no
  • The wire protocol: no
  • The rest endpoints: no
  • The admin cli options: no
  • Anything that affects deployment: don't know

Documentation

Check the box below or label this PR directly (if you have committer privilege).

Need to update docs?

  • [ x ] no-need-doc

    The actual intention of the script does not change.

Netcat returns before zookeeper is able to reply leading to a failed check even if the reply would arrive shortly thereafter.
@github-actions github-actions Bot added the doc-not-needed Your PR changes do not impact docs label Feb 1, 2022
@merlimat merlimat added this to the 2.10.0 milestone Feb 1, 2022
@merlimat merlimat added component/deploy type/bug The PR fixed a bug or issue reported a bug labels Feb 1, 2022
Copy link
Copy Markdown
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@frederic-kneier would you mind providing a little more explanation for this change and the problem its solving? In looking at the ZK documentation (https://zookeeper.apache.org/doc/r3.7.0/zookeeperAdmin.html), I don't see any indication that users should need to use the -q argument. Based on my understanding, echo ruok is not generating an EOF, so the -q argument won't change the script's behavior.

@frederic-kneier
Copy link
Copy Markdown
Contributor Author

If you pipe the command to nc the input stream is closed instantly which leads to nc terminating in certain conditions even before the server is able to send the reply. This leads to an empty output which then leads to a failed health check. This behavior seems to be different for different version of nc (OpenBSD, Linux). Since the cause of the problem is a race condition the "-q 1" will wait one second before the program terminates and the server is able to send the reply. This behavior is reproducable on certain Kubernetes clusters with small nodes and seems to be fixed with this change.

for run in {1..10}; do echo ruok | nc localhost 2181; done => imokimokimokimokimok
for run in {1..10}; do echo ruok | nc -q 1 localhost 2181; done => imokimokimokimokimokimokimokimokimokimok

@lhotari
Copy link
Copy Markdown
Member

lhotari commented Feb 1, 2022

If you pipe the command to nc the input stream is closed instantly which leads to nc terminating in certain conditions even before the server is able to send the reply. This leads to an empty output which then leads to a failed health check. This behavior seems to be different for different version of nc (OpenBSD, Linux). Since the cause of the problem is a race condition the "-q 1" will wait one second before the program terminates and the server is able to send the reply. This behavior is reproducable on certain Kubernetes clusters with small nodes and seems to be fixed with this change.

for run in {1..10}; do echo ruok | nc localhost 2181; done => imokimokimokimokimok for run in {1..10}; do echo ruok | nc -q 1 localhost 2181; done => imokimokimokimokimokimokimokimokimokimok

Thanks for the explanation @frederic-kneier .
Btw. I've been struggling with the Zookeeper probes and this has been causing some instability in https://github.com/apache/pulsar-helm-chart . Some attempts to improve the situation:
apache/pulsar-helm-chart#220
apache/pulsar-helm-chart#214
apache/pulsar-helm-chart#202

I just wonder if the value for -q should be more than 1?

@lhotari
Copy link
Copy Markdown
Member

lhotari commented Feb 1, 2022

Since the cause of the problem is a race condition

@frederic-kneier could you share a reference to the race condition that you are referring? It would be interesting to learn more about that. Thanks! Great work on this issue!

@lhotari
Copy link
Copy Markdown
Member

lhotari commented Feb 1, 2022

I did some googling and there's -q -1 in this answer: https://unix.stackexchange.com/a/274603/ .

lhotari added a commit to 315157973/pulsar-helm-chart that referenced this pull request Feb 1, 2022
@lhotari
Copy link
Copy Markdown
Member

lhotari commented Feb 1, 2022

I'm running this experiment in the Apache Pulsar Helm Chart: apache/pulsar-helm-chart@98dd029 . Let's see if the tests finally pass.

Copy link
Copy Markdown
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you pipe the command to nc the input stream is closed instantly which leads to nc terminating in certain conditions even before the server is able to send the reply. This leads to an empty output which then leads to a failed health check. This behavior seems to be different for different version of nc (OpenBSD, Linux). Since the cause of the problem is a race condition the "-q 1" will wait one second before the program terminates and the server is able to send the reply. This behavior is reproducable on certain Kubernetes clusters with small nodes and seems to be fixed with this change.

for run in {1..10}; do echo ruok | nc localhost 2181; done => imokimokimokimokimok for run in {1..10}; do echo ruok | nc -q 1 localhost 2181; done => imokimokimokimokimokimokimokimokimokimok

@frederic-kneier - thank you for the explanation and the example! That definitely makes sense. I agree with @lhotari in wondering if the value should be more than 1? We could defer on choosing the value and consider making it configurable, too.

@frederic-kneier
Copy link
Copy Markdown
Contributor Author

frederic-kneier commented Feb 1, 2022

@lhotari calling "echo ruok | nc -q -1 localhost 2181" does not solve the problem. It has to be "echo ruok | nc -q 1 localhost 2181"

@lhotari
Copy link
Copy Markdown
Member

lhotari commented Feb 2, 2022

@lhotari calling "echo ruok | nc -q -1 localhost 2181" does not solve the problem. It has to be "echo ruok | nc -q 1 localhost 2181"

@frederic-kneier ok, good to hear about that. I'm trying to understand the reason why it fixes the problem.

If you pipe the command to nc the input stream is closed instantly which leads to nc terminating in certain conditions even before the server is able to send the reply.

In this case, this explanation doesn't seem to hold. The apache/pulsar:2.8.2 image contains netcat-openbsd and for it, -q -1 is the default and would prevent closing the socket after stdin EOF.

Here's some parts of the man page for netcat-openbsd nc

    -N      shutdown(2) the network socket after EOF on the input.  Some servers require this to finish their work.

     -q seconds
             after EOF on stdin, wait the specified number of seconds and then quit. If seconds is negative, wait forever (default).  Specifying a non-negative seconds implies -N.

It says "Some servers require this to finish their work". I wonder why this is the case for Zookeeper. It seems to happen only when Zookeeper is configured using org.apache.zookeeper.server.NettyServerCnxnFactory.

Copy link
Copy Markdown
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an awesome finding, but let's hold back merging until we have an explanation why -q 1 solves the problem.

@lhotari
Copy link
Copy Markdown
Member

lhotari commented Feb 2, 2022

I found some explanation in https://stackoverflow.com/questions/4160347/close-vs-shutdown-socket/23483487 .
netcat will send a FIN and close the connection cleanly when using -q 1. I guess that the default for netcat-openbsd is that it will wait for the other end to close the connection unless -q 1 is specified.
It feels like a bug in Zookeeper, but I'm fine if this mitigates it. I'm just wonder what other consequences there could be.

@lhotari
Copy link
Copy Markdown
Member

lhotari commented Feb 2, 2022

I made a similar change to the Apache Pulsar Helm chart: apache/pulsar-helm-chart#223 . Instead of relying on the pulsar-zookeeper-ruok.sh script in the container image, I replaced it with bash -c 'echo ruok | nc -q 1 localhost 2181 | grep imok'.

@lhotari
Copy link
Copy Markdown
Member

lhotari commented Feb 2, 2022

the -q 1 didn't solve the issue in Pulsar Helm Chart CI tests when TLS is enabled for Zookeeper.
I made a change to send the "ruok" to the TLS port when TLS is enabled. The command used is bash -c 'echo ruok | openssl s_client -quiet -crlf -connect localhost:2281 -cert /pulsar/certs/zookeeper/tls.crt -key /pulsar/certs/zookeeper/tls.key | grep imok'. A similar approach is used in the Bitnami Zookeeper Helm chart for Zookeeper probes. This doesn't resolve the issue and TLS tests still fail occasionally. This means that the problem is elsewhere. Perhaps the ZK fix apache/zookeeper#1800 resolves it.

lhotari added a commit to lhotari/pulsar-helm-chart that referenced this pull request Feb 7, 2022
@codelipenghui codelipenghui merged commit 9284d42 into apache:master Feb 8, 2022
lhotari added a commit to apache/pulsar-helm-chart that referenced this pull request Feb 17, 2022
…ecify "-q 1" for nc (#223)

- NOTICE: we are no more using "bin/pulsar-zookeeper-ruok.sh" from the apachepulsar/pulsar docker image. The probe script is part of the chart.

* Pass "-q 1" to netcat (nc) to fix issue with Zookeeper ruok probe

- see apache/pulsar#14088

* Send ruok to TLS port when TLS is enabled

* Bump chart version
Nicklee007 pushed a commit to Nicklee007/pulsar that referenced this pull request Apr 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

doc-not-needed Your PR changes do not impact docs type/bug The PR fixed a bug or issue reported a bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Zookeeper in 3 node Kubernetes cluster does not pass heath check in 2.8.0

5 participants