[Issue 11070][Zookeeper] Fix netcat returning early for probe #14088
codelipenghui merged 1 commit into apache:master
Netcat returns before zookeeper is able to reply leading to a failed check even if the reply would arrive shortly thereafter.
michaeljmarshall
left a comment
@frederic-kneier would you mind providing a little more explanation for this change and the problem it's solving? Looking at the ZK documentation (https://zookeeper.apache.org/doc/r3.7.0/zookeeperAdmin.html), I don't see any indication that users should need to use the -q argument. Based on my understanding, echo ruok is not generating an EOF, so the -q argument won't change the script's behavior.
If you pipe the command to nc, the input stream is closed instantly, which in certain conditions leads to nc terminating even before the server is able to send the reply. This produces an empty output, which then leads to a failed health check. This behavior seems to differ between versions of nc (OpenBSD, Linux). Since the cause of the problem is a race condition, "-q 1" makes nc wait one second before terminating, so the server is able to send the reply. This behavior is reproducible on certain Kubernetes clusters with small nodes and seems to be fixed with this change.
for run in {1..10}; do echo ruok | nc localhost 2181; done => imokimokimokimokimok
Thanks for the explanation @frederic-kneier . I just wonder if the value for |
@frederic-kneier could you share a reference to the race condition that you are referring to? It would be interesting to learn more about that. Thanks! Great work on this issue!
I did some googling and there's |
I'm running this experiment in the Apache Pulsar Helm Chart: apache/pulsar-helm-chart@98dd029. Let's see if the tests finally pass.
michaeljmarshall
left a comment
If you pipe the command to nc, the input stream is closed instantly, which in certain conditions leads to nc terminating even before the server is able to send the reply. This produces an empty output, which then leads to a failed health check. This behavior seems to differ between versions of nc (OpenBSD, Linux). Since the cause of the problem is a race condition, "-q 1" makes nc wait one second before terminating, so the server is able to send the reply. This behavior is reproducible on certain Kubernetes clusters with small nodes and seems to be fixed with this change.
for run in {1..10}; do echo ruok | nc localhost 2181; done => imokimokimokimokimok
for run in {1..10}; do echo ruok | nc -q 1 localhost 2181; done => imokimokimokimokimokimokimokimokimokimok
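The comparison above can be made easier to read by counting successful replies instead of eyeballing concatenated output. The following is a minimal sketch, not part of this PR: `count_imok` is a hypothetical helper name, and it assumes ZooKeeper listens on localhost:2181 as in the example. Any extra arguments are passed straight to nc, so `-q 1` only works with nc variants that support that option.

```shell
# Count how many of N "ruok" probes receive the expected "imok" reply.
# Usage: count_imok <attempts> [extra nc options...]
# count_imok is a hypothetical helper, not part of Pulsar or ZooKeeper.
count_imok() {
  attempts="$1"
  shift
  ok=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Extra options (for example: -q 1) are passed through to nc unchanged.
    reply="$(echo ruok | nc "$@" localhost 2181 2>/dev/null)"
    if [ "$reply" = "imok" ]; then
      ok=$((ok + 1))
    fi
    i=$((i + 1))
  done
  # Print "<successful replies>/<attempts>".
  echo "$ok/$attempts"
}
```

Running `count_imok 10` and `count_imok 10 -q 1` on an affected node should show the difference between the two invocations as two ratios.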
@frederic-kneier - thank you for the explanation and the example! That definitely makes sense. I agree with @lhotari in wondering if the value should be more than 1? We could defer on choosing the value and consider making it configurable, too.
@lhotari calling "echo ruok | nc -q -1 localhost 2181" does not solve the problem. It has to be "echo ruok | nc -q 1 localhost 2181".
@frederic-kneier ok, good to hear about that. I'm trying to understand the reason why it fixes the problem.
In this case, this explanation doesn't seem to hold. The apache/pulsar:2.8.2 image contains netcat-openbsd. Here are some parts of the man page for netcat-openbsd nc. It says "Some servers require this to finish their work". I wonder why this is the case for Zookeeper. It seems to happen only when Zookeeper is configured using |
lhotari
left a comment
This is an awesome finding, but let's hold back merging until we have an explanation why -q 1 solves the problem.
I found some explanation in https://stackoverflow.com/questions/4160347/close-vs-shutdown-socket/23483487
I made a similar change to the Apache Pulsar Helm chart: apache/pulsar-helm-chart#223 . Instead of relying on the |
…ecify "-q 1" for nc (#223)
- NOTICE: we are no more using "bin/pulsar-zookeeper-ruok.sh" from the apachepulsar/pulsar docker image. The probe script is part of the chart.
* Pass "-q 1" to netcat (nc) to fix issue with Zookeeper ruok probe
  - see apache/pulsar#14088
* Send ruok to TLS port when TLS is enabled
* Bump chart version
Fixes #11070
Motivation
Readiness and liveness probes in Kubernetes fail in some cases because the check does not wait for a response.
Modifications
The check script now waits for 1 second for a response.
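As a rough illustration of such a check, here is a minimal sketch. It is not the contents of bin/pulsar-zookeeper-ruok.sh: the function name `zk_ruok` and the variable `ZK_PROBE_WAIT_SECONDS` are hypothetical. The default of 1 matches the value used in this PR, and the variable reflects the review suggestion to consider making the wait configurable. It assumes an nc variant that supports `-q` and ZooKeeper listening on localhost:2181.

```shell
# Succeeds (exit status 0) only if ZooKeeper answers "ruok" with "imok".
# zk_ruok and ZK_PROBE_WAIT_SECONDS are hypothetical names for this sketch.
zk_ruok() {
  # Seconds nc keeps waiting after stdin is closed; defaults to 1.
  wait_seconds="${ZK_PROBE_WAIT_SECONDS:-1}"
  reply="$(echo ruok | nc -q "$wait_seconds" localhost 2181 2>/dev/null)"
  # An empty reply (nc exited before the server answered) counts as a failure.
  [ "$reply" = "imok" ]
}
```

A probe command would call `zk_ruok` and use its exit status, for example with `ZK_PROBE_WAIT_SECONDS=2` exported on nodes where one second turns out to be too short.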
Verifying this change
Since this problem is caused by a race condition, testing is a bit complicated.
Does this pull request potentially affect one of the following parts:
Documentation
Check the box below or label this PR directly (if you have committer privilege).
Need to update docs?
[x] no-need-doc
The actual intention of the script does not change.