ZOOKEEPER-4453: NettyServerCnxnFactory: allow to configure the early TLS connection drop feature#1799
Conversation
…TLS connection drop feature - add new flag netty.server.earlyDropSecureConnectionHandshakes to turn on/off ZOOKEEPER-3682 - disable ZOOKEEPER-3682 by default - add docs - add tests for this patch and for ZOOKEEPER-3682
|
I saw a second zookeeper started in Kubernetes using this branch so the fix seems good. |
|
@andrekramer1 thank you for confirming. |
|
@eolivelli not able to test SSL and it was a one node Kubernetes test. It is similar to what I had with the new config false. |
symat
left a comment
There was a problem hiding this comment.
nice change and great to have the new tests!
| throw new IOException("ZK down"); | ||
| LOG.warn("Closing connection to {} because the server is not ready", | ||
| getRemoteSocketAddress()); | ||
| close(DisconnectReason.IO_EXCEPTION); |
There was a problem hiding this comment.
nit: I know we were throwing IOException before which resulted in DisconnectReason.IO_EXCEPTION, but now that we do proper close here, we might use some better DisconnectReason. I see e.g. DisconnectReason.SERVER_SHUTDOWN. Although I'm not sure when this part is triggered... I guess this can be triggered either during initialization or shutdown. Maybe DisconnectReason.IO_EXCEPTION is good enough.
Also: before this change we did log the 'getRemoteSocketAddress()' in line 534, which might be handy to add to this warning above. (if someone is trying to figure out in the ZK log why a session terminated)
There was a problem hiding this comment.
we have the same "ZK down" exception in line 477. Maybe the logic should be changed there as well?
There was a problem hiding this comment.
this is the same behaviour as before.
I would like to not change the behaviour with this patch if it is not strictly needed.
before this change we did log the 'getRemoteSocketAddress()' in line 534, which might be handy to add to this warning above
we are still printing it. I cannot get this comment
| } | ||
| } | ||
| } finally { | ||
| System.clearProperty(NettyServerCnxnFactory.EARLY_DROP_SECURE_CONNECTION_HANDSHAKES); |
There was a problem hiding this comment.
nit: if you clear this in the finally block, then let's move the setProperty call into the try block. (now if the test killed just before the try block, the property won't be cleared. - very-very unlikely, but still...)
Alternatively I would be OK to put the clearProperty call to the afterEach() method and then you don't need the try-finally block.
Also just double-checking: we don't run multiple test classes (or methods in the same class) paralel in the same JVM, right? Let's make sure we avoid some flaky execution.
There was a problem hiding this comment.
we don't run multiple test classes (or methods in the same class)
this is correct, and as ZK uses System properties it wont't be possible in the short term
There was a problem hiding this comment.
then let's move the setProperty call into the try block
@symat fixed
| if (secure && !cnxn.isZKServerRunning()) { | ||
| boolean earlyDropSecureConnectionHandshakes = Boolean.getBoolean(EARLY_DROP_SECURE_CONNECTION_HANDSHAKES); | ||
| if (earlyDropSecureConnectionHandshakes) { | ||
| LOG.warn("Zookeeper server is not running, close the connection before starting the TLS handshake"); |
There was a problem hiding this comment.
nit: I know we had warning level before, but what do you think about INFO level instead? I like to have logs here, just not sure if this is really something the user should worry.
There was a problem hiding this comment.
This is exactly the same thing we printed before in the "catch" block below, but without spamming the logs with a stacktrace and with a meaning less message.
So I did this way in order to not change the behaviour too much but at least removing the stacktrace
| effect due to TLS handshake timeout when there are too many in-flight TLS | ||
| handshakes. Set it to something like 250 is good enough to avoid herd effect. | ||
|
|
||
| * *netty.server.earlyDropSecureConnectionHandshakes* |
There was a problem hiding this comment.
I like this to be false by default. However, AFACT this behaviour was enabled in 3.7.0 by default. Maybe we should mention it in the documentation. (and also in the release doc of 3.7.1 and 3.8.0 if we don't forget)
There was a problem hiding this comment.
that "feature" was enabled in 3.6. this is way Pulsar and Pravega users are not able to upgrade ZK.
I can update the docs and explain the story.
I would like this patch to land to 3.6. 3.7 and 3.8
|
I pushed this patch by mistake to master branch. |
|
I have push forced the master branch. Sorry |
see https://issues.apache.org/jira/browse/ZOOKEEPER-4453 for more context