Skip to content

ZOOKEEPER-4453: NettyServerCnxnFactory: allow to configure the early TLS connection drop feature#1799

Merged
eolivelli merged 1 commit intoapache:masterfrom
eolivelli:fix/ZOOKEEPER-4453
Jan 26, 2022
Merged

ZOOKEEPER-4453: NettyServerCnxnFactory: allow to configure the early TLS connection drop feature#1799
eolivelli merged 1 commit intoapache:masterfrom
eolivelli:fix/ZOOKEEPER-4453

Conversation

@eolivelli
Copy link
Copy Markdown
Contributor

see https://issues.apache.org/jira/browse/ZOOKEEPER-4453 for more context

…TLS connection drop feature

- add new flag netty.server.earlyDropSecureConnectionHandshakes to turn on/off ZOOKEEPER-3682
- disable ZOOKEEPER-3682 by default
- add docs
- add tests for this patch and for ZOOKEEPER-3682
@andrekramer1
Copy link
Copy Markdown

I saw a second zookeeper started in Kubernetes using this branch so the fix seems good.

@eolivelli
Copy link
Copy Markdown
Contributor Author

@andrekramer1 thank you for confirming.
are you able to test TLS as well ?

@eolivelli eolivelli requested a review from fpj January 26, 2022 11:16
@andrekramer1
Copy link
Copy Markdown

@eolivelli not able to test SSL and it was a one node Kubernetes test. It is similar to what I had with the new config false.

@eolivelli eolivelli self-assigned this Jan 26, 2022
Copy link
Copy Markdown
Contributor

@symat symat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice change and great to have the new tests!

throw new IOException("ZK down");
LOG.warn("Closing connection to {} because the server is not ready",
getRemoteSocketAddress());
close(DisconnectReason.IO_EXCEPTION);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I know we were throwing IOException before which resulted in DisconnectReason.IO_EXCEPTION, but now that we do proper close here, we might use some better DisconnectReason. I see e.g. DisconnectReason.SERVER_SHUTDOWN. Although I'm not sure when this part is triggered... I guess this can be triggered either during initialization or shutdown. Maybe DisconnectReason.IO_EXCEPTION is good enough.

Also: before this change we did log the 'getRemoteSocketAddress()' in line 534, which might be handy to add to this warning above. (if someone is trying to figure out in the ZK log why a session terminated)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have the same "ZK down" exception in line 477. Maybe the logic should be changed there as well?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the same behaviour as before.
I would like to not change the behaviour with this patch if it is not strictly needed.

before this change we did log the 'getRemoteSocketAddress()' in line 534, which might be handy to add to this warning above
we are still printing it. I cannot get this comment

}
}
} finally {
System.clearProperty(NettyServerCnxnFactory.EARLY_DROP_SECURE_CONNECTION_HANDSHAKES);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if you clear this in the finally block, then let's move the setProperty call into the try block. (now if the test killed just before the try block, the property won't be cleared. - very-very unlikely, but still...)

Alternatively I would be OK to put the clearProperty call to the afterEach() method and then you don't need the try-finally block.

Also just double-checking: we don't run multiple test classes (or methods in the same class) paralel in the same JVM, right? Let's make sure we avoid some flaky execution.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't run multiple test classes (or methods in the same class)
this is correct, and as ZK uses System properties it wont't be possible in the short term

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then let's move the setProperty call into the try block
@symat fixed

if (secure && !cnxn.isZKServerRunning()) {
boolean earlyDropSecureConnectionHandshakes = Boolean.getBoolean(EARLY_DROP_SECURE_CONNECTION_HANDSHAKES);
if (earlyDropSecureConnectionHandshakes) {
LOG.warn("Zookeeper server is not running, close the connection before starting the TLS handshake");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I know we had warning level before, but what do you think about INFO level instead? I like to have logs here, just not sure if this is really something the user should worry.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is exactly the same thing we printed before in the "catch" block below, but without spamming the logs with a stacktrace and with a meaning less message.

So I did this way in order to not change the behaviour too much but at least removing the stacktrace

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to INFO

effect due to TLS handshake timeout when there are too many in-flight TLS
handshakes. Set it to something like 250 is good enough to avoid herd effect.

* *netty.server.earlyDropSecureConnectionHandshakes*
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this to be false by default. However, AFACT this behaviour was enabled in 3.7.0 by default. Maybe we should mention it in the documentation. (and also in the release doc of 3.7.1 and 3.8.0 if we don't forget)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that "feature" was enabled in 3.6. this is way Pulsar and Pravega users are not able to upgrade ZK.

I can update the docs and explain the story.
I would like this patch to land to 3.6. 3.7 and 3.8

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@eolivelli eolivelli merged commit 2f49b17 into apache:master Jan 26, 2022
@eolivelli
Copy link
Copy Markdown
Contributor Author

I pushed this patch by mistake to master branch.
Sorry

@eolivelli
Copy link
Copy Markdown
Contributor Author

I have push forced the master branch. Sorry

@eolivelli
Copy link
Copy Markdown
Contributor Author

@symat reopened as #1800

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants