KIP-368: Allow SASL Connections to Periodically Re-Authenticate#5582
KIP-368: Allow SASL Connections to Periodically Re-Authenticate#5582rajinisivaram merged 12 commits intoapache:trunkfrom
Conversation
c306215 to
35abf63
Compare
|
Beginning to add tests, and I need to rebase to get |
35abf63 to
3c38f3f
Compare
|
Rebasing again to resolve conflicts with |
3c38f3f to
ad16bf3
Compare
fddae30 to
4fa70f3
Compare
02d9f0d to
b59ba93
Compare
|
@rajinisivaram I rebased and pushed a squashed commit that implements your approach except it triggers a re-auth upon a write as described by the KIP rather than upon a read (which is what your prototype used). I think triggering upon a write is correct. I'll add metrics and tests soon. Hopefully we will get the 3rd binding up-vote in time for the 2.1.0 deadline today. Thanks again for that prototype -- I could not have gotten there myself in a reasonable amount of time because the low-level network code is pretty tough to pick up. |
|
@rajinisivaram This PR now has everything except:
I added re-authentication to the I will work on the remaining 3 items on Wednesday. Can you review what is here already, or would you prefer to wait? Ron |
|
@rondagostino I will review the PR tonight or tomorrow. There is a failing test iin the PR builds ( |
|
@rajinisivaram It is a test issue. I cannot reproduce the error locally -- it passes on my laptop. I pushed a commit to sleep a bit longer in the hope that it will fix the issue during the build. |
|
@rondagostino Thanks, I have started a system test run on your branch: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/1979/. Will review the PR later tonight. |
|
@rajinisivaram Test still fails here but cannot reproduce the issue locally. Pushed another commit to try to fix it... |
rajinisivaram
left a comment
There was a problem hiding this comment.
@rondagostino Thanks for the PR. I had a quick look and left some comments. Will do a more thorough review tomorrow (it was a much larger PR that I expected!)
|
@rajinisivaram Pushed a commit that I think fixes/resolves most of the issues you raised. It's getting late here so I couldn't get to all of them, but I think we are in a better place at this point. |
|
@rajinisivaram The test failures are due to the tests being flaky rather than a coding problem in the feature implementation. I may have been able to solve the delegation token-related test that was failing (it didn't fail in these runs) by increasing the session lifetime in that test (from 50 ms to 500 ms). |
|
@rajinisivaram I just pushed a commit to add an ExpiredSessionsKilledCount metric. The KIP called for expired-connections-killed-total as the metric name but ExpiredSessionsKilledCount is consistent with the existing metric RequestHandlerAvgIdlePercent. Thoughts? |
|
@rondagostino Sorry, I haven't had any time today to look into this PR. Given the complexity of the PR and how close it is to feature freeze, it would be very helpful if we could get another reviewer. I will try and take another look tomorrow, but not sure if we will have enough time to get this merged without help from one more reviewer. If it didn't meet feature freeze deadline, would it work for you if it was committed only to trunk and not the 2.1.0 branch? |
|
@rajinisivaram We would really like this in 2.1.0 if at all possible. Is there anything I can do to help? Maybe find another reviewer, perhaps someone else who has voted for the KIP? |
|
@rondagostino Yes, getting another reviewer would definitely speed up the process. |
|
@rajinisivaram There are a couple of changes that are not necessary as part of this PR -- more like cleanup of things I discovered while working on the alternate implementations that we considered and rejected. I could remove those changes and submit them as minor tickets post-2.1.0 if it helps to minimize the surface area requiring review. |
|
@rajinisivaram I discovered that many of the failing tests were due to the fact that GSSAPI was not re-authenticating because it uses the CLIENT_COMPLETE state on the client and I wasn't handling that possibility correctly. This meant its connections were continually being killed by the server. This is now fixed. I also moved the expiration check from KafkaRequestHandler to SocketServer so that the check is done more upstream and therefore with as little delay as possible; I think that delay sometimes has an impact when we run with really tight expiration times on the order of milliseconds. I'll be curious to see how the tests look now. |
|
@rajinisivaram Both builds passed. I'm going to work on adding the last metrics, and if I get that done tonight then I believe this will be complete. |
|
@rajinisivaram I was able to add the additional metrics. This PR now has all functionality, but I could not figure out how to test server-side expired connection kill or its metric since I have no way of running an older client. I know it works because it was the cause of the test failures in |
|
@rondagostino In terms of testing, I think system test may be the way to go for testing with older clients since we already have other system tests for testing with older clients. If this goes in by Monday, you have until code freeze two weeks later to add the system test to 2.1.0. |
|
@rajinisivaram Ok. Last set of builds was clean again. I'm confident this is working as expected and is able to be merged assuming a review cycle can be completed. |
|
@rajinisivaram Metric documentation is added, this is now done except for any additional PR review required. All conversations from above are resolved/fixed. I've contacted 5 people (4 of whom up-voted the KIP) in the hope that at least one will be able to review. |
|
@rajinisivaram I reverted 9 files and eliminated 500 changed lines from this PR. I will submit minor tickets for these changes separately. These changes perform cleanup of code that was discovered while implementing rejected alternatives. These changes are minor and can be submitted separately at a later date. They have no impact on functionality or performance and are simply reducing technical debt. Keeping these separate reduces the surface area of this PR. |
rajinisivaram
left a comment
There was a problem hiding this comment.
@rondagostino Thanks for the PR. I made one pass over it and left some comments, mostly minor. But there are some significant issues that need addressing too:
- As mentioned in one of the comments, I think brokers need to be able to limit the number of re-authentications accepted from clients to avoid being over-loaded since this is not covered by quotas.
- We need to ensure that clients cannot re-authenticate with a different principal. Otherwise, if there is ever a vulnerability found in the re-authentication logic, the impact would be much worse.
- We need to ensure that clients are only allowed to re-authenticate using the same SASL mechanism as the original authentication.
I think we need more tests to cover these scenarios and also more around interoperability with the all the versions (pre-SASL_HANDSHAKE, SASL_HANDSHAKE v0 and v1, SASL_AUTHENTICATE v0 and v1). May be easier to do with system tests, but we do need them all.
|
Ok, I see #5806 is where this will be fixed. |
7028f11 to
16a2480
Compare
|
@rajinisivaram Any progress on another review? Would like to merge ASAP so it gets as many builds and test runs as possible prior to the next release. |
|
@rondagostino Sorry, haven't had time to review yet, will try and do it today or tomorrow. |
rajinisivaram
left a comment
There was a problem hiding this comment.
@rondagostino Thanks for the updates, it is looking good. Left some minor comments.
| public boolean maybeBeginServerReauthentication(NetworkReceive saslHandshakeNetworkReceive, | ||
| Supplier<Long> nowNanosSupplier) throws AuthenticationException, IOException { | ||
| if (!ready()) | ||
| throw new IllegalStateException("KafkaChannel should always be \"ready\" upon receiving SASL Handshake"); |
There was a problem hiding this comment.
Reword exception message since channel is not ready on receiving first handshake?
There was a problem hiding this comment.
Here's what I changed it to. Not sure if this is what you were getting at. Let me know if there is something else it should say. I also fixed the Javadoc -- it was still referring to "not muted" when we got rid of that check based on a review comment.
if (!ready())
throw new IllegalStateException(
"KafkaChannel should be \"ready\" when processing SASL Handshake for potential re-authentication");
|
@rajinisivaram Latest comments are addressed/pushed. The one issue that might need more attention is the |
rajinisivaram
left a comment
There was a problem hiding this comment.
@rondagostino Thanks for the updates. Responded to a few of the remaining comments (minor changes). I think I will merge once these are addressed and if there are any other changes, we can make them in follow-on PRs. I think the test failures in the last build are unrelated. Can you just confirm that?
|
@rondagostino my recent merge #5684 created some conflicts with this PR. Please rebase the PR. |
The adoption of KIP-255: OAuth Authentication via SASL/OAUTHBEARER in release 2.0.0 creates the possibility of using information in the bearer token to make authorization decisions. Unfortunately, however, Kafka connections are long-lived, so there is no ability to change the bearer token associated with a particular connection. Allowing SASL connections to periodically re-authenticate would resolve this. In addition to this motivation there are two others that are security-related. First, to eliminate access to Kafka for connected clients, the current requirement is to remove all authorizations (i.e. remove all ACLs). This is necessary because of the long-lived nature of the connections. It is operationally simpler to shut off access at the point of authentication, and with the release of KIP-86: Configurable SASL Callback Handlers it is going to become more and more likely that installations will authenticate users against external directories (e.g. via LDAP). The ability to stop Kafka access by simply disabling an account in an LDAP directory (for example) is desirable. The second motivating factor for re-authentication related to security is that the use of short-lived tokens is a common OAuth security recommendation, but issuing a short-lived token to a Kafka client (or a broker when OAUTHBEARER is the inter-broker protocol) currently has no benefit because once a client is connected to a broker the client is never challenged again and the connection may remain intact beyond the token expiration time (and may remain intact indefinitely under perfect circumstances). This KIP proposes adding the ability for clients (and brokers when OAUTHBEARER is the inter-broker protocol) to re-authenticate their connections to brokers and for brokers to close connections that continue to use expired sessions. Signed-off-by: Ron Dagostino <rndgstn@gmail.com>
|
@rajinisivaram Made two suggested changes, both builds were clean. Rebasing onto latest trunk now to resolve conflicts. |
92e7925 to
9c30b80
Compare
|
@rajinisivaram After rebase JDK8 build was clean, JDK 11 build has 2 failures but they seem unrelated/transient issues. |
|
@rondagostino Thanks for the updates. I think we are good to go. I will take a look later today and merge to trunk (if there are any other changes required, we can do them in follow-on PRs). |
rajinisivaram
left a comment
There was a problem hiding this comment.
@rondagostino Thanks for the PR, LGTM! Merging to trunk.
…IP-368) (apache#5582) KIP-368 implementation to enable periodic re-authentication of SASL clients. Also adds a broker configuration option to terminate client connections that do not re-authenticate within the configured interval.
The adoption of KIP-255: OAuth Authentication via
SASL/OAUTHBEARER in release 2.0.0 creates the possibility of
using information in the bearer token to make authorization
decisions. Unfortunately, however, Kafka connections are
long-lived, so there is no ability to change the bearer token
associated with a particular connection. Allowing SASL
connections to periodically re-authenticate would resolve this.
In addition to this motivation there are two others that are
security-related. First, to eliminate access to Kafka for
connected clients, the current requirement is to remove all
authorizations (i.e. remove all ACLs). This is necessary because
of the long-lived nature of the connections. It is operationally
simpler to shut off access at the point of authentication, and
with the release of KIP-86: Configurable SASL Callback Handlers
it is going to become more and more likely that installations
will authenticate users against external directories (e.g. via
LDAP). The ability to stop Kafka access by simply disabling an
account in an LDAP directory (for example) is desirable. The
second motivating factor for re-authentication related to
security is that the use of short-lived tokens is a common OAuth
security recommendation, but issuing a short-lived token to a
Kafka client (or a broker when OAUTHBEARER is the inter-broker
protocol) currently has no benefit because once a client is
connected to a broker the client is never challenged again and
the connection may remain intact beyond the token expiration time
(and may remain intact indefinitely under perfect circumstances).
This KIP proposes adding the ability for clients (and brokers
when OAUTHBEARER is the inter-broker protocol) to re-authenticate
their connections to brokers and for brokers to close connections
that continue to use expired sessions.
Signed-off-by: Ron Dagostino rndgstn@gmail.com