KAFKA-15804: Close SocketServer channels when calling shutdown before enableRequestProcessing #14729

Merged
gharris1727 merged 5 commits into apache:trunk from gharris1727:kafka-15804-socketserver-shutdown-leak
May 10, 2024

Contributor

@gharris1727 gharris1727 commented Nov 10, 2023

The KafkaServer startup() method instantiates a SocketServer and only later calls SocketServer enableRequestProcessing. If an exception is thrown in the intervening time, enableRequestProcessing is never called, and the KafkaServer skips straight to calling SocketServer shutdown() instead.

In this situation, the current SocketServer Acceptor and Processor implementations do not close their sockets. This causes the sockets to be leaked, potentially interfering with other tests. This change makes calling close() without first calling enableRequestProcessing perform the cleanup that the acceptor and processor threads would have performed.
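
A minimal sketch of the idea, not the actual SocketServer code: SketchAcceptor, shouldRun, and the sleep-based loop are hypothetical stand-ins, and the sketch assumes start() and close() are never called concurrently.

```scala
import java.nio.channels.ServerSocketChannel
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical stand-in for an acceptor; names are illustrative only.
class SketchAcceptor {
  // In this sketch the channel is opened at construction time, before any thread runs.
  val serverChannel: ServerSocketChannel = ServerSocketChannel.open()

  private val shouldRun = new AtomicBoolean(true)
  private val started = new AtomicBoolean(false)

  private val thread = new Thread(new Runnable {
    override def run(): Unit = {
      try {
        while (shouldRun.get()) Thread.sleep(10) // placeholder for the accept loop
      } finally {
        closeAll() // normal path: the thread cleans up after itself
      }
    }
  }, "sketch-acceptor")

  def start(): Unit = {
    started.set(true)
    thread.start()
  }

  private def closeAll(): Unit = serverChannel.close()

  def close(): Unit = {
    shouldRun.set(false)
    thread.join() // returns immediately if the thread was never started
    if (!started.get()) {
      closeAll() // failed-startup path: no thread exists to close the channel
    }
  }
}
```

Calling close() on an instance that was never started closes the channel directly. Calling it after start() leaves the cleanup to the thread's finally block.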

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

… enableRequestProcessors

Signed-off-by: Greg Harris <greg.harris@aiven.io>
Contributor

@ex172000 ex172000 left a comment

Do we want to add some tests?

def close(): Unit = {
beginShutdown()
thread.join()
if (!started) {
Contributor

Do we want to pass started as a parameter to the method to make it look cleaner?

Contributor Author

Hey @ex172000 thanks for the review! Do you mean as a parameter to closeAll?

I don't think that's necessary, as started is an instance variable. Also, I didn't add the condition to the finally block because it should always be true at that point in the code, so it's only present on this code path, where the thread has not been started yet.

Contributor

Is a race possible if close is invoked concurrently with start?

If so, I think one possible solution (though slightly hacky) could be to set started = true before kicking off the thread that would be responsible for cleanup, instead of after. Of course synchronization is also a possibility, but I'm not sure on the implications for blocking or even deadlock.

Contributor

(Similar concern with Processor::start / Processor::close)
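
For illustration, the synchronization option raised in this thread could look roughly like the sketch below. LifecycleGuard and its members are hypothetical and are not taken from this PR. The sketch assumes cleanup is safe to run from either the worker thread or the caller of close(), and it joins the worker outside the lock so that close() never blocks while holding it.

```scala
// Hypothetical lifecycle guard; not code from this PR.
class LifecycleGuard(cleanup: () => Unit) {
  private var started = false
  private var closed = false

  private val worker = new Thread(new Runnable {
    // Placeholder for the real work loop; cleanup always runs when the thread exits.
    override def run(): Unit = try { /* work would go here */ } finally cleanup()
  })

  // Returns false if close() already ran, in which case the thread is never started.
  def start(): Boolean = synchronized {
    if (closed) false
    else {
      started = true
      worker.start()
      true
    }
  }

  def close(): Unit = {
    // Decide under the lock who owns the cleanup, then act outside the lock.
    val mustCleanUpHere = synchronized {
      val firstClose = !closed
      closed = true
      firstClose && !started
    }
    if (mustCleanUpHere) cleanup() // never started: clean up on the caller's thread
    else worker.join()             // started: the worker's finally block cleans up
  }
}
```

Because start() and the decision in close() share one lock, a concurrent close() either observes started as false and prevents any later start, or observes it as true and waits for the worker.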

…losed

Signed-off-by: Greg Harris <greg.harris@aiven.io>
@gharris1727 gharris1727 requested a review from ex172000 November 10, 2023 17:38
@hudeqi hudeqi requested review from ex172000 and hudeqi and removed request for ex172000 November 13, 2023 02:44
@gharris1727 gharris1727 added the core Kafka Broker label Nov 14, 2023
Contributor

@hudeqi hudeqi left a comment

Hi @gharris1727, this PR change makes sense to me. I left a comment. Thanks.

Comment thread core/src/main/scala/kafka/network/SocketServer.scala Outdated
@gharris1727
Contributor Author

Hey @cmccabe @hachikuji Could you take a look at this?

@gharris1727
Contributor Author

@chia7712 @showuon @mimaison Are any of you able to review this resource leak fix? Thanks!

Contributor

@C0urante C0urante left a comment

Thanks Greg. I'm not terribly familiar with this neck of the woods but this does seem like a case of improper resource cleanup on failed startup, and the fix does appear to be (modulo some small concurrency concerns) sound.

Signed-off-by: Greg Harris <greg.harris@aiven.io>
@gharris1727 gharris1727 requested a review from C0urante January 31, 2024 23:43
@gharris1727
Contributor Author

Hi @C0urante could you take another pass on this?

Contributor

@C0urante C0urante left a comment

Thanks Greg, LGTM!

It'd be nice to get someone more familiar with core to take a look at this, but I don't consider it a requisite for merging.

@gharris1727
Contributor Author

gharris1727 commented May 10, 2024

I noticed some instability in the SocketServerTest suite locally, but it doesn't appear to be introduced by this change. It appears on trunk (and 3.7, 3.6, 3.5) and coincides with JDK >= 17. I opened a ticket for it here: https://issues.apache.org/jira/browse/KAFKA-16701

I ran this test suite locally with JDK 11 and got consistent passes, and it passes in CI. The other failures in CI look unrelated, and pass locally.

I think I'm comfortable merging this PR at this time.

@gharris1727 gharris1727 merged commit 4e4f7d3 into apache:trunk May 10, 2024
gongxuanzhang pushed a commit to gongxuanzhang/kafka that referenced this pull request Jun 12, 2024
… enableRequestProcessing (apache#14729)

Signed-off-by: Greg Harris <greg.harris@aiven.io>
Reviewers: Chris Egerton <chrise@aiven.io>, hudeqi <1217150961@qq.com>, Qichao Chu <qichao@uber.com>