Skip to content

KAFKA-7576: Fix shutdown of replica fetcher threads#5875

Merged
hachikuji merged 6 commits intoapache:trunkfrom
rajinisivaram:KAFKA-7576-replica-fetcher-update
Nov 16, 2018
Merged

KAFKA-7576: Fix shutdown of replica fetcher threads#5875
hachikuji merged 6 commits intoapache:trunkfrom
rajinisivaram:KAFKA-7576-replica-fetcher-update

Conversation

@rajinisivaram
Copy link
Copy Markdown
Contributor

ReplicaFetcherThread.shutdown attempts to close the fetcher's Selector while the thread is running. This in unsafe and can result in Selector.close() failing with an exception. The exception is caught and logged at debug level, but this can lead to socket leak if the shutdown is due to dynamic config update rather than broker shutdown. This PR changes the shutdown logic to close Selector after the replica fetcher thread is shutdown, with a wakeup() and flag to terminate blocking sends first.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Nov 3, 2018

Is the bug being fixed here a regression?

@rajinisivaram rajinisivaram force-pushed the KAFKA-7576-replica-fetcher-update branch from 6bea963 to dabb207 Compare November 3, 2018 11:05
@rajinisivaram
Copy link
Copy Markdown
Contributor Author

@ijuma It is not exactly a regression since the bug was originally introduced under KAFKA-6051 which was in 1.1, the same release as dynamic config updates. That can result in an exception being thrown from replica fetcher shutdown, during broker shutdown or when replica fetchers are reduced using dynamic config update. In 2.0.1 and 2.1.0, the exception propagation was fixed under KAFKA-7464, which catches and ignores the exception. That may be ok during shutdown, but would result in resource leakage with dynamic config update. So this PR is essentially redoing the changes under KAFKA-6051 and KAFKA-7464.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to have a way to initiate an orderly shutdown of KafkaClient? It seems a bit more general.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ijuma Thanks for the review. I have updated with a new method in KafkaClient to initiate shutdown. Not sure if it matches what you had in mind. Let me know what you think.

@rajinisivaram rajinisivaram requested a review from ijuma November 9, 2018 09:53
@ijuma ijuma requested a review from hachikuji November 12, 2018 19:54
@ijuma
Copy link
Copy Markdown
Member

ijuma commented Nov 12, 2018

@hachikuji can you please review this one?

@hachikuji
Copy link
Copy Markdown
Contributor

retest this please

Copy link
Copy Markdown
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rajinisivaram. Looks good. Just had a few small comments.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there a strong reason to check aborted sends before this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, moved it to the start of the method.

Comment thread clients/src/main/java/org/apache/kafka/clients/NetworkClientUtils.java Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the "shutdown" at the end seems unintended?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind of icky. I wonder if we should consider adding a field to ShutdownableThread to save the failure exception?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree this is not nice. The first log entry comparison was added under KAFKA-7464. And I extended that to add another of the same type. I think we want the test to ensure that exceptions are not propagated to the caller and that we are closing the blockingSend instance. And that is tested by this subset of that test:

thread.initiateShutdown()
thread.awaitShutdown()
verify(mockBlockingSend)

We dont really need to verify the log entry at all for this. I suppose the log entry comparison checks that the test is correct by verifying that it did inject that exception, but not sure we really need to do that. It feels unnecessary to store away the exception in ShutdownableThread just to check this. What do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that makes sense to me.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hachikuji Thank you, updated.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess there may not be a non-intrusive way to accomplish this more reliably. Perhaps we could check the size of the message queue?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was being lazy here, updated the test to wait for the server to receive one byte.

@rajinisivaram rajinisivaram force-pushed the KAFKA-7576-replica-fetcher-update branch from 393120b to b4c0a9b Compare November 15, 2018 11:01
@rajinisivaram
Copy link
Copy Markdown
Contributor Author

@hachikuji Thanks for the review. I addressed some of the comments and left a couple of questions for the remaining.

Copy link
Copy Markdown
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks Rajini!

@hachikuji hachikuji merged commit 1a4d44f into apache:trunk Nov 16, 2018
hachikuji pushed a commit that referenced this pull request Nov 16, 2018
ReplicaFetcherThread.shutdown attempts to close the fetcher's Selector while the thread is running. This in unsafe and can result in `Selector.close()` failing with an exception. The exception is caught and logged at debug level, but this can lead to socket leak if the shutdown is due to dynamic config update rather than broker shutdown. This PR changes the shutdown logic to close Selector after the replica fetcher thread is shutdown, with a wakeup() and flag to terminate blocking sends first.

Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>
rajinisivaram added a commit that referenced this pull request Nov 16, 2018
ReplicaFetcherThread.shutdown attempts to close the fetcher's Selector while the thread is running. This in unsafe and can result in `Selector.close()` failing with an exception. The exception is caught and logged at debug level, but this can lead to socket leak if the shutdown is due to dynamic config update rather than broker shutdown. This PR changes the shutdown logic to close Selector after the replica fetcher thread is shutdown, with a wakeup() and flag to terminate blocking sends first.

Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>
rajinisivaram added a commit that referenced this pull request Nov 16, 2018
ReplicaFetcherThread.shutdown attempts to close the fetcher's Selector while the thread is running. This in unsafe and can result in `Selector.close()` failing with an exception. The exception is caught and logged at debug level, but this can lead to socket leak if the shutdown is due to dynamic config update rather than broker shutdown. This PR changes the shutdown logic to close Selector after the replica fetcher thread is shutdown, with a wakeup() and flag to terminate blocking sends first.

Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>
pengxiaolong pushed a commit to pengxiaolong/kafka that referenced this pull request Jun 14, 2019
ReplicaFetcherThread.shutdown attempts to close the fetcher's Selector while the thread is running. This in unsafe and can result in `Selector.close()` failing with an exception. The exception is caught and logged at debug level, but this can lead to socket leak if the shutdown is due to dynamic config update rather than broker shutdown. This PR changes the shutdown logic to close Selector after the replica fetcher thread is shutdown, with a wakeup() and flag to terminate blocking sends first.

Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants