KAFKA-7655 Metadata spamming requests from Kafka Streams under some circumstances, potential DOS#5929
Merged
mjsax merged 5 commits into apache:trunk on Dec 13, 2018
Conversation
mjsax (Member) reviewed and left a comment on Nov 19, 2018:
Thanks for the patch.
I think it would be great to get a system test for this, too. Not sure how easy it is to reproduce the issue with a system test.
mjsax reviewed on Nov 20, 2018.
guozhangwang (Contributor) approved these changes on Nov 21, 2018 and left a comment:
There's a related JIRA (https://issues.apache.org/jira/browse/KAFKA-6928) for removing the retry logic completely, but it has been delayed due to the admin client's own gap in its retry handling.
I've synced with @hachikuji, and I'd suggest we merge this PR as-is; I can work on KAFKA-6928 right away while keeping this fix in mind.
Member: Retest this please
Contributor: @mjsax @guozhangwang any reason why this hasn't been merged?
mjsax pushed a commit that referenced this pull request on Dec 13, 2018:
…ircumstances, potential DOS (#5929) Reviewers: Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <guozhang@confluent.io>
Member: Thanks for the patch, @Pasvaz!
pengxiaolong pushed a commit to pengxiaolong/kafka that referenced this pull request on Jun 14, 2019:
…ircumstances, potential DOS (apache#5929) Reviewers: Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <guozhang@confluent.io>
Re-validate, after a delay, that the topic either exists or is fully gone.
There is a bug in the InternalTopicManager that makes the client believe a topic exists even though it doesn't. It occurs mostly in the few seconds between when a topic is marked for deletion and when it is actually deleted. In that window the broker returns inconsistent information: first it hides the topic, but then it refuses to create a new one, so the client believes the topic already exists and starts polling for metadata.
The consequence is that the client enters a loop polling for topic metadata; when many threads do this, it can take down a small cluster or greatly degrade its performance.
The real-life scenario is probably a reset gone wrong. Reproducing the issue is fairly simple; these are the steps:
1. Stop a Kafka Streams application.
2. Delete one of its changelog topics and the local store.
3. Restart the application immediately after the topic deletion.
You will see the Kafka Streams application hang after bootstrap, logging something like: INFO Metadata - Cluster ID: xxxx
I am attaching a patch that fixes the issue client side, but in my opinion this should be tackled on the broker as well: metadata requests seem expensive, and it would be easy to craft a DoS that could take down an entire cluster in seconds just by flooding the brokers with metadata requests.
The patch kicks in only when a topic that did not exist on the first call to getNumPartitions triggers a TopicExistsException. When this happens, it forces re-validation of the topic, and if the topic still appears not to exist, it schedules a retry with a delay, giving the broker the time it needs to sort itself out.
I think this patch makes sense beyond the above-mentioned case of a non-existent topic: even if the topic was actually created, the client should not blindly trust that and should still re-validate it by checking the number of partitions. For example, a topic can be created automatically by the first request, in which case it would have the default number of partitions rather than the expected ones.
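The create-then-re-validate flow described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual InternalTopicManager code: the nested `TopicExistsException` stands in for `org.apache.kafka.common.errors.TopicExistsException`, the two suppliers stand in for real metadata and CreateTopics requests, and the delay and retry bound are assumed values, not the patch's.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch of the retry/re-validation logic; names and constants
// are illustrative assumptions, not taken from the actual patch.
public class TopicRevalidation {

    static class TopicExistsException extends RuntimeException { }

    static final long RETRY_DELAY_MS = 100L; // assumed back-off delay
    static final int MAX_RETRIES = 5;        // assumed retry bound

    /**
     * existsCheck returns the partition count if the topic is visible in
     * metadata, empty otherwise; creator attempts to create the topic and
     * throws TopicExistsException if the broker thinks it already exists.
     */
    static int ensureTopic(Supplier<Optional<Integer>> existsCheck, Runnable creator)
            throws InterruptedException {
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            Optional<Integer> partitions = existsCheck.get();
            if (partitions.isPresent()) {
                // Even after a successful create, trust only the metadata:
                // re-validate the partition count instead of assuming it.
                return partitions.get();
            }
            try {
                creator.run();
            } catch (TopicExistsException e) {
                // Metadata hides the topic but the broker refuses to create it:
                // likely pending deletion, so back off before retrying.
                Thread.sleep(RETRY_DELAY_MS);
            }
        }
        throw new IllegalStateException(
                "topic not available after " + MAX_RETRIES + " retries");
    }
}
```

In the pending-deletion window, `existsCheck` keeps returning empty while `creator` keeps throwing, so the loop backs off and retries until the broker finishes the deletion and the create (or lookup) finally succeeds.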
Committer Checklist (excluded from commit message)