Skip to content

KAFKA-10794 Replica leader election is too slow in the case of too many partitions#9675

Merged
chia7712 merged 1 commit intoapache:trunkfrom
Montyleo:replica-reblace-improve
Dec 3, 2020
Merged

KAFKA-10794 Replica leader election is too slow in the case of too many partitions#9675
chia7712 merged 1 commit intoapache:trunkfrom
Montyleo:replica-reblace-improve

Conversation

@Montyleo
Copy link
Copy Markdown
Contributor

@Montyleo Montyleo commented Dec 2, 2020

There is more than 6000 topics and 300 brokers in my kafka cluster, and we frequently run kafka-preferred-replica-election.sh to rebalance our cluster. But the reblance process spendes too more time and cpu resource like the picture blow.

We find that the function:'controllerContext.allPartitions' is invoked too many times.
Thr jira link is https://issues.apache.org/jira/browse/KAFKA-10794

Copy link
Copy Markdown
Member

@chia7712 chia7712 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Montyleo nice finding. LGTM

@chia7712 chia7712 changed the title fix Replica leader election is too slow in the case of too many parti… KAFKA-10794 Replica leader election is too slow in the case of too many partitions Dec 2, 2020
@Montyleo Montyleo requested a review from chia7712 December 2, 2020 11:53
@Montyleo
Copy link
Copy Markdown
Contributor Author

Montyleo commented Dec 2, 2020

@Montyleo nice finding. LGTM

Thanks

@Montyleo
Copy link
Copy Markdown
Contributor Author

Montyleo commented Dec 2, 2020

@huxihx Please help me review the code, thanks.

@chia7712
Copy link
Copy Markdown
Member

chia7712 commented Dec 2, 2020

@Montyleo Is the failed test related to this PR?

@Montyleo
Copy link
Copy Markdown
Contributor Author

Montyleo commented Dec 2, 2020

@Montyleo Is the failed test related to this PR?
Hi, chia7712
Thanks for your reply. The failed test is about SaslAuthenticator, not related to this PR, even not related the component:kafkacontroller. It seems that the jdk 8 version is too low. I'll find the reason.

@chia7712
Copy link
Copy Markdown
Member

chia7712 commented Dec 2, 2020

I'll find the reason.

Is there a existent ticket? If not, could you file a jira to log it? Also, you can assign the ticket to yourself ( I have given the permission to you) if you have free cycle to trace it.

I will merge this PR tomorrow if no objection.

@Montyleo
Copy link
Copy Markdown
Contributor Author

Montyleo commented Dec 2, 2020

I'll find the reason.

Is there a existent ticket? If not, could you file a jira to log it? Also, you can assign the ticket to yourself ( I have given the permission to you) if you have free cycle to trace it.

I will merge this PR tomorrow if no objection.

Ok,I have no objection. I have created a jira to log it, [https://issues.apache.org/jira/projects/KAFKA/issues/KAFKA-10797?filter=allissues]. I'll trace it in my local environment.

@chia7712 chia7712 merged commit 10b0757 into apache:trunk Dec 3, 2020
@chia7712
Copy link
Copy Markdown
Member

chia7712 commented Dec 3, 2020

@Montyleo Thanks for your contribution!

@lqjack
Copy link
Copy Markdown
Contributor

lqjack commented Dec 3, 2020

@chia7712 does the patch can resolve the issue ? I find the only differences is that controllerContext.allPartitions can be invoked once or the number of partition times . please correct me if I am wrong. thanks.

@chia7712
Copy link
Copy Markdown
Member

chia7712 commented Dec 3, 2020

@lqjack good question!

I find the only differences is that controllerContext.allPartitions can be invoked once or the number of partition times .

controllerContext.allPartitions does not return a constant value. It create a new collection and the overhead could be high if there are a lot of partitions. This PR makes controllerContext.allPartitions be called only once to reduce the cost of getting "all partitions".

does the patch can resolve the issue ?

@Montyleo It seems to me the optimization of this PR is good enough. However, it would be better to show the improvement on your env by this patch.

ijuma added a commit to ijuma/kafka that referenced this pull request Dec 3, 2020
…t-for-generated-requests

* apache-github/trunk:
MINOR: Fix flaky test shouldQueryOnlyActivePartitionStoresByDefault
(apache#9681)
  KAFKA-10799 AlterIsr utilizes ReplicaManager ISR metrics (apache#9677)
  MINOR: Fix KTable-KTable foreign-key join example (apache#9683)
KAFKA-10473: Add docs on partition size-on-disk, and other log-related
metrics (apache#9276)
  KAFKA-10739; Replace EpochEndOffset with automated protocol (apache#9630)
KAFKA-10460: ReplicaListValidator format checking is incomplete
(apache#9326)
KAFKA-10554; Perform follower truncation based on diverging epochs in
Fetch response (apache#9382)
  MINOR: Align the UID inside/outside container (apache#9652)
KAFKA-10794 Replica leader election is too slow in the case of too
many partitions (apache#9675)
KAFKA-10090 Misleading warnings: The configuration was supplied but i…
(apache#8826)

clients/src/main/java/org/apache/kafka/common/requests/OffsetsForLeaderEpochResponse.java
clients/src/test/java/org/apache/kafka/clients/consumer/internals/FetcherTest.java
core/src/test/scala/unit/kafka/server/epoch/util/ReplicaFetcherMockBlockingSend.scala
@Montyleo
Copy link
Copy Markdown
Contributor Author

@chia7712 does the patch can resolve the issue ? I find the only differences is that controllerContext.allPartitions can be invoked once or the number of partition times . please correct me if I am wrong. thanks.

Hi,lqjack

Thanks for your question.
There is a saying that: quantitative change leads to qualitative change.
when the function controllerContext.allPartitions was called too many time, the rebalance will become too slow.
I'll show you the effect after the PR published,1.3ms VS 35541ms

clipboard_image_1608279734070

image

gitlw pushed a commit to linkedin/kafka that referenced this pull request Jan 28, 2021
… slow in the case of too many partitions (apache#9675)

Co-authored-by: limengmonty <limengmonty@didichuxing.com>
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants