KAFKA-10794 Replica leader election is too slow in the case of too many partitions by Montyleo · Pull Request #9675 · apache/kafka

Montyleo · 2020-12-02T08:47:53Z

There are more than 6000 topics and 300 brokers in my Kafka cluster, and we frequently run kafka-preferred-replica-election.sh to rebalance it. But the rebalance process takes too much time and CPU, as the picture below shows.

We found that the function `controllerContext.allPartitions` is invoked too many times.
The Jira link is https://issues.apache.org/jira/browse/KAFKA-10794

chia7712

@Montyleo nice finding. LGTM

Montyleo · 2020-12-02T11:57:49Z

> @Montyleo nice finding. LGTM

Thanks

Montyleo · 2020-12-02T12:02:08Z

@huxihx Please help me review the code, thanks.

chia7712 · 2020-12-02T12:06:27Z

@Montyleo Is the failed test related to this PR?

Montyleo · 2020-12-02T12:27:45Z

> @Montyleo Is the failed test related to this PR?

Hi chia7712,
Thanks for your reply. The failed test is about SaslAuthenticator. It is not related to this PR, nor even to the KafkaController component. It seems that the JDK 8 version is too low. I'll find the reason.

chia7712 · 2020-12-02T13:54:48Z

> I'll find the reason.

Is there an existing ticket? If not, could you file a Jira to log it? Also, you can assign the ticket to yourself (I have given you the permission) if you have free cycles to trace it.

I will merge this PR tomorrow if there is no objection.

Montyleo · 2020-12-02T14:58:42Z

> > I'll find the reason.
>
> Is there an existing ticket? If not, could you file a Jira to log it? Also, you can assign the ticket to yourself (I have given you the permission) if you have free cycles to trace it.
>
> I will merge this PR tomorrow if there is no objection.

OK, I have no objection. I have created a Jira to log it: https://issues.apache.org/jira/projects/KAFKA/issues/KAFKA-10797?filter=allissues. I'll trace it in my local environment.

chia7712 · 2020-12-03T02:38:46Z

@Montyleo Thanks for your contribution!

lqjack · 2020-12-03T11:11:00Z

@chia7712 Can the patch resolve the issue? The only difference I can find is whether controllerContext.allPartitions is invoked once or once per partition. Please correct me if I am wrong. Thanks.

chia7712 · 2020-12-03T17:02:40Z

@lqjack good question!

> The only difference I can find is whether controllerContext.allPartitions is invoked once or once per partition.

controllerContext.allPartitions does not return a constant value. It creates a new collection, and the overhead can be high if there are a lot of partitions. This PR makes controllerContext.allPartitions be called only once to reduce the cost of getting "all partitions".

> Can the patch resolve the issue?

@Montyleo It seems to me the optimization in this PR is good enough. However, it would be better to show the improvement this patch brings in your environment.
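
The point about `controllerContext.allPartitions` building a new collection on every call can be illustrated with a minimal sketch. This is not the Kafka controller code (which is Scala). The `ControllerContext` class, the call counter, and the two `split_known_*` helpers below are hypothetical names invented for illustration, and the assumption that the accessor is used for a membership check per requested partition is ours.

```python
from typing import Dict, Iterable, List, Set, Tuple

TopicPartition = Tuple[str, int]  # (topic name, partition index)


class ControllerContext:
    """Hypothetical stand-in for the controller's partition bookkeeping."""

    def __init__(self, partitions_by_topic: Dict[str, List[int]]) -> None:
        self._partitions_by_topic = partitions_by_topic
        self.all_partitions_calls = 0  # instrumentation for this demo only

    def all_partitions(self) -> Set[TopicPartition]:
        # Builds a brand-new set on every call, so each call costs O(P)
        # for P partitions. Nothing is cached.
        self.all_partitions_calls += 1
        return {
            (topic, partition)
            for topic, partitions in self._partitions_by_topic.items()
            for partition in partitions
        }


def split_known_per_partition(
    ctx: ControllerContext, requested: Iterable[TopicPartition]
) -> Tuple[List[TopicPartition], List[TopicPartition]]:
    """Rebuilds the full set once per requested partition."""
    known: List[TopicPartition] = []
    unknown: List[TopicPartition] = []
    for tp in requested:
        (known if tp in ctx.all_partitions() else unknown).append(tp)
    return known, unknown


def split_known_hoisted(
    ctx: ControllerContext, requested: Iterable[TopicPartition]
) -> Tuple[List[TopicPartition], List[TopicPartition]]:
    """Builds the full set once and reuses it for every lookup."""
    all_partitions = ctx.all_partitions()
    known: List[TopicPartition] = []
    unknown: List[TopicPartition] = []
    for tp in requested:
        (known if tp in all_partitions else unknown).append(tp)
    return known, unknown
```

Both helpers return the same split. The difference is only in how often the set is rebuilt: once per requested partition in the first, once in total in the second.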

Montyleo · 2020-12-18T08:25:26Z

> @chia7712 Can the patch resolve the issue? The only difference I can find is whether controllerContext.allPartitions is invoked once or once per partition. Please correct me if I am wrong. Thanks.

Hi lqjack,

Thanks for your question.
There is a saying that quantitative change leads to qualitative change.
When the function controllerContext.allPartitions is called too many times, the rebalance becomes too slow.
Here is the effect after the PR was published: 1.3 ms vs. 35541 ms.
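
A comparison of this kind can be roughly reproduced outside a real cluster with a small timing harness. The sketch below is only an illustration under stated assumptions: `build_all_partitions` and `count_known` are hypothetical helpers, the sizes are arbitrary and far smaller than the cluster described in this thread, and the timings it prints say nothing about the figures quoted above.

```python
import time
from typing import Set, Tuple

TopicPartition = Tuple[str, int]  # (topic name, partition index)


def build_all_partitions(num_topics: int, partitions_per_topic: int) -> Set[TopicPartition]:
    """Build a fresh set of every partition, mimicking a non-cached accessor."""
    return {
        (f"topic-{t}", p)
        for t in range(num_topics)
        for p in range(partitions_per_topic)
    }


def count_known(num_topics: int, partitions_per_topic: int, hoist: bool) -> Tuple[int, float]:
    """Return (number of known partitions, elapsed seconds) for one run."""
    requested = list(build_all_partitions(num_topics, partitions_per_topic))
    start = time.perf_counter()
    if hoist:
        # Build the set once and reuse it for every membership check.
        all_partitions = build_all_partitions(num_topics, partitions_per_topic)
        known = sum(1 for tp in requested if tp in all_partitions)
    else:
        # Rebuild the set for every membership check.
        known = sum(
            1
            for tp in requested
            if tp in build_all_partitions(num_topics, partitions_per_topic)
        )
    return known, time.perf_counter() - start


if __name__ == "__main__":
    # Arbitrary demo sizes: 200 topics with 5 partitions each.
    for hoist in (False, True):
        known, seconds = count_known(200, 5, hoist)
        print(f"hoist={hoist}: {known} known partitions in {seconds * 1000:.1f} ms")
```

With P partitions, the non-hoisted variant does on the order of P × P work while the hoisted one does on the order of P, so the gap widens quickly as the partition count grows.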

… slow in the case of too many partitions (apache#9675) Co-authored-by: limengmonty <limengmonty@didichuxing.com> Reviewers: Chia-Ping Tsai <chia7712@gmail.com>

fix Replica leader election is too slow in the case of too many partitions

33cc082

chia7712 approved these changes Dec 2, 2020

chia7712 changed the title ~~fix Replica leader election is too slow in the case of too many parti…~~ KAFKA-10794 Replica leader election is too slow in the case of too many partitions Dec 2, 2020

Montyleo requested a review from chia7712 December 2, 2020 11:53

chia7712 approved these changes Dec 2, 2020

chia7712 merged commit 10b0757 into apache:trunk Dec 3, 2020

chia7712 merged 1 commit into apache:trunk from Montyleo:replica-reblace-improve


3 participants
