KAFKA-7610; Proactively timeout new group members if rebalance is delayed by hachikuji · Pull Request #5962 · apache/kafka

hachikuji · 2018-11-28T18:56:25Z

When a consumer first joins a group, it doesn't have an assigned memberId. If the rebalance is delayed for some reason, the client may disconnect after a request timeout and retry. Since the client never received its memberId, we have no way to detect the retry and expire the previously generated member id. This can lead to unbounded growth in the size of the group until the rebalance has completed.

This patch fixes the problem by proactively completing all JoinGroup requests for new members after a timeout of 5 minutes. If the client is still around, we expect it to retry.
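The growth described above, and the effect of the timeout, can be illustrated with a small sketch. This is not the coordinator's code: the class `PendingGroup`, its methods, and the caller-supplied clock are hypothetical names introduced here for illustration, and "completing" a request is reduced to dropping the pending member. Only the 5 minute value comes from this description.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical model of a group stuck in a delayed rebalance. Not Kafka's GroupCoordinator.
class PendingGroup {
    // 5 minutes, as stated in the PR description.
    static final long NEW_MEMBER_JOIN_TIMEOUT_MS = 5 * 60 * 1000L;

    // Generated member id -> time at which its JoinGroup request arrived.
    private final Map<String, Long> pendingNewMembers = new HashMap<>();
    private int nextId = 0;

    // A JoinGroup without a member id always creates a new member, because a
    // retry cannot be told apart from a genuinely new client.
    String joinWithoutMemberId(long nowMs) {
        String memberId = "member-" + nextId++;
        pendingNewMembers.put(memberId, nowMs);
        return memberId;
    }

    // Drop every pending new member that has waited at least the timeout.
    // Returns how many members were expired.
    int expireNewMembers(long nowMs) {
        int expired = 0;
        Iterator<Map.Entry<String, Long>> it = pendingNewMembers.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMs - it.next().getValue() >= NEW_MEMBER_JOIN_TIMEOUT_MS) {
                it.remove();
                expired++;
            }
        }
        return expired;
    }

    int size() {
        return pendingNewMembers.size();
    }
}
```

With a single client retrying every 30 seconds (an assumed interval) for 20 minutes, the group accumulates 40 defunct members if `expireNewMembers` is never called, and stays at 10 or fewer if it is called on each retry.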

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

stanislavkozlovski · 2018-11-29T10:41:48Z

This is scheduled by an executor, right? I just want to make sure this test can't be flaky

I think tasks run by MockTimer just run in the foreground. Execution should be deterministic since we rely on MockTime under the covers.
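To make the point about determinism concrete, here is a minimal sketch of a timer whose tasks run in the foreground. It is not Kafka's MockTimer: `ManualTimer` and its methods are hypothetical names, and the only idea it illustrates is that scheduled tasks run on the calling thread when the test advances the clock, so there is no background executor to race with.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical stand-in for a mock timer; not Kafka's MockTimer.
class ManualTimer {
    private static final class Task {
        final long dueMs;
        final long seq;
        final Runnable action;

        Task(long dueMs, long seq, Runnable action) {
            this.dueMs = dueMs;
            this.seq = seq;
            this.action = action;
        }
    }

    // Ordered by due time, then by scheduling order, so ties are deterministic.
    private final PriorityQueue<Task> tasks = new PriorityQueue<>(
        Comparator.comparingLong((Task t) -> t.dueMs).thenComparingLong(t -> t.seq));
    private long nowMs = 0;
    private long nextSeq = 0;

    void schedule(long delayMs, Runnable action) {
        tasks.add(new Task(nowMs + delayMs, nextSeq++, action));
    }

    // Advance the clock and run every due task on the calling thread, in order.
    void advance(long deltaMs) {
        long target = nowMs + deltaMs;
        while (!tasks.isEmpty() && tasks.peek().dueMs <= target) {
            Task task = tasks.poll();
            nowMs = task.dueMs;
            task.action.run();
        }
        nowMs = target;
    }

    long nowMs() {
        return nowMs;
    }
}
```

A test built on such a timer only observes an expiration after it explicitly advances the clock past the deadline, which is why timing cannot make it flaky.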

stanislavkozlovski

LGTM!
Since this is an internal stability improvement, I'm wondering whether it would be worth backporting to some older versions as well. I'm not sure how we decide what gets backported in Kafka.

abbccdda · 2018-12-05T01:22:32Z

Why do we increase the max session timeout here?

Note this is just a test case. I needed a more reasonable value in order to verify the behavior in this patch which depends on a static timeout.

ijuma · 2018-12-08T17:37:00Z

cc @mumrah

guozhangwang

Just a minor comment, otherwise LGTM.

guozhangwang · 2018-12-10T07:04:35Z

Should we just call invokeJoinCallback which does other things like setting the callback to null, decrementing numMembersAwaitingJoin as well?

The reason I invoked the callback directly is that the member has already been removed from the group. I think that's probably why I didn't bother setting the callback to null as well. I think we can just change the callback first. Then all paths go through invokeJoinCallback.

guozhangwang · 2018-12-10T07:07:11Z

Hmm.. in the GroupCoordinator code, after we've triggered the callback, we actually did not set it to null. Maybe this does not affect the correctness of the logic, but I'm a bit concerned it is vulnerable to bugs in the future. Maybe we can just remove the callback after triggering it above (see my other comment)?
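The concern in this thread, that a callback left in place after being triggered invites future bugs, can be sketched as follows. This is a simplified illustration, not the actual implementation: `PendingMember`, `JoinGroupState`, and the `String` response type are hypothetical, and only the names invokeJoinCallback and numMembersAwaitingJoin are taken from the comments above.

```java
import java.util.function.Consumer;

// Hypothetical, simplified stand-in for a member with a pending JoinGroup response.
class PendingMember {
    final String memberId;
    Consumer<String> awaitingJoinCallback; // null when no response is pending

    PendingMember(String memberId) {
        this.memberId = memberId;
    }
}

// Hypothetical, simplified stand-in for the group's join bookkeeping.
class JoinGroupState {
    int numMembersAwaitingJoin = 0;

    void setJoinCallback(PendingMember member, Consumer<String> callback) {
        if (member.awaitingJoinCallback == null) {
            numMembersAwaitingJoin++;
        }
        member.awaitingJoinCallback = callback;
    }

    // Single exit path: clear the callback, fix the counter, then respond.
    // Returns false if there was nothing to invoke, so a second call is a no-op.
    boolean invokeJoinCallback(PendingMember member, String result) {
        Consumer<String> callback = member.awaitingJoinCallback;
        if (callback == null) {
            return false;
        }
        member.awaitingJoinCallback = null;
        numMembersAwaitingJoin--;
        callback.accept(result);
        return true;
    }
}
```

Routing every completion through one method like this means the response is sent at most once and the counter cannot drift, regardless of which code path finishes the request.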

…ayed (#5962)

When a consumer first joins a group, it doesn't have an assigned memberId. If the rebalance is delayed for some reason, the client may disconnect after a request timeout and retry. Since the client had not received its memberId, then we do not have a way to detect the retry and expire the previously generated member id. This can lead to unbounded growth in the size of the group until the rebalance has completed.

This patch fixes the problem by proactively completing all JoinGroup requests for new members after a timeout of 5 minutes. If the client is still around, we expect it to retry.

Reviewers: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Boyang Chen <bchen11@outlook.com>, Guozhang Wang <wangguoz@gmail.com>

guozhangwang · 2018-12-11T00:11:18Z

Also cherry-picked to 2.1

abbccdda · 2018-12-28T23:23:24Z

+    // timeout during a long rebalance), they may simply retry which will lead to a lot of defunct
+    // members in the rebalance. To prevent this going on indefinitely, we timeout JoinGroup requests
+    // for new members. If the new member is still there, we expect it to retry.
+    completeAndScheduleNextExpiration(group, member, NewMemberJoinTimeoutMs)


Why do we define a new NewMemberJoinTimeoutMs instead of using the session timeout that the member sends in its JoinGroup request?

As stated above, new members do not have member ids and cannot start sending heartbeats, so the session timeout would not matter here. Thus we need a separate value just for this purpose.

stanislavkozlovski reviewed Nov 29, 2018

stanislavkozlovski approved these changes Nov 29, 2018

abbccdda reviewed Dec 5, 2018

guozhangwang reviewed Dec 10, 2018

hachikuji added 2 commits December 10, 2018 10:03

KAFKA-7610; Use hard-coded timeout for new members joining the group

74de764

Invoke join callback before removal from group

259f706

hachikuji force-pushed the KAFKA-7610-STATIC-TIMEOUT branch from cb752ae to 259f706 on December 10, 2018 18:22

guozhangwang merged commit 20069b3 into apache:trunk Dec 10, 2018

abbccdda reviewed Dec 28, 2018

hachikuji mentioned this pull request Nov 27, 2019

KAFKA-9232: Coordinator new member heartbeat completion does not work for JoinGroup v3 #7753

Merged
