KAFKA-7610; Proactively timeout new group members if rebalance is delayed #5962
This is scheduled by an executor, right? I just want to make sure this test can't be flaky
I think tasks run by MockTimer just run in the foreground. Execution should be deterministic since we rely on MockTime under the covers.
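To illustrate why foreground execution keeps such a test deterministic, here is a minimal sketch of a manually advanced timer. `ManualTimer` and its methods are hypothetical names for this illustration and are not Kafka's `MockTime` or `MockTimer`; the only point is that due tasks run on the calling thread when the test advances the clock.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a mock clock plus timer (not Kafka's MockTime/MockTimer).
final class ManualTimer {
  private var nowMs: Long = 0L
  private var seq: Long = 0L
  // Pending tasks stored as (deadline in ms, insertion order, action).
  private val pending = mutable.ArrayBuffer.empty[(Long, Long, () => Unit)]

  def currentTimeMs: Long = nowMs

  // Register a task to run once the clock has advanced by delayMs.
  def schedule(delayMs: Long)(action: => Unit): Unit = {
    pending += ((nowMs + delayMs, seq, () => action))
    seq += 1
  }

  // Advance the clock and run every due task on the caller's thread,
  // ordered by deadline and then by insertion order. No executor is involved.
  def advance(ms: Long): Unit = {
    nowMs += ms
    val (due, later) = pending.partition(_._1 <= nowMs)
    pending.clear()
    pending ++= later
    due.sortBy(t => (t._1, t._2)).foreach(_._3.apply())
  }
}
```

Because nothing runs until `advance` is called, a test can assert on state immediately before and after a deadline without sleeping or waiting on another thread.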
stanislavkozlovski left a comment:
LGTM!
Since this is an internal stability improvement, I'm wondering whether it would be worth backporting it to some older versions as well. I'm not sure how we decide what gets backported in Kafka.
Why do we increase the max session timeout here?
Note this is just a test case. I needed a more reasonable value in order to verify the behavior in this patch which depends on a static timeout.
cc @mumrah
guozhangwang left a comment:
Just a minor comment, otherwise LGTM.
Should we just call invokeJoinCallback, which also does other things like setting the callback to null and decrementing numMembersAwaitingJoin?
The reason I invoked the callback directly is that the member has already been removed from the group. I think that's probably why I didn't bother setting the callback to null as well. I think we can just change the callback first. Then all paths go through invokeJoinCallback.
Hmm.. in the GroupCoordinator code, after we've triggered the callback, we actually did not set it to null. Maybe this does not affect the correctness of the logic, but I'm a bit concerned it is vulnerable to bugs in the future. Maybe we can just remove the callback after triggering it above (see my other comment)?
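The concern in this thread can be sketched as follows. This is a simplified illustration under assumed types, not the actual `GroupMetadata` or `MemberMetadata` code: `JoinResult`, `Member`, `Group`, and `maybeInvokeJoinCallback` are hypothetical names. It shows one way to funnel every completion through a single method that clears the callback and updates the awaiting-join count, so a second invocation becomes a no-op.

```scala
import scala.collection.mutable

// Hypothetical, simplified types for illustration only.
final case class JoinResult(memberId: String, error: Option[String])

final class Member(val memberId: String) {
  var awaitingJoinCallback: Option[JoinResult => Unit] = None
  def isAwaitingJoin: Boolean = awaitingJoinCallback.isDefined
}

final class Group {
  private val members = mutable.Map.empty[String, Member]
  private var awaiting = 0

  def numMembersAwaitingJoin: Int = awaiting

  // Assumes the member is new to the group and has no callback set yet.
  def add(member: Member, callback: JoinResult => Unit): Unit = {
    member.awaitingJoinCallback = Some(callback)
    members.put(member.memberId, member)
    awaiting += 1
  }

  def remove(memberId: String): Option[Member] = members.remove(memberId)

  // Single completion path: clear the callback and decrement the counter
  // before invoking, so the callback can never fire twice for one join.
  def maybeInvokeJoinCallback(member: Member, result: JoinResult): Unit = {
    member.awaitingJoinCallback.foreach { callback =>
      member.awaitingJoinCallback = None
      awaiting -= 1
      callback(result)
    }
  }
}
```

In this sketch the method only touches the member and the counter, so it still works for a member that has already been removed from the group's member map.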
cb752ae to 259f706
…ayed (#5962)

When a consumer first joins a group, it doesn't have an assigned memberId. If the rebalance is delayed for some reason, the client may disconnect after a request timeout and retry. Since the client had not received its memberId, then we do not have a way to detect the retry and expire the previously generated member id. This can lead to unbounded growth in the size of the group until the rebalance has completed. This patch fixes the problem by proactively completing all JoinGroup requests for new members after a timeout of 5 minutes. If the client is still around, we expect it to retry.

Reviewers: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Boyang Chen <bchen11@outlook.com>, Guozhang Wang <wangguoz@gmail.com>
Also cherry-picked to 2.1
// timeout during a long rebalance), they may simply retry which will lead to a lot of defunct
// members in the rebalance. To prevent this going on indefinitely, we timeout JoinGroup requests
// for new members. If the new member is still there, we expect it to retry.
completeAndScheduleNextExpiration(group, member, NewMemberJoinTimeoutMs)
Why do we define a new NewMemberJoinTimeoutMs instead of using the session timeout supplied by the member?
As stated above, for new members they do not have member ids and cannot start sending heartbeats, so session timeout would not matter here. Thus we need a separate value just for this purpose.
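A minimal sketch of the distinction described in this reply, under assumed names: `MemberState`, its `isNew` flag, and `nextExpirationMs` are hypothetical and only illustrate choosing a fixed timeout for members that cannot heartbeat yet. The 5 minute value is the one stated in the commit message.

```scala
object ExpirationPolicy {
  // Fixed timeout for members that have not yet received a member id
  // (5 minutes, as stated in the commit message).
  val NewMemberJoinTimeoutMs: Int = 5 * 60 * 1000

  // Hypothetical view of a member; isNew means it has no confirmed member id yet.
  final case class MemberState(isNew: Boolean, sessionTimeoutMs: Int)

  // A new member cannot send heartbeats, so its session timeout is not
  // meaningful yet; use the fixed join timeout until it has joined.
  def nextExpirationMs(nowMs: Long, member: MemberState): Long = {
    val timeoutMs = if (member.isNew) NewMemberJoinTimeoutMs else member.sessionTimeoutMs
    nowMs + timeoutMs
  }
}
```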
When a consumer first joins a group, it doesn't have an assigned memberId. If the rebalance is delayed for some reason, the client may disconnect after a request timeout and retry. Since the client had not received its memberId, then we do not have a way to detect the retry and expire the previously generated member id. This can lead to unbounded growth in the size of the group until the rebalance has completed.
This patch fixes the problem by proactively completing all JoinGroup requests for new members after a timeout of 5 minutes. If the client is still around, we expect it to retry.
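The growth described above can be illustrated with a small back-of-the-envelope model. This is a sketch under stated assumptions, not a model of the coordinator's real behavior: it assumes a single client that retries exactly once per request timeout, that each retry creates a fresh member id, and that a pending member expires exactly when its timeout elapses. The object and parameter names are hypothetical.

```scala
object JoinRetrySimulation {
  // Number of pending member ids left behind by one retrying client
  // at the moment a delayed rebalance finally completes.
  // newMemberTimeoutMs = None models the behavior without any expiration.
  def pendingMembers(rebalanceDelayMs: Long,
                     requestTimeoutMs: Long,
                     newMemberTimeoutMs: Option[Long]): Int = {
    // One join attempt at t = 0, requestTimeoutMs, 2 * requestTimeoutMs, ...
    val joinTimes = 0L until rebalanceDelayMs by requestTimeoutMs
    newMemberTimeoutMs match {
      case None          => joinTimes.size
      case Some(timeout) => joinTimes.count(t => t + timeout > rebalanceDelayMs)
    }
  }
}
```

With an assumed 30 second request timeout and a rebalance delayed by 30 minutes, `pendingMembers(1800000L, 30000L, None)` counts 60 stale members, while a 5 minute timeout (`Some(300000L)`) leaves 9. In this model the bounded count depends only on the ratio of the two timeouts, not on how long the rebalance is delayed.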