KAFKA-7965 (part-1): Fix one case which makes ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup flaky #8437
Does the first consumer, which is kicked out, rejoin the group? Is the number of consumers in the group N - 1 or N - 2 after the rebalance completes?
@chia7712 No, it does not rejoin the group because the "pollers" used in the test suite fail fast. The number of consumers in the group after the rebalance is complete is N-2.
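To make "fail fast" concrete, here is a minimal sketch of a poller that records the first exception and stops instead of retrying. It is not the actual ConsumerAssignmentPoller: the class name FailFastPoller and the pollOnce callback are hypothetical stand-ins, and no Kafka client is involved.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a fail-fast poller; not the real test helper.
final class FailFastPoller implements Runnable {
    // One poll step; returns false when there is nothing more to do.
    private final Callable<Boolean> pollOnce;
    // First exception seen, or null if the poller never failed.
    private final AtomicReference<Exception> failure = new AtomicReference<>();

    FailFastPoller(Callable<Boolean> pollOnce) {
        this.pollOnce = pollOnce;
    }

    @Override
    public void run() {
        try {
            while (pollOnce.call()) {
                // keep polling until told to stop
            }
        } catch (Exception e) {
            // Record the error and leave the loop: no retry, no rejoin.
            failure.set(e);
        }
    }

    Exception failure() {
        return failure.get();
    }
}
```

Under this model, a poller whose consumer is kicked out of the group never polls again, which is why an evicted member does not come back and the group shrinks by one for each eviction.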
@dajac Thanks, interesting investigation. I think my only concern with the patch is that it affects all of the tests in
Considering the nature of the edge case itself, would you consider it a bug? It definitely seems less than ideal that leader changes could cause some members to be kicked unnecessarily. I am wondering if we can change the eviction logic to make the process more reliable. As I understand it, the way it works currently is the following:
Combining these two, it seems that following a coordinator reload, we will always end up evicting the first members that rejoin the group. That seems surprising. Could we change that so that the last members to rejoin the group are kicked instead? If we did that, then leader changes wouldn't be a problem (I think).
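The difference between the two eviction policies discussed above can be sketched with a simplified model. This is not the group coordinator's code: the class EvictionSketch, its method names, and the representation of members as strings in rejoin order are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of picking members to evict from an
// oversized group; rejoinOrder lists member ids in the order they rejoined.
final class EvictionSketch {
    // Number of members above the configured maximum group size.
    private static int excess(List<String> rejoinOrder, int maxSize) {
        return Math.max(0, rejoinOrder.size() - maxSize);
    }

    // Behavior described in the comment: the first members to rejoin are kicked.
    static List<String> evictFirstToRejoin(List<String> rejoinOrder, int maxSize) {
        return new ArrayList<>(rejoinOrder.subList(0, excess(rejoinOrder, maxSize)));
    }

    // Proposed alternative: the last members to rejoin are kicked instead.
    static List<String> evictLastToRejoin(List<String> rejoinOrder, int maxSize) {
        int size = rejoinOrder.size();
        return new ArrayList<>(rejoinOrder.subList(size - excess(rejoinOrder, maxSize), size));
    }
}
```

With rejoin order [a, b, c] and a maximum size of 2, the first policy selects a and the second selects c. Whether the second policy fully removes the sensitivity to leader changes depends on coordinator details that this sketch does not model.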
@hachikuji I understand your concern. It is probably not a good idea to disable it for all the tests in
Yeah, you're right. The behavior is less than ideal, so we could consider it a bug. Evicting the last members that rejoin the group sounds like a good way to fix the root cause of the issue. I will give it a shot. Thanks for your feedback!
I have been investigating ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup during the last week. I have identified two cases that make it fail from time to time, especially under tight resource constraints. This PR explains and proposes a fix for the first case.

In a nutshell, two consumers are kicked out of the group because of the preferred leader election:

The ConsumerAssignmentPoller instances stop themselves when an exception is raised and report the exception. The test therefore fails because two consumers have been kicked out of the group, whereas it expects only one to be kicked out.

To mitigate this, I propose to disable AutoLeaderRebalanceEnableProp for all the tests in ConsumerBounceTest. Automatic leader rebalancing makes things unpredictable and therefore increases the risk of flakiness.

I haven't been able to reproduce this failure with this fix. To verify, I ran the single test for 24+ hours in a while loop within a Docker container with limited resources.
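As a rough illustration of the mitigation, the sketch below sets the broker property behind AutoLeaderRebalanceEnableProp, auto.leader.rebalance.enable, to false on a set of broker overrides. The class and method names are hypothetical, and how the test suite actually passes these overrides to its brokers is not shown here.

```java
import java.util.Properties;

// Hypothetical helper; only the property key is taken from Kafka's broker config.
final class BrokerOverridesSketch {
    // Broker setting that AutoLeaderRebalanceEnableProp refers to.
    static final String AUTO_LEADER_REBALANCE_ENABLE = "auto.leader.rebalance.enable";

    // Returns a copy of the given overrides with automatic preferred
    // leader election turned off.
    static Properties withoutAutoLeaderRebalance(Properties base) {
        Properties props = new Properties();
        props.putAll(base);
        props.setProperty(AUTO_LEADER_REBALANCE_ENABLE, "false");
        return props;
    }
}
```

Copying the input rather than mutating it keeps the helper safe to reuse across tests that share a base configuration.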
Below, you can find the relevant traces captured when the test failed.