KAFKA-12495: allow consecutive revoke in incremental cooperative assignor in connector#10367
showuon wants to merge 4 commits into apache:trunk
Conversation
change (1): remove unused canRevoke
This is not used anywhere else, so delete it.
change (2): compute the current worker assignment excluding deletions and duplicated assignments. If, after excluding deletions and duplicated assignments, there are still workers with an assignment higher than totalTasksWeHave / totalWorkers, we still need to revoke more tasks.
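A minimal sketch of what change (2) describes — drop deleted and duplicated tasks from each worker's view, then flag workers still above the floor average. All names (`withoutDuplication`, `overloaded`, the worker/task labels) are illustrative assumptions, not the actual Connect code:

```java
import java.util.*;

public class DedupSketch {
    // Keep each task only once (first owner wins) and skip deleted tasks.
    static Map<String, Set<String>> withoutDuplication(Map<String, List<String>> assignment,
                                                       Set<String> deleted) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        Set<String> seen = new HashSet<>();
        for (Map.Entry<String, List<String>> e : assignment.entrySet()) {
            Set<String> kept = new LinkedHashSet<>();
            for (String task : e.getValue()) {
                if (!deleted.contains(task) && seen.add(task)) {
                    kept.add(task);
                }
            }
            result.put(e.getKey(), kept);
        }
        return result;
    }

    // Workers whose remaining load exceeds totalTasks / totalWorkers still need revocation.
    static List<String> overloaded(Map<String, Set<String>> assignment, int totalTasks) {
        int avg = totalTasks / assignment.size();
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : assignment.entrySet()) {
            if (e.getValue().size() > avg) out.add(e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> current = new LinkedHashMap<>();
        current.put("worker1", Arrays.asList("T0-0", "T0-1", "T0-2", "T0-2")); // duplicated T0-2
        current.put("worker2", Arrays.asList("T0-3", "T1-0"));                 // T1-0 was deleted
        Map<String, Set<String>> clean = withoutDuplication(current, Collections.singleton("T1-0"));
        // 4 configured tasks across 2 workers: average is 2, so worker1 (3 tasks) is over.
        System.out.println(overloaded(clean, 4));
    }
}
```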
change (3): pass configured (the total tasks we have) and currentWorkerAssignmentWithoutDuplication into the performTaskRevocation method.
Can you clarify why this change is necessary? I ran the new testTaskAssignmentWhenWorkerJoinAfterRevocation test case with and without it, and although it fails without this change, it looks like that's more due to frail testing logic with the assertAssignment method than an actual bug in the rebalancing logic here. If I remove the assertAssignment calls but manually check on the distribution of C/T across the cluster after the fifth rebalance, everything is balanced.
I've also produced a test case that fails with this change but succeeds without it:
@Test
public void testNewWorkerAndNewTasksInSameRound() {
    doReturn(Collections.EMPTY_MAP).when(assignor).serializeAssignments(assignmentsCapture.capture());

    // Start with 40 tasks
    configState = clusterConfigState(offset, 1, 40);
    when(coordinator.configSnapshot()).thenReturn(configState);

    // Start with three workers
    memberConfigs = memberConfigs(leader, offset, 0, 2);
    expectGeneration();
    assignor.performTaskAssignment(leader, offset, memberConfigs, coordinator, protocolVersion);
    ++rebalanceNum;
    returnedAssignments = assignmentsCapture.getValue();
    assertDelay(0, returnedAssignments);
    expectedMemberConfigs = memberConfigs(leader, offset, returnedAssignments);
    assertNoReassignments(memberConfigs, expectedMemberConfigs);
    applyAssignments(returnedAssignments);
    memberConfigs = memberConfigs(leader, offset, assignments);

    // Add 2 tasks
    configState = clusterConfigState(offset, 1, 42);
    when(coordinator.configSnapshot()).thenReturn(configState);

    // Add a worker
    memberConfigs.put("worker3", new ExtendedWorkerState(leaderUrl, offset, null));
    expectGeneration();
    assignor.performTaskAssignment(leader, offset, memberConfigs, coordinator, protocolVersion);
    ++rebalanceNum;
    returnedAssignments = assignmentsCapture.getValue();
    assertDelay(0, returnedAssignments);
    expectedMemberConfigs = memberConfigs(leader, offset, returnedAssignments);
    assertNoReassignments(memberConfigs, expectedMemberConfigs);
    applyAssignments(returnedAssignments);
    memberConfigs = memberConfigs(leader, offset, assignments);

    // Rebalance once more as a follow-up to task revocation
    expectGeneration();
    assignor.performTaskAssignment(leader, offset, memberConfigs, coordinator, protocolVersion);
    ++rebalanceNum;
    returnedAssignments = assignmentsCapture.getValue();
    assertDelay(0, returnedAssignments);
    expectedMemberConfigs = memberConfigs(leader, offset, returnedAssignments);
    assertNoReassignments(memberConfigs, expectedMemberConfigs);
    applyAssignments(returnedAssignments);
    memberConfigs = memberConfigs(leader, offset, assignments);

    assertBalancedAssignments(memberConfigs);

    verify(coordinator, times(rebalanceNum)).configSnapshot();
    verify(coordinator, times(rebalanceNum)).leaderState(any());
    verify(coordinator, times(2 * rebalanceNum)).generationId();
    verify(coordinator, times(rebalanceNum)).memberId();
    verify(coordinator, times(rebalanceNum)).lastCompletedGenerationId();
}
private void assertBalancedAssignments(Map<String, ExtendedWorkerState> existingAssignments) {
    List<Integer> connectorCounts = existingAssignments.values().stream()
            .map(e -> e.assignment().connectors().size())
            .sorted()
            .collect(Collectors.toList());
    List<Integer> taskCounts = existingAssignments.values().stream()
            .map(e -> e.assignment().tasks().size())
            .sorted()
            .collect(Collectors.toList());
    int minConnectors = connectorCounts.get(0);
    int maxConnectors = connectorCounts.get(connectorCounts.size() - 1);
    int minTasks = taskCounts.get(0);
    int maxTasks = taskCounts.get(taskCounts.size() - 1);
    assertTrue(
        "Assignments are imbalanced. The spread of connectors across each worker is: " + connectorCounts,
        maxConnectors - minConnectors <= 1
    );
    assertTrue(
        "Assignments are imbalanced. The spread of tasks across each worker is: " + taskCounts,
        maxTasks - minTasks <= 1
    );
}
As the comment says, this else block means scheduledRebalance == 0, so we don't need to log the scheduledRebalance and now values.
This revocation is unnecessary because we revoked connector1 and all of connector1's tasks in the previous round, so when entering this round, the assignment is:
W1: connectors: [C0], tasks: [T0-0, T0-1, T0-2, T0-3]
W2: connectors: [], tasks: []
We can just assign connector1 and 4 of its tasks to W2, and complete the rebalance.
However, before my change, in the performTaskRevocation method, we used activeAssignment as the total tasks, so we'd get the average number of tasks each worker can have = 4 (total active tasks) / 2 (total workers) = 2, then revoke 2 tasks, and then assign them to the 2 workers in the next round.
After my change, the avg number of tasks each worker can have will be: 8 (total tasks) / 2 (total workers) = 4, so no tasks will be revoked.
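The before/after arithmetic in this comment can be sketched as follows. The method name and scenario numbers come straight from the comment above; the helper itself is a hypothetical simplification, not the assignor's actual code:

```java
public class RevocationAvgSketch {
    // How many tasks a worker must give up, given the total used for the
    // average: floor(totalTasks / workerCount) is the target load per worker.
    static int tasksToRevoke(int totalTasks, int workerCount, int workerLoad) {
        int avg = totalTasks / workerCount;
        return Math.max(0, workerLoad - avg);
    }

    public static void main(String[] args) {
        // Scenario from the comment: W1 holds 4 tasks, 2 workers in the group.
        // Before the change, only the 4 *active* tasks are counted: avg = 2,
        // so 2 tasks get revoked unnecessarily.
        System.out.println(tasksToRevoke(4, 2, 4)); // 2
        // After the change, the 8 *configured* tasks are counted: avg = 4,
        // so nothing is revoked and the revoked-last-round tasks can be assigned.
        System.out.println(tasksToRevoke(8, 2, 4)); // 0
    }
}
```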
Same as the above comments. Before my change, we revoke 8 tasks in the 1st round and then 2 tasks in the next round. After my change, we revoke 10 tasks in total in 1 round.
Collection<WorkerLoad> completeWorkerAssignmentWithoutDuplication) {
    int totalConnectorsNum = allConnectorsAndTasks.connectors().size();
    int totalTasksNum = allConnectorsAndTasks.tasks().size();
    Collection<WorkerLoad> existingWorkers = completeWorkerAssignmentWithoutDuplication.stream()
Use allConnectorsAndTasks (configured) to compute the total connectors/tasks number, and use the completeWorkerAssignmentWithoutDuplication to compute the existing workers. So it'll always compute the correct expected connector/task number for each worker
change (4): improve test readability by adding the final assignment for each phase of the rebalance. It'll let other devs/users better understand how the tests go and how the algorithm works.
@kkonstantine @rhauch @ramesh-muthusamy , could you help review this PR to fix the uneven distribution in the incremental cooperative assignor in Connect? Thanks.
…arding multiple worker joins during consecutive rebalance 2. Extended IncrementalCooperativeAssignor to IncrementalCooperativeAPMAssignor. Added appropriate protocols etc. to be able to use the new assignor 3. Currently it refers to the same code as assignment logic is yet to be changed
@kkonstantine , could you please check this PR when available? Thank you.
@kkonstantine , could you check this PR? Or should I find someone else to review this PR, since it's been 3 months? Thanks.
@kkonstantine , I suddenly found this is a V3.0 blocker bug. Could you help take a look? Thanks.
@kkonstantine , call for review and comments. Thanks.
Thanks @showuon, and sorry that this hasn't gotten more attention sooner.
This is my first time going through this part of the code base in detail and it's taken a bit longer than expected to get up to speed here. I've left some comments on specific parts of the changes here, and I have a few other general thoughts that have come up while getting acquainted with this logic that aren't directly related to your PR but that I think might be worth discussing since we're in the neighborhood. If you'd prefer to keep things as focused as possible feel free to ignore :)
- Might it make sense to change the order of events so that we assign new connectors and tasks first, as evenly as possible, and only perform a revocation afterward if still necessary? In practice I don't think this will make a difference very often (it would require the number of workers and the set of currently-configured C/T in a cluster to change in the same round of rebalance, I think), but it may provide benefit in clusters with frequent churn. Covered by KAFKA-13764.
- If the number of tasks assigned to a worker decreases to the point where the cluster becomes imbalanced, will we ever revoke tasks from other workers in order to assign them to that worker and balance the cluster? It looks like performTaskRevocation only does anything if the number of workers in the cluster has changed; should we consider updating or refining that logic? I can imagine a case where there are W workers in a cluster and C connectors running in that cluster, each with W tasks. If the assignment of tasks across that cluster is in perfect round-robin fashion, then for each connector, its final task will be running on the same worker (worker w); if each of those connectors is then reconfigured to use W-1 tasks, that would lead to worker w now having C fewer tasks running on it, which could lead to an imbalanced cluster. Of course this particular example may be too specific to be practical, but the general concept of being able to respond to changes in connector size over time while preserving balanced allocation seems worth considering. Covered by KAFKA-13764.
- It looks like we're explicitly revoking deleted C/T during rebalance, and that this is the only mechanism by which workers in the cluster learn that they should stop running those deleted C/T. This seems a little strange, and potentially unsafe. If a worker misses a rebalance and then rejoins the group without having already revoked its C/T, it appears that there's no check in place right now to revoke C/T from that worker that have been deleted in the meantime, since the set of deleted C/T is derived by taking the diff of the previous assignment made by the leader and the set of C/T in the latest view of the config topic. The information about which C/T should be running across the entire cluster is already consistently available to every worker in the cluster after a rebalance, by distributing an offset in the config topic that every worker should have read up to--should we add a check in the DistributedHerder class to pick up on connector deletions from the config topic and apply them directly? I also think there's a potential case where a worker might lose contact with the config topic for long enough that a connector deletion (which is recorded by a tombstone message in the config topic) ends up being missed by the worker, if topic compaction takes place and that tombstone (and all preceding records with the same key) is dropped from the config topic before the worker is able to resume reading from it. EDIT: It looks like there actually is logic in the DistributedHerder to ensure that every C/T it's running after a rebalance is still present in its view of the config topic, which may make this redundant (see KAFKA-13631 for more detail). We don't necessarily have to do this instead of explicitly revoking deleted C/T during rebalance; it'd probably be safer to just do both, if we decide to add this check at all, to avoid increasing the risk of a regression. However, this check may still not be sufficient to catch dropped tombstones from the config topic.
- I had a pretty hard time reading and understanding the new test case you introduced, although I think you wrote it as clearly and concisely as possible while following the patterns of existing IncrementalCooperativeAssignorTest test cases. The comments you added certainly helped, but there's also a ton of duplicated code that could be simplified into testing utility functions, and some variable names that are at best misleading and at worst inaccurate. Since you've taken the initiative to improve the readability of these tests in this PR with your comments, what do you think about refactoring some of the testing logic as well? I have a local draft that I'd be happy to share if you'd be interested in adding it directly to this PR, or if you think it's worth pursuing but out of scope, I can file a Jira and separate PR. Covered by KAFKA-13764.
- I think there's a bug in this line; shouldn't we be combining the two maps (like we do here) instead of potentially overwriting the contents of one with the other via Map::putAll? Obviously this isn't caused by your change but it's simple enough it felt worth pointing out. Covered by KAFKA-13764.
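The W-workers/C-connectors scenario from the second point above can be simulated in a few lines. This is a toy model of the hypothetical round-robin layout described in the comment (the layout formula is an assumption for illustration, not the assignor's actual placement code):

```java
import java.util.Arrays;

public class ShrinkSketch {
    // Round-robin layout assumption: connector c's task t runs on worker (c * W + t) % W.
    static int[] roundRobinLoad(int W, int C) {
        int[] load = new int[W];
        for (int c = 0; c < C; c++)
            for (int t = 0; t < W; t++)
                load[(c * W + t) % W]++;
        return load;
    }

    // Every connector is reconfigured from W tasks down to W - 1: its final
    // task disappears, and under this layout that task always lived on the
    // same worker, which therefore loses C tasks at once.
    static int[] afterShrink(int W, int C) {
        int[] load = roundRobinLoad(W, C);
        for (int c = 0; c < C; c++)
            load[(c * W + (W - 1)) % W]--;
        return load;
    }

    public static void main(String[] args) {
        // W = 3 workers, C = 4 connectors with 3 tasks each.
        System.out.println(Arrays.toString(roundRobinLoad(3, 4))); // [4, 4, 4]
        System.out.println(Arrays.toString(afterShrink(3, 4)));    // [4, 4, 0]
    }
}
```

The cluster goes from perfectly balanced to one idle worker, yet no worker joined or left, so a revocation keyed only on membership changes never fires.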
* @param completeWorkerAssignment
* @return
* @param allConnectorsAndTasks all the connectors and tasks we need to distribute
* @param completeWorkerAssignmentWithoutDuplication current workers assignment without duplication
I think the existing variable name is fine; we can definitely update the Javadoc to clarify that this assignment should exclude duplicated and to-be-deleted C/T, but something this long is a little hard to read.
// W1: assignedTasks:[], assignedTasks:[],
// revokedConnectors:[], revokedTasks:[T0-3]
// W2: assignedTasks:[C1], assignedTasks:[T1-0, T1-1]
// revokedConnectors:[] revokedTasks:[]
// W3: assignedTasks:[], assignedTasks:[T1-2, T1-3]
// revokedConnectors:[] revokedTasks:[]
Suggested change:
// W1: assignedConnectors:[], assignedTasks:[],
// revokedConnectors:[], revokedTasks:[T0-3]
// W2: assignedConnectors:[C1], assignedTasks:[T1-0, T1-1]
// revokedConnectors:[] revokedTasks:[]
// W3: assignedConnectors:[], assignedTasks:[T1-2, T1-3]
// revokedConnectors:[] revokedTasks:[]
// W1: assignedTasks:[], assignedTasks:[],
// revokedConnectors:[], revokedTasks:[T0-2]
// W2: assignedTasks:[], assignedTasks:[]
// revokedConnectors:[] revokedTasks:[]
// W3: assignedTasks:[], assignedTasks:[]
// revokedConnectors:[] revokedTasks:[]
// W4: assignedTasks:[], assignedTasks:[T0-3]
// revokedConnectors:[] revokedTasks:[]
Suggested change:
// W1: assignedConnectors:[], assignedTasks:[],
// revokedConnectors:[], revokedTasks:[T0-2]
// W2: assignedConnectors:[], assignedTasks:[]
// revokedConnectors:[] revokedTasks:[]
// W3: assignedConnectors:[], assignedTasks:[]
// revokedConnectors:[] revokedTasks:[]
// W4: assignedConnectors:[], assignedTasks:[T0-3]
// revokedConnectors:[] revokedTasks:[]
// W1: assignedTasks:[], assignedTasks:[],
// revokedConnectors:[], revokedTasks:[]
// W2: assignedTasks:[], assignedTasks:[]
// revokedConnectors:[] revokedTasks:[]
// W3: assignedTasks:[], assignedTasks:[]
// revokedConnectors:[] revokedTasks:[]
// W4: assignedTasks:[], assignedTasks:[T0-2]
// revokedConnectors:[] revokedTasks:[]
Suggested change:
// W1: assignedConnectors:[], assignedTasks:[],
// revokedConnectors:[], revokedTasks:[]
// W2: assignedConnectors:[], assignedTasks:[]
// revokedConnectors:[] revokedTasks:[]
// W3: assignedConnectors:[], assignedTasks:[]
// revokedConnectors:[] revokedTasks:[]
// W4: assignedConnectors:[], assignedTasks:[T0-2]
// revokedConnectors:[] revokedTasks:[]
assertDelay(0, returnedAssignments);
expectedMemberConfigs = memberConfigs(leader, offset, returnedAssignments);
assertNoReassignments(memberConfigs, expectedMemberConfigs);
assertAssignment(1, 4, 0, 1, "worker1", "worker2", "worker3");
This assertion style is useful for straightforward test cases but I wonder if we might want something more granular that allows us to assert how many C/T were assigned/revoked from individual workers (instead of across the entire cluster) for cases like this? Or, if that's difficult because of non-deterministic behavior caused by things like Java collections with undefined iteration order, could we at least have something that asserts how many workers should have a given total count of C/T in the cluster (e.g., "assert that 3 workers have 2 connectors assigned to them and 4 tasks, and that 1 worker has 1 connector assigned to it and 3 tasks") or how many workers were assigned/revoked a given number of C/T during the rebalance (e.g., "assert that 2 workers were assigned 2 tasks and revoked 3 tasks, and that 1 worker was assigned 0 tasks and revoked 4 tasks")?
The comments are useful for illustrating what the expectations are on that front, but they aren't testable and so there's no guarantee that they're actually correct. And in fact, after running this through a debugger, I was seeing the correct number of C/T being allocated/revoked during each round, but the actual C/T names (i.e., T0-0 vs T1-1) were different from what's described in the comments.
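One way to make the suggestion above concrete: group workers by their (connectors, tasks) counts, so a test can assert "3 workers have 2 connectors and 4 tasks; 1 worker has 1 connector and 3 tasks" without depending on which worker got which. This is a hypothetical helper sketch (names and the `int[]` stand-in for a worker's assignment are assumptions), not existing test code:

```java
import java.util.*;
import java.util.stream.Collectors;

public class DistributionSketch {
    // Map each (connectorCount, taskCount) pair to the number of workers carrying it.
    static Map<List<Integer>, Long> distribution(Map<String, int[]> counts) {
        return counts.values().stream()
                .map(c -> Arrays.asList(c[0], c[1]))    // [connectors, tasks]
                .collect(Collectors.groupingBy(c -> c, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, int[]> cluster = new HashMap<>();
        cluster.put("worker1", new int[] {2, 4});
        cluster.put("worker2", new int[] {2, 4});
        cluster.put("worker3", new int[] {2, 4});
        cluster.put("worker4", new int[] {1, 3});
        Map<List<Integer>, Long> dist = distribution(cluster);
        // 3 workers carry (2 connectors, 4 tasks); 1 worker carries (1, 3).
        System.out.println(dist.get(Arrays.asList(2, 4))); // 3
        System.out.println(dist.get(Arrays.asList(1, 3))); // 1
    }
}
```

The same shape works for asserting per-rebalance deltas (assigned/revoked counts) instead of absolute loads.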
assertDelay(0, returnedAssignments);
expectedMemberConfigs = memberConfigs(leader, offset, returnedAssignments);
assertNoReassignments(memberConfigs, expectedMemberConfigs);
assertAssignment(0, 1, 0, 0, "worker1", "worker2", "worker3", "worker4");
By this point, we should be completely balanced, but there's no explicit testing logic to verify that. What do you think about adding a utility method like assertBalancedAssignments and then invoking it here (and possibly other places in this test suite)?
private void assertBalancedAssignments(Map<String, ExtendedWorkerState> existingAssignments) {
    List<Integer> connectorCounts = existingAssignments.values().stream()
            .map(e -> e.assignment().connectors().size())
            .sorted()
            .collect(Collectors.toList());
    List<Integer> taskCounts = existingAssignments.values().stream()
            .map(e -> e.assignment().tasks().size())
            .sorted()
            .collect(Collectors.toList());
    int minConnectors = connectorCounts.get(0);
    int maxConnectors = connectorCounts.get(connectorCounts.size() - 1);
    int minTasks = taskCounts.get(0);
    int maxTasks = taskCounts.get(taskCounts.size() - 1);
    assertTrue(
        "Assignments are imbalanced. The spread of connectors across each worker is: " + connectorCounts,
        maxConnectors - minConnectors <= 1
    );
    assertTrue(
        "Assignments are imbalanced. The spread of tasks across each worker is: " + taskCounts,
        maxTasks - minTasks <= 1
    );
}

It might also make these tests easier to write and modify (if we need to tweak rebalancing logic again in the future) if we used this type of method instead of the existing assertAssignment one, since in many cases all that really matters is that we achieve a balanced allocation after a specific series of rebalances, instead of exactly how many C/T were assigned/revoked in the interim.
@C0urante , thanks for your comments. TBH, I need some time to revisit the code (since it was a long time ago...), and will answer your comments later. Thank you.
Thanks @showuon. In that case, I can file separate issues for a lot of the comments I've made here, and we can try to keep this PR as focused as possible for the sake of moving forward.
@C0urante , sure, please file separate issues for the other comments. And thanks for the comments. However, I'm still concerned that @kkonstantine doesn't like the current solution, and would like to have another proposal as mentioned here. So, I think we still need to get his approval before we can continue. WDYT? cc @kkonstantine , we need your suggestions here, please!
@kkonstantine , sorry to keep pinging you, but we need your advice before we can continue. Thanks.
@showuon FYI, I've just opened #12019, which should address KAFKA-12495 and some other issues with rebalancing, but without using consecutive revocations.
Great! I'll take a look next week. Also cc @kkonstantine . Thanks.
@showuon @C0urante @kkonstantine What is the status of this PR? As far as I understand, this PR might resolve KAFKA-8391, KAFKA-12283 and KAFKA-12495. Is this correct? Those tickets block the 3.2.0 release.
@cadonna , that's correct (only KAFKA-8391 is not 100% sure). So far, we are waiting for @kkonstantine 's comments about the better solution for this issue, since he was concerned about the current solution and has a better solution for it (commented here).
Hi, it very probably also resolves KAFKA-10413.
Thanks for working on this fix @showuon. Apologies for taking so long to return here.

My main concern is related to the proposed change to apply consecutive rebalances that perform revocations. The current incremental cooperative rebalancing algorithm uses two consecutive rebalances in order to move tasks between workers: one rebalance during which revocations happen, and one during which the revoked tasks are reassigned. Although this is clearly not an atomic process (as this issue also demonstrates), I find that it's a good property to maintain and reason about. Allowing consecutive revocations that happen immediately when an imbalance is detected might mean that the workers overreact to external circumstances that caused an imbalance between the initial calculation of task assignments in the revocation rebalance and the subsequent rebalance for the assignment of revoked tasks. Such circumstances might have to do with rolling upgrades, scaling a cluster up or down, or simply temporary instability. We were first able to reproduce this issue in integration tests with the test that is currently disabled.

My main thought was that, instead of risking shuffling tasks too aggressively within a short period of time and opening the door to bugs that would make workers oscillate between imbalanced task assignments continuously and in a tight loop, we could use the existing mechanism of scheduling delayed rebalances to program workers to perform a pair of rebalances (revocation + reassignment) soon after an imbalance is detected. Regarding when an imbalance is detected, the good news is that the leader worker sending the assignment during the second rebalance of a pair of rebalances knows that it will send an imbalanced assignment (there's no code to detect that right now, but it can easily be added just before the assignment is sent).

The idea here would be to send this assignment anyway, but also schedule a follow-up rebalance that will have the opportunity to balance tasks soon with our standard pair of rebalances, which works dependably as long as no workers are added or removed between the two rebalances. We can discuss what a good setting for the delay is. One obvious possibility is to reuse the existing property; adding another config just for that seems unwarranted. To shield ourselves from infinite such rebalances, the leader should also keep track of how many such attempts have been made and stop attempting to balance out tasks after a certain number of tries. Of course, every other normal rebalance should reset both this counter and possibly the delay. I'd be interested to hear what you think of this approach, which is quite similar to what you have demonstrated already but potentially less risky in terms of changes in the assignor logic and how aggressively the leader attempts to fix an imbalance.
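The bookkeeping this proposal describes — schedule a delayed follow-up when an imbalanced assignment is about to be sent, cap the attempts, reset on a normal rebalance — could look roughly like this. Every name here is hypothetical; this is a sketch of the proposal, not code from the assignor:

```java
public class DelayedRebalanceSketch {
    static final int MAX_BALANCING_ATTEMPTS = 5; // cap to avoid an endless rebalance loop
    int balancingAttempts = 0;

    // Called by the leader once an assignment has been computed. Returns the
    // delay (ms) before a scheduled follow-up rebalance, 0 when no follow-up
    // is needed, or -1 when we give up rebalancing toward balance.
    long onAssignmentComputed(boolean imbalanced, long configuredDelayMs) {
        if (!imbalanced) {
            balancingAttempts = 0;       // a normal, balanced rebalance resets the counter
            return 0;
        }
        if (++balancingAttempts > MAX_BALANCING_ATTEMPTS) {
            return -1;                   // stop attempting to balance out tasks
        }
        return configuredDelayMs;        // reuse the existing delay property
    }

    public static void main(String[] args) {
        DelayedRebalanceSketch leader = new DelayedRebalanceSketch();
        for (int i = 0; i < 5; i++)
            System.out.println(leader.onAssignmentComputed(true, 1000));  // 1000 each time
        System.out.println(leader.onAssignmentComputed(true, 1000));     // -1: cap reached
        System.out.println(leader.onAssignmentComputed(false, 1000));    // 0: counter reset
    }
}
```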
@kkonstantine Thank you for your thoughts! From a 3.2.0 release perspective, your proposal seems to be a change that we should postpone to a later release, since feature freeze and code freeze have passed. Or is this a regression? If it is a regression, is there a quick intermediate fix that we can include in 3.2.0 to unblock the release? If it is not a regression, I would propose to link the corresponding tickets to each other and then move them to the next release, ideally with an assignee.
My comment is in the Jira ticket. Thanks.
jira: https://issues.apache.org/jira/browse/KAFKA-12495
Allow consecutive revocations in the incremental cooperative assignor in Connect, to fix the issue that when new members join right after a revocation round, it causes uneven distribution (please check the Jira for a better understanding). What I did:
1. Removed the unused canRevoke variable, since we now allow consecutive revoking (as long as delay == 0).
2. Added currentWorkerAssignmentWithoutDuplication, to remove duplicated connectors/tasks from currentWorkerAssignment, so that we can use it in the performTaskRevocation method to compute whether we need to revoke more connectors/tasks in this round, by checking whether the remaining assignments on each worker exceed totalTasks/totalWorkers.
3. Passed configured (the total tasks we have) into performTaskRevocation instead of activeAssignment, so that we can compute the correct expected max assignment number (totalSize / workerSize) instead of activeTotalSize / workerSize; activeTotalSize doesn't include the newAssignments, which causes the wrong computation and uneven rebalance, or requires more rounds of revoking rebalances.

With changes (2) and (3), we can still make sure the revocation is always correct, whether this is a consecutive revocation or we have duplicated assignments.