KAFKA-6145: Set HighAvailabilityTaskAssignor as default in streams_upgrade_test.py by cadonna · Pull Request #8613 · apache/kafka

cadonna · 2020-05-04T14:44:18Z

This PR sets HighAvailabilityTaskAssignor as the default task assignor in
streams_upgrade_test.py. The verification in the test needed to be
modified because the HighAvailabilityTaskAssignor surfaced a flakiness
in the test. More precisely, the verifications assume that the last
client that is bounced joins the group before the other two clients
are able to rebalance without the last client. This assumption does not
always hold.

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

cadonna · 2020-05-04T14:52:51Z

                                               timeout_sec=60,
                                               err_msg="Could not detect 'successful version probing' at upgrading node " + str(node.account))
-                    else:
-                        log_monitor.wait_until("Sent a version 8 subscription and got version 7 assignment back (successful version probing). Downgrade subscription metadata to commonly supported version 8 and trigger new rebalance.",


This verification is only true if the two other processors haven't rebalanced before the processor that bounced last re-joins the group. If the rebalance occurs, the commonly supported version is already at 8 when the last processor joins.

Actually, the test test_version_probing_upgrade is independent of the task assignor in use, but this issue was surfaced by the HighAvailabilityTaskAssignor and not by the StickyTaskAssignor. I cannot say for sure why.

IMO, removing this verification should be OK, since afterwards we check whether the processors have synchronized generations, which means that all three processors successfully joined the group in the end. The state that we no longer verify explicitly is the transient state where version 7 is currently used, but all processors are able to use version 8.

cadonna · 2020-05-04T15:08:39Z

Call for review: @vvcephei @ableegoldman

cadonna · 2020-05-04T15:16:38Z

System tests job: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/3928/

vvcephei

Thanks for digging into this, @cadonna !

Just a couple of comment/questions.

vvcephei · 2020-05-04T20:02:28Z

                                               timeout_sec=60,
                                               err_msg="Could not detect 'successful version probing' at upgrading node " + str(node.account))
-                    else:
-                        log_monitor.wait_until("Sent a version 8 subscription and got version 7 assignment back (successful version probing). Downgrade subscription metadata to commonly supported version 8 and trigger new rebalance.",


Thanks @cadonna ; I agree. This test should just be verifying that we first converge on 7, and then that we converge on 8.
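To make the "first converge on 7, then converge on 8" idea concrete, here is a minimal sketch in plain Python. It is not code from streams_upgrade_test.py. The function names and the snapshot format (a time-ordered list of dicts mapping a processor name to the metadata version it currently uses) are assumptions made for illustration only.

```python
def first_converged_index(snapshots, version, start=0):
    """Return the index of the first snapshot at or after `start` in which
    every processor uses `version`, or None if there is no such snapshot."""
    for i in range(start, len(snapshots)):
        versions = set(snapshots[i].values())
        if versions == {version}:
            return i
    return None


def converges_on_old_then_new(snapshots, old=7, new=8):
    """Check that all processors agree on `old` at some point and that,
    strictly later, all processors agree on `new`.

    Transient snapshots in between (mixed versions) are ignored, so the
    check does not depend on the order in which processors rejoin."""
    old_index = first_converged_index(snapshots, old)
    if old_index is None:
        return False
    return first_converged_index(snapshots, new, old_index + 1) is not None
```

Because the check only looks for the two converged states, it does not care whether the last bounced processor rejoins before or after the other two rebalance.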

vvcephei · 2020-05-04T20:02:48Z

        # TODO KIP-441: consider rewriting the test for HighAvailabilityTaskAssignor
        self.processor1 = StreamsUpgradeTestJobRunnerService(self.test_context, self.kafka)
-        self.processor1.set_config("internal.task.assignor.class", "org.apache.kafka.streams.processor.internals.assignment.StickyTaskAssignor")
+        self.processor1.set_config("internal.task.assignor.class", "org.apache.kafka.streams.processor.internals.assignment.HighAvailabilityTaskAssignor")


We can actually just delete these lines now.

vvcephei · 2020-05-04T20:05:59Z

                                               err_msg="Could not detect 'successful version probing' at upgrading node " + str(node.account))
-                    else:
-                        log_monitor.wait_until("Sent a version 8 subscription and got version 7 assignment back (successful version probing). Downgrade subscription metadata to commonly supported version 8 and trigger new rebalance.",
+                        log_monitor.wait_until("Detected that the assignor requested a rebalance. Rejoining the consumer group to trigger a new rebalance.",


I know that this check was here in some fashion before, but I'm drawing a blank on why we need to verify this log line. It seems like just checking the version number logs and nothing else would be the key to a long and happy life.

I think the idea is to verify that the actual version probing rebalance takes place, i.e., that the partition assignor actually handles the version probing once it's detected, and that it signals to the stream thread, which in turn also handles it correctly. But idk -- I've probably broken and fixed the version probing test 2 or 3 times now due to this one line in particular.

So, I'd be happy to see it go. I probably have too much bad history to make an unbiased call here though 😄

We can leave it because it verifies whether the assignment was triggered in the assignor, which is better than nothing. However, it does not give us any guarantee that the rebalance actually took place.

I guess what we would really need is a way to check if a group stabilized and if the assignment is valid. We try to do that by verifying that the generations of the processors are synced. However, I ran into cases where all processors had the same generation, but one processor did not have any tasks assigned. So we would actually need to check if they have the highest generation in sync across the processors AND if all processors have at least one task assigned (AND if all tasks were assigned).
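The three conditions above can be sketched as a single predicate. This is only an illustration in plain Python, not part of the system test. The function name, the input format (processor name mapped to its highest observed generation and its set of assigned task ids), and the example task ids are all hypothetical.

```python
def group_is_stable(processors, all_tasks):
    """Return True only if the group looks stable and the assignment valid.

    processors: dict mapping processor name -> (generation, set of task ids),
                where generation is the highest one seen for that processor.
    all_tasks:  iterable of every task id that must be assigned.
    """
    if not processors:
        return False

    # 1. All processors must report the same (highest) generation.
    generations = {generation for generation, _ in processors.values()}
    if len(generations) != 1:
        return False

    # 2. Every processor must own at least one task.
    assignments = [tasks for _, tasks in processors.values()]
    if any(not tasks for tasks in assignments):
        return False

    # 3. Every task must be assigned to some processor.
    assigned = set().union(*assignments)
    return assigned == set(all_tasks)
```

With this predicate, the case described above (same generation everywhere, but one processor without tasks) is rejected by the second condition instead of passing as a synced group.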

vvcephei

Thanks so much for this fix, @cadonna !

vvcephei · 2020-05-05T18:07:31Z

                                               err_msg="Could not detect 'successful version probing' at upgrading node " + str(node.account))
-                    else:
-                        log_monitor.wait_until("Sent a version 8 subscription and got version 7 assignment back (successful version probing). Downgrade subscription metadata to commonly supported version 8 and trigger new rebalance.",
+                        log_monitor.wait_until("Detected that the assignor requested a rebalance. Rejoining the consumer group to trigger a new rebalance.",


Thanks, all. This doesn't seem like the best way to verify what we're trying to verify, but it also seems about the same as before. I'm happy to leave this here for now.

If/when the test breaks again, I'd prefer for us to put in a more reliable and direct mechanism.

vvcephei · 2020-05-05T18:07:58Z

test this please

cadonna · 2020-05-05T19:26:31Z

Just in case, I re-ran the system tests: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/3929/

vvcephei · 2020-05-05T19:50:34Z

Thanks @cadonna , Let's see how those tests play out.

vvcephei · 2020-05-07T18:41:03Z

System test passed again: http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2020-05-05--001.1588709838--cadonna--fix_version_probing_system_test--b80ef74/report.html

vvcephei · 2020-05-07T18:41:40Z

The test results are gone now, unfortunately.

Retest this please

vvcephei · 2020-05-08T02:08:56Z

Actually, since the only thing that changed was a python system test file, it couldn't cause any of the integration test failures, so I'll go ahead and merge.

Here were the failures:
org.apache.kafka.streams.integration.EosBetaUpgradeIntegrationTest.shouldUpgradeFromEosAlphaToEosBeta[true]
org.apache.kafka.streams.integration.EosBetaUpgradeIntegrationTest.shouldUpgradeFromEosAlphaToEosBeta[true]
org.apache.kafka.streams.integration.QueryableStateIntegrationTest.shouldAllowConcurrentAccesses

org.apache.kafka.streams.integration.EosBetaUpgradeIntegrationTest.shouldUpgradeFromEosAlphaToEosBeta[false]

* 'trunk' of github.com:apache/kafka: KAFKA-9290: Update IQ related JavaDocs (apache#8114) KAFKA-9928: Fix flaky GlobalKTableEOSIntegrationTest (apache#8600) KAFKA-6145: Set HighAvailabilityTaskAssignor as default in streams_upgrade_test.py (apache#8613) KAFKA-9667: Connect JSON serde strip trailing zeros (apache#8230) MINOR: Log4j Improvements on Fetcher (apache#8629)

…grade_test.py (apache#8613) Generalize the verification in the upgrade test so that it does not rely on the task assignor's behavior. Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, John Roesler <vvcephei@apache.org>

vvcephei reviewed May 4, 2020

Remove calls to set the task assignor

b80ef74

vvcephei approved these changes May 5, 2020

vvcephei merged commit c19a3be into apache:trunk May 8, 2020

cadonna deleted the fix_version_probing_system_test branch May 20, 2020 08:07
