
KAFKA-10017: fix flaky EOS-beta upgrade test #9688

Merged
mjsax merged 3 commits into apache:trunk from mjsax:kafka-10017-eos-upgrade-test
Dec 11, 2020

Conversation


@mjsax (Member) commented Dec 4, 2020

Call for review @abbccdda @ableegoldman @guozhangwang

This PR is for trunk and 2.7. The PR for 2.6 is slightly different: #9690

mjsax added the streams and tests (Test fixes, including flaky tests) labels on Dec 4, 2020
-    private static final int MAX_POLL_INTERVAL_MS = 100 * 1000;
-    private static final int MAX_WAIT_TIME_MS = 60 * 1000;
+    private static final int MAX_POLL_INTERVAL_MS = (int) Duration.ofSeconds(100L).toMillis();
+    private static final long MAX_WAIT_TIME_MS = Duration.ofMinutes(1L).toMillis();
Member Author:

Side cleanup

// Note: this pattern only works when we just have a single instance running with a single thread
// If we want to extend the test or reuse this CommitPunctuator we should tighten it up
private final AtomicBoolean requestCommit = new AtomicBoolean(false);
private static class CommitPunctuator implements Punctuator {
Member Author:

This punctuator was an attempt to stabilize the test, but without success. Removing it, as this PR should be a proper fix now.

// p-2: 10 rec + C ---> 5 rec (pending)
// p-3: 10 rec + C ---> 5 rec (pending)
// crash case: (we just assume that we inject the error for p-0; in reality it might be a different partition)
// (we don't crash right away but write one fewer record first)
Member Author:

Added some more details/explanations and also renamed a few variables below.

waitForRunning(stateTransitions2);

final Set<Long> committedKeys = mkSet(0L, 1L, 2L, 3L);
final Set<Long> newlyCommittedKeys;
Member Author:

This is the first fix, i.e., how we compute those keys.

final List<KeyValue<Long, Long>> finishSecondBatch = prepareData(15L, 20L, 0L, 1L, 2L, 3L);
writeInputData(finishSecondBatch);

final List<KeyValue<Long, Long>> committedInputDataDuringUpgrade = uncommittedInputDataBeforeFirstUpgrade
Member Author:

This is the second fix: depending on task movement, we have a different set of committed records.

Contributor:

Nice catch.

Reminds me though: why would the second rebalance not be deterministic in migrating tasks back? I thought our algorithm should produce deterministic results? cc @ableegoldman

);

expectedUncommittedResult.addAll(
computeExpectedResult(finishSecondBatch, uncommittedState)
Member Author:

For this, we needed to preserve the old uncommittedState further above.

waitForRunning(stateTransitions2);

committedKeys.addAll(mkSet(0L, 1L, 2L, 3L));
newlyCommittedKeys.clear();
Member Author:

Similar fix as above: we compute those keys differently now.

final Set<Long> uncommittedKeys = mkSet(0L, 1L, 2L, 3L);
uncommittedKeys.removeAll(keysSecondClientAlphaTwo);
uncommittedKeys.removeAll(newlyCommittedKeys);
final List<KeyValue<Long, Long>> committedInputDataDuringUpgrade = uncommittedInputDataBeforeSecondUpgrade
Member Author:

Similar to above: we need to be more flexible (i.e., depend on the actual task movement).

Member:

I'm guessing the root source of this all is a bad assumption that the assignment would be stable if a stable CLIENT_ID was used? I remember we discussed that back when you first wrote this test, I'm sorry for any misinformation I supplied based on my own assumption about how the CLIENT_ID would be used :/

Member Author:

Yes, the test assumed a more stable task->thread mapping during the assignment. But it turns out that the task assignment may "flip" (not sure about the details).

Contributor:

@ableegoldman is it related to the UUID randomness? If yes please ignore my other question above.

Member:

Yes, I think so

properties.put(StreamsConfig.consumerPrefix(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG), (int) Duration.ofSeconds(5L).toMillis());
properties.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), (int) Duration.ofSeconds(5L).minusMillis(1L).toMillis());
properties.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), MAX_POLL_INTERVAL_MS);
properties.put(StreamsConfig.producerPrefix(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG), (int) Duration.ofMinutes(5L).toMillis());
Member Author:

I also increased the TX timeout from the too-low default of 10 seconds, to avoid a broker-side TX abort during the test.

Member:

5 minutes seems kind of long; the whole test should only take a few minutes and it has 11 phases. Would 1 minute be more reasonable? Or do we actually need this timeout to cover more than one or two phases?

Member Author:

Good catch -- I set it to 5 minutes during debugging (i.e., setting breakpoints). 1 minute should be enough.

Or do we actually need this timeout to cover more than one or two phases?

Not sure what you mean by this?

Member:

Was just thinking about how long a transaction might possibly be open. 1 minute SGTM.
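
For reference, a minimal sketch of the discussed change, assuming the same properties map and config keys shown in the diff above and the 1-minute value agreed on here (the committed code may use a different value):

// Raise the producer transaction timeout above the 10s default so the broker
// does not abort open transactions while the test (or a debugging session) is paused.
properties.put(
    StreamsConfig.producerPrefix(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG),
    (int) Duration.ofMinutes(1L).toMillis()
);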

MULTI_PARTITION_OUTPUT_TOPIC,
-    numberOfRecords
+    numberOfRecords,
+    MAX_WAIT_TIME_MS
Member Author:

Increased the wait time here, too.

}

private Set<Long> keysFromInstance(final KafkaStreams streams) throws Exception {
final ReadOnlyKeyValueStore<Long, Long> store = getStore(
Member Author:

This is another fix (we did see some errors when getting the state stores, too).
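
For context, a minimal sketch of what such a helper can look like when reading all keys hosted by one instance via interactive queries. The store name "queryableStore" and the plain streams.store() call are placeholders; the actual test uses a retrying getStore() utility (as shown in the diff above) to tolerate transient "store not ready" errors.

import java.util.HashSet;
import java.util.Set;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

// Sketch: collect all keys currently hosted by this KafkaStreams instance.
private Set<Long> keysFromInstance(final KafkaStreams streams) {
    final ReadOnlyKeyValueStore<Long, Long> store = streams.store(
        StoreQueryParameters.fromNameAndType("queryableStore", QueryableStoreTypes.keyValueStore())
    );
    final Set<Long> keys = new HashSet<>();
    try (final KeyValueIterator<Long, Long> it = store.all()) {
        while (it.hasNext()) {
            keys.add(it.next().key);
        }
    }
    return keys;
}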


@Override
-    public void commitTransaction() throws ProducerFencedException {
+    public void commitTransaction() {
Member Author:

Side cleanup


private static boolean continueConsuming(final int messagesConsumed, final int maxMessages) {
-        return maxMessages <= 0 || messagesConsumed < maxMessages;
+        return maxMessages > 0 && messagesConsumed < maxMessages;
Member Author:

There are cases in which we pass in 0, and for this case the old code looped forever until the timeout hit and the test failed. It seems this logic was wrong from the beginning, and we should stop fetching if maxMessages <= 0 instead of looping forever.
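
To illustrate the zero-records case described above (a sketch only; the two conditions are copied from the diff):

// Old condition: with maxMessages == 0 this is always true, so the consume loop
// never terminates and the test only fails once the wait time expires.
private static boolean continueConsumingOld(final int messagesConsumed, final int maxMessages) {
    return maxMessages <= 0 || messagesConsumed < maxMessages;
}

// New condition: with maxMessages == 0 (or negative) we stop fetching right away.
private static boolean continueConsumingNew(final int messagesConsumed, final int maxMessages) {
    return maxMessages > 0 && messagesConsumed < maxMessages;
}

// continueConsumingOld(0, 0) -> true   (keep polling until the timeout hits)
// continueConsumingNew(0, 0) -> false  (stop immediately)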

@ableegoldman (Member) left a comment:

Nice work! Seems like the underlying problem here was just that the task assignments weren't as predictable as we thought?

Had a few minor questions, but overall it makes sense, if I remember how this test works.


final long potentiallyFirstFailingKey = keyFilterFirstClient.iterator().next();
cleanKeys.remove(potentiallyFirstFailingKey);
final Set<Long> keysFirstClientAlpha = keysFromInstance(streams1Alpha);
final long firstFailingKeyForCrashCase = keysFirstClientAlpha.iterator().next();
Member:

Thanks for cleaning up the variable names 🙂

@ableegoldman (Member) left a comment:

LGTM, glad we finally have this sorted out (and that it wasn't a real bug)

@guozhangwang (Contributor) left a comment:

LGTM.

BTW should we re-enable this test in the same PR?

// p-1: 10 rec + C + 5 rec + A + 5 rec + C + 5 rec + C ---> 10 rec + A + 10 rec + C
// p-2: 10 rec + C + 5 rec + C + 5 rec + A + 5 rec + C ---> 10 rec + C
// p-3: 10 rec + C + 5 rec + C + 5 rec + A + 5 rec + C ---> 10 rec + C
// p-0: 10 rec + C + 4 rec + A + 5 rec + C + 5 rec + C ---> 10 rec + A + 10 rec + C
Contributor:

Are these changes intentional?

Member Author:

Yes. I wanted to improve the readability of the comment -- the additional blanks separate the main phases of the test (each main phase writes 10 records per partition that should eventually be committed).


// 7. only for crash case:
-    // 7a. restart the second client in eos-alpha mode and wait until rebalance stabilizes
+    // 7a. restart the failed second client in eos-alpha mode and wait until rebalance stabilizes
Contributor:

nit: second failed client?

Member:

I think "failed second client" is correct. It's the 2nd client, which has failed, not the 2nd client to have failed (English is confusing 😣 )



mjsax commented Dec 9, 2020

BTW should we re-enable this test in the same PR?

The test is enabled... But the test failed on the 2.6 branch PR -- seems there is still something going on.

mjsax merged commit 567a2ec into apache:trunk on Dec 11, 2020
mjsax deleted the kafka-10017-eos-upgrade-test branch on December 11, 2020 01:34
mjsax added a commit that referenced this pull request Dec 11, 2020
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Guozhang Wang <guozhang@confluent.io>

mjsax commented Dec 11, 2020

Merged to trunk and cherry-picked to 2.7.
