KAFKA-8602: Fix bug in stand-by task creation by cadonna · Pull Request #7008 · apache/kafka

cadonna · 2019-06-27T19:58:48Z

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

cadonna · 2019-06-27T20:43:57Z

Call for review: @mjsax @bbejeck @vvcephei @abbccdda @guozhangwang @ableegoldman
The fix needs to be cherry-picked back to 1.0.

mjsax

Overall LGTM. Couple of minor comments.

mjsax · 2019-06-27T22:23:47Z

+        streamsConfiguration.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.Integer().getClass());
+        streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.Integer().getClass());
+        streamsConfiguration.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
+        streamsConfiguration.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 1);


Nit: no need to overwrite; 1 is the default anyway

mjsax · 2019-06-27T22:23:53Z

+        streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.Integer().getClass());
+        streamsConfiguration.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
+        streamsConfiguration.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 1);
+        streamsConfiguration.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");


mjsax · 2019-06-27T22:24:19Z

+    }
+
+    @Test
+    public void shouldNotCreateAnyStandByTasksForStateStoreWithLoggingDisabled() throws InterruptedException {


nit: simplify InterruptedException -> Exception

mjsax · 2019-06-27T22:26:07Z

+
+                @Override
+                public KeyValue<Integer, Integer> transform(final Integer key, final Integer value) {
+                    state.putIfAbsent(key, value);


Do we actually need to use the store? It seems we can remove this code.

mjsax · 2019-06-27T22:26:56Z

+    @BeforeClass
+    public static void createTopics() throws InterruptedException {
+        CLUSTER.createTopic(INPUT_TOPIC, 2, 1);
+        CLUSTER.createTopic(OUTPUT_TOPIC, 2, 1);


Why do we need an output topic? We never use it to consume a result?

mjsax · 2019-06-27T22:27:25Z

+                public KeyValue<Integer, Integer> transform(final Integer key, final Integer value) {
+                    state.putIfAbsent(key, value);
+                    final KeyValue<Integer, Integer> result = new KeyValue<>(key, value);
+                    return result;


Can we just return null?

mjsax · 2019-06-27T22:29:02Z

+        final KafkaStreams client1 = new KafkaStreams(topology, streamsConfiguration());
+        final KafkaStreams client2 = new KafkaStreams(topology, streamsConfiguration());
+
+        final boolean[] client1IsOk = {false}; // has to be a final array, otherwise flag cannot be modified in lambda


nit: remove comment (we would make it final in any case anyway).

I am wondering if it should be volatile though, as the state listener callback is invoked by a different thread?

Here the main point of the comment is that we need to use an array of booleans instead of a plain boolean variable, because otherwise I cannot declare it final and modify it within the lambda.

Since I used a volatile variable now (good point, btw), the array is not needed anymore. Hence, I removed the comment.
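
To illustrate the array-versus-volatile point, here is a minimal sketch in plain Java. The class `VolatileFlagSketch` and its method are hypothetical, and an ordinary thread stands in for the state listener callback. A field, unlike a local variable, can be assigned inside a lambda, and `volatile` makes the write visible to the waiting thread.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

// Hypothetical illustration; none of these names come from the PR.
class VolatileFlagSketch {
    // Written by the callback thread, read by the waiting thread.
    private static volatile boolean clientIsOk = false;

    static boolean runCallbackAndWait() {
        clientIsOk = false;
        // Stand-in for the state listener callback running on another thread.
        final Thread callbackThread = new Thread(() -> clientIsOk = true);
        callbackThread.start();

        // Poll the flag for up to five seconds, similar in spirit to a wait condition.
        final long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
        while (!clientIsOk && System.nanoTime() - deadline < 0) {
            LockSupport.parkNanos(TimeUnit.MILLISECONDS.toNanos(10));
        }
        return clientIsOk;
    }

    public static void main(final String[] args) {
        System.out.println("flag observed: " + runCallbackAndWait());
    }
}
```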

mjsax · 2019-06-27T22:29:44Z

+                client1IsOk[0] = true;
+            }
+        });
+        final boolean[] client2IsOk = {false}; // has to be a final array, otherwise flag cannot be modified in lambda


mjsax · 2019-06-27T22:31:01Z

+        TestUtils.waitForCondition(
+            () -> client1IsOk[0] && client2IsOk[0],
+            30 * 1000,
+            "At least one client is not in state RUNNING or has a stand-by task");


Should we split both conditions? Maybe only wait for state RUNNING and check localThreadsMetadata after the wait condition?

I tried it and the test became flaky. The reason is that, without the fix, the client is in RUNNING when the IllegalStateException is thrown; it then changes to ERROR, PENDING_SHUTDOWN, and finally NOT_RUNNING. The wait condition can therefore be satisfied while the client is still in RUNNING. When the test then verifies localThreadMetadata, two scenarios can occur:

client is still in RUNNING: localThreadMetadata contains a stand-by task -> assertion not satisfied

client is not in RUNNING: exception is thrown because localThreadMetadata can only be accessed when the client is in RUNNING.

Both scenarios would make the test fail, which would be OK but not really clean. However, the test was sometimes green, which I currently do not understand. My guess would be a race condition.

My approach checks 'localThreadMetadata' when the state changes to RUNNING, which should be safe in the error case. In the good case the test could run into the timeout if start-up is slow, though.
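
The idea of checking the metadata at the moment of the transition to RUNNING can be sketched as follows. This is only an illustration: `FakeClient` is a hypothetical stand-in, not the KafkaStreams API, and it assumes, as described above, that metadata can only be read while the client is in RUNNING.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BiConsumer;

class StateListenerSketch {
    enum State { REBALANCING, RUNNING, ERROR }

    // Hypothetical stand-in for a streams client.
    static final class FakeClient {
        private final List<String> standbyTasks;
        private BiConsumer<State, State> listener = (newState, oldState) -> { };
        private State state = State.REBALANCING;

        FakeClient(final List<String> standbyTasks) {
            this.standbyTasks = standbyTasks;
        }

        void setStateListener(final BiConsumer<State, State> listener) {
            this.listener = listener;
        }

        // Assumption taken from the discussion: metadata is only readable in RUNNING.
        List<String> standbyTasks() {
            if (state != State.RUNNING) {
                throw new IllegalStateException("client is not RUNNING");
            }
            return standbyTasks;
        }

        void transitionTo(final State newState) {
            final State oldState = state;
            state = newState;
            listener.accept(newState, oldState);
        }
    }

    // True only if the client had no standby tasks when it entered RUNNING.
    static boolean okAfter(final List<String> standbyTasks, final State... transitions) {
        final FakeClient client = new FakeClient(standbyTasks);
        final AtomicBoolean ok = new AtomicBoolean(false);
        client.setStateListener((newState, oldState) -> {
            if (newState == State.RUNNING && client.standbyTasks().isEmpty()) {
                ok.set(true);
            }
        });
        for (final State next : transitions) {
            client.transitionTo(next);
        }
        return ok.get();
    }

    public static void main(final String[] args) {
        System.out.println(okAfter(Collections.<String>emptyList(), State.RUNNING));
        System.out.println(okAfter(Arrays.asList("0_1"), State.RUNNING, State.ERROR));
    }
}
```

Because the check runs inside the listener, a later move to ERROR cannot turn the metadata access into an exception in the test thread.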

cadonna · 2019-06-28T15:12:22Z

Retest this, please

- Verifies that all stream threads of a client do not have assigned stand-by tasks. Before only the first stream thread returned by the iterator was verified. Now the test is safe to be run with multiple stream threads per client.
- Renamed test class to conform with format of `StandbyTask` class
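
The difference between checking only the first thread and checking all threads can be sketched like this. The class and method names are hypothetical, and each stream thread is modelled simply as the list of its standby task ids.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration: each inner list holds the standby task ids of one stream thread.
class StandbyCheckSketch {
    // Old behaviour: only the first thread returned by the iterator is inspected.
    static boolean firstThreadHasNoStandby(final List<List<String>> standbyTasksPerThread) {
        return standbyTasksPerThread.iterator().next().isEmpty();
    }

    // New behaviour: every thread must be free of standby tasks.
    static boolean noThreadHasStandby(final List<List<String>> standbyTasksPerThread) {
        return standbyTasksPerThread.stream().allMatch(List::isEmpty);
    }

    public static void main(final String[] args) {
        final List<List<String>> threads =
            Arrays.asList(Collections.<String>emptyList(), Arrays.asList("0_1"));
        System.out.println(firstThreadHasNoStandby(threads));
        System.out.println(noThreadHasStandby(threads));
    }
}
```

With two threads where only the second one owns a standby task, the first check passes while the second correctly fails.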

cadonna · 2019-07-02T09:02:59Z

failures unrelated

Retest this, please

cadonna · 2019-07-03T08:38:45Z

Retest this, please

abbccdda · 2019-07-03T17:07:07Z

+        client1.setStateListener((newState, oldState) -> {
+            if (newState == State.RUNNING &&
+                client1.localThreadsMetadata().stream().allMatch(thread -> thread.standbyTasks().isEmpty())) {
+


nit: extra line

I left this extra line on purpose so that it is immediately clear that client1IsOk = true; is not part of the if condition. Since lines 113 and 115 are both indented by four characters, they would appear at first sight to belong to the same code block even though they do not.

abbccdda · 2019-07-03T17:09:04Z

+        TestUtils.waitForCondition(
+            () -> client1IsOk && client2IsOk,
+            30 * 1000,
+            "At least one client did not reach state RUNNING without any stand-by tasks");


We could actually show the client state by:
"Some clients didn't reach state RUNNING without any stand-by tasks. Eventual status: [Client1: {}, Client2: {}]", client1IsOk, client2IsOk
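
One way to build such a message, assuming the wait helper takes a plain string rather than a format string with `{}` placeholders, is `String.format`. The helper below is a hypothetical sketch and not part of the PR.

```java
// Hypothetical helper that renders both client flags into the failure message.
class FailureMessageSketch {
    static String failureMessage(final boolean client1IsOk, final boolean client2IsOk) {
        return String.format(
            "Some clients didn't reach state RUNNING without any stand-by tasks. "
                + "Eventual status: [Client1: %b, Client2: %b]",
            client1IsOk,
            client2IsOk);
    }

    public static void main(final String[] args) {
        System.out.println(failureMessage(true, false));
    }
}
```

Note that a message built this way is evaluated when it is created, so the flags would need to be read at failure time (for example through a supplier) to show the eventual status.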

abbccdda · 2019-07-03T17:10:07Z


+    @Test
+    public void shouldCreateStandbyTask() {
+        final MockProcessor mockProcessor = new MockProcessor();


We could reuse initialization code for L1068 - L1070

abbccdda · 2019-07-03T17:10:56Z

+        internalTopologyBuilder.addStateStore(new MockKeyValueStoreBuilder("myStore", true), "processor1");
+        final StreamThread.StandbyTaskCreator standbyTaskCreator = createStandbyTaskCreator(internalTopologyBuilder);
+
+        final StandbyTask standbyTask = standbyTaskCreator.createTask(


The standby task could also be reused IIUC

cadonna · 2019-07-04T12:16:56Z

Retest this, please

cadonna · 2019-07-05T08:56:56Z

Retest this, please

cadonna · 2019-07-05T12:43:14Z

failures unrelated

Retest this, please

vvcephei

Hey @cadonna , thanks for the fix! Just one quick question...

vvcephei · 2019-07-05T19:26:45Z

            final ProcessorTopology topology = builder.build(taskId.topicGroupId);

-            if (!topology.stateStores().isEmpty()) {
+            if (!topology.stateStores().isEmpty() && !topology.storeToChangelogTopic().isEmpty()) {


Does this still work for optimized source tables, which read from the input topic instead?

Good point. Thinking about this, StandBys for source-KTables might have been broken for a long time already... (maybe since 0.10.0.0???)

Maybe @cadonna can verify? If that is the case, we should split out a separate ticket and PR to fix StandBys for source-KTables independently.

Yeah, good point! Added an integration test to verify materialized and optimized source tables.

Cannot follow. Your test seems to use the PAPI, and the PAPI does not provide the KTable optimization. You would need to use StreamsBuilder#table() to test the changelog optimization.

See StandbyTaskCreationIntegrationTest line 128.
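
The guard from the diff above can be sketched in isolation. This is a simplified illustration with hypothetical names: the topology is reduced to a list of store names and a store-to-changelog map, and it assumes that an optimized source table registers its input topic as the changelog topic of its store.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the condition changed in this PR.
class StandbyGuardSketch {
    // A standby task is only useful if there is a store AND a topic to restore it from.
    static boolean shouldCreateStandbyTask(final List<String> stateStores,
                                           final Map<String, String> storeToChangelogTopic) {
        return !stateStores.isEmpty() && !storeToChangelogTopic.isEmpty();
    }

    public static void main(final String[] args) {
        final List<String> stores = Arrays.asList("myStore");
        // Logging disabled: store exists but has no changelog topic.
        System.out.println(shouldCreateStandbyTask(stores, Collections.<String, String>emptyMap()));
        // Logging enabled: store is backed by a changelog topic (topic name is made up).
        System.out.println(shouldCreateStandbyTask(
            stores, Collections.singletonMap("myStore", "app-myStore-changelog")));
        // Assumed optimized source table: the input topic doubles as the changelog.
        System.out.println(shouldCreateStandbyTask(
            stores, Collections.singletonMap("myStore", "input-topic")));
    }
}
```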

cadonna · 2019-07-09T19:00:27Z

Retest this, please

cadonna · 2019-07-10T10:23:51Z

Failures unrelated
Retest this, please

vvcephei

Thanks, @cadonna !

vvcephei · 2019-07-10T14:36:40Z

+        final Properties streamsConfiguration1 = streamsConfiguration();
+        streamsConfiguration1.put(StreamsConfig.TOPOLOGY_OPTIMIZATION, StreamsConfig.OPTIMIZE);
+        final Properties streamsConfiguration2 = streamsConfiguration();
+        streamsConfiguration2.put(StreamsConfig.TOPOLOGY_OPTIMIZATION, StreamsConfig.OPTIMIZE);


Probably not that important, but are these two Properties objects exactly the same?

No, they are not because of streamsConfiguration.put(StreamsConfig.STATE_DIR_CONFIG, TestUtils.tempDirectory(applicationId).getPath());. Each instance of the Properties has a different state directory, which is good because the test does not work otherwise. I tried it out.
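
Why the two `Properties` objects differ can be shown with a small sketch. It uses plain JDK calls in place of `TestUtils.tempDirectory`, and the class and method names are hypothetical. Each call creates a fresh temporary directory, so two configurations for the same application id end up with different state directories.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.util.Properties;

// Hypothetical sketch; Files.createTempDirectory stands in for the test utility used in the PR.
class PerClientConfigSketch {
    static Properties streamsConfiguration(final String applicationId) {
        final Properties props = new Properties();
        props.put("application.id", applicationId);
        try {
            // A new, unique directory on every call.
            props.put("state.dir", Files.createTempDirectory(applicationId).toString());
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(final String[] args) {
        final Properties first = streamsConfiguration("standby-test");
        final Properties second = streamsConfiguration("standby-test");
        System.out.println(first.get("state.dir").equals(second.get("state.dir")));
    }
}
```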

bbejeck

Thanks for the fix @cadonna LGTM

bbejeck · 2019-07-10T16:20:23Z

Retest this, please

cadonna · 2019-07-11T10:34:03Z

Reopen to trigger build

cadonna · 2019-07-11T14:01:13Z

Retest this, please

bbejeck · 2019-07-11T15:56:47Z

retest this please

bbejeck · 2019-07-15T20:38:24Z

3 green builds. The failing 4th build is due to the previous GitHub PR builder being disabled in favor of a new one. There were some errors with the new PR builder, so the previous one has been re-enabled. Merging this PR.

bbejeck · 2019-07-15T20:40:21Z

Merged #7008 into trunk

@cadonna

This PR is from the original work by @cadonna in #7008. Due to incompatible changes in trunk that should not get cherry-picked back, a separate PR is required for this bug fix Reviewers: Matthias J. Sax <mjsax@apache.org>

Backports bugfix in standby task creation from PR apache#7008. A separate PR is needed because some tests in the original PR use topology optimizations and mocks that were introduced afterwards.

bbejeck · 2019-08-01T18:05:58Z

cherry-picked to 2.3 via #7092,
and #7092 was cherry-picked to 2.2 and 2.1

Backports bugfix in standby task creation from PR #7008. A separate PR is needed because some tests in the original PR use topology optimizations and mocks that were introduced afterwards. Reviewers: Bill Bejeck <bill@confluent.io>, Guozhang Wang <wangguoz@gmail.com>

@cadonna

…ache#7092) TICKET = KAFKA-8602 LI_DESCRIPTION = EXIT_CRITERIA = HASH [011ec51] ORIGINAL_DESCRIPTION = This PR is from the original work by @cadonna in apache#7008. Due to incompatible changes in trunk that should not get cherry-picked back, a separate PR is required for this bug fix Reviewers: Matthias J. Sax <mjsax@apache.org> (cherry picked from commit 011ec51)

KAFKA-8602: Fix bug in stand-by task creation

fd10fad

bbejeck added the streams label Jun 27, 2019

mjsax reviewed Jun 27, 2019

cadonna added 2 commits June 28, 2019 11:12

Merge remote-tracking branch 'upstream/trunk' into AK8602-bug_standby…

c6aa91f

…_task_creation

Incorporate comments

5b1739b

cadonna added 4 commits July 1, 2019 10:41

Merge remote-tracking branch 'upstream/trunk' into AK8602-bug_standby…

256f40c

…_task_creation

Fix checkstyle errors

d27aef7

Merge remote-tracking branch 'upstream/trunk' into AK8602-bug_standby…

6bae8ac

…_task_creation

abbccdda reviewed Jul 3, 2019

cadonna added 2 commits July 3, 2019 22:04

Merge remote-tracking branch 'upstream/trunk' into AK8602-bug_standby…

5ccaf63

…_task_creation

Incorporate review comments

55ab731

vvcephei reviewed Jul 5, 2019

cadonna added 3 commits July 8, 2019 13:18

Merge remote-tracking branch 'upstream/trunk' into AK8602-bug_standby…

06749bf

…_task_creation

Merge remote-tracking branch 'upstream/trunk' into AK8602-bug_standby…

d52c922

…_task_creation

Add integration test for materialized and optimized source table

ae3290c

vvcephei approved these changes Jul 10, 2019

bbejeck approved these changes Jul 10, 2019

cadonna closed this Jul 11, 2019

cadonna reopened this Jul 11, 2019

cadonna closed this Jul 12, 2019

cadonna reopened this Jul 12, 2019

bbejeck merged commit 528e5c0 into apache:trunk Jul 15, 2019

bbejeck mentioned this pull request Jul 15, 2019

KAFKA-8602: Separate PR for 2.3 branch #7092

Merged

3 tasks

cadonna mentioned this pull request Aug 1, 2019

KAFKA-8602: Backport bugfix for standby task creation #7145

Closed

3 tasks

cadonna mentioned this pull request Aug 1, 2019

KAFKA-8602: Backport bugfix for standby task creation #7146

Merged

3 tasks

cadonna mentioned this pull request Aug 1, 2019

KAFKA-8602: Backport bugfix for standby task creation #7147

Merged

3 tasks

cadonna mentioned this pull request Aug 1, 2019

KAFKA-8602: Backport bugfix for standby task creation #7148

Merged

3 tasks

cadonna deleted the AK8602-bug_standby_task_creation branch October 21, 2019 11:40
