Skip to content

KAFKA-8602: Fix bug in stand-by task creation#7008

Merged
bbejeck merged 12 commits intoapache:trunkfrom
cadonna:AK8602-bug_standby_task_creation
Jul 15, 2019
Merged

KAFKA-8602: Fix bug in stand-by task creation#7008
bbejeck merged 12 commits intoapache:trunkfrom
cadonna:AK8602-bug_standby_task_creation

Conversation

@cadonna
Copy link
Copy Markdown
Member

@cadonna cadonna commented Jun 27, 2019

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@cadonna
Copy link
Copy Markdown
Member Author

cadonna commented Jun 27, 2019

Call for review: @mjsax @bbejeck @vvcephei @abbccdda @guozhangwang @ableegoldman
The fix needs to be cherry-picked back to 1.0.

Copy link
Copy Markdown
Member

@mjsax mjsax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. Couple of minor comments.

streamsConfiguration.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.Integer().getClass());
streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.Integer().getClass());
streamsConfiguration.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
streamsConfiguration.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 1);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: no need to overwrite; 1 is the default anyway

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.Integer().getClass());
streamsConfiguration.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
streamsConfiguration.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 1);
streamsConfiguration.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

}

@Test
public void shouldNotCreateAnyStandByTasksForStateStoreWithLoggingDisabled() throws InterruptedException {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: simplify InterruptedException -> Exception

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack


@Override
public KeyValue<Integer, Integer> transform(final Integer key, final Integer value) {
state.putIfAbsent(key, value);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We we actually need to use the store? Seems we can remove this code?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

@BeforeClass
public static void createTopics() throws InterruptedException {
CLUSTER.createTopic(INPUT_TOPIC, 2, 1);
CLUSTER.createTopic(OUTPUT_TOPIC, 2, 1);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need an output topic? We never use it to consume a result?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

public KeyValue<Integer, Integer> transform(final Integer key, final Integer value) {
state.putIfAbsent(key, value);
final KeyValue<Integer, Integer> result = new KeyValue<>(key, value);
return result;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can just return null ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

final KafkaStreams client1 = new KafkaStreams(topology, streamsConfiguration());
final KafkaStreams client2 = new KafkaStreams(topology, streamsConfiguration());

final boolean[] client1IsOk = {false}; // has to be a final array, otherwise flag cannot be modified in lambda
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove comment (we would make it final in any case anyway).

I am wondering if it should be volatile thought, as the state listener callback is used by a different thread?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the main point of the comment is that we need to use an array of booleans instead of a plain boolean variable, because otherwise I cannot declare it final and modify it within the lambda.

Since I used a volatile variable now (good point, btw), the array is not needed anymore. Hence, I removed the comment.

client1IsOk[0] = true;
}
});
final boolean[] client2IsOk = {false}; // has to be a final array, otherwise flag cannot be modified in lambda
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

TestUtils.waitForCondition(
() -> client1IsOk[0] && client2IsOk[0],
30 * 1000,
"At least one client is not in state RUNNING or has a stand-by task");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we split both conditions? Maybe only wait for state RUNNING and check localThreadsMetadata after the wait condition?

Copy link
Copy Markdown
Member Author

@cadonna cadonna Jun 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried it and the test became flaky. The reason is that without fix the client is in RUNNING when the IllegalStateException is thrown, then it changes to ERROR, PENDING_SHUTDOWN, and finally NOT_RUNNING. It could happen that the wait condition gets satisfied when the client is in RUNNING. Then when the tests verifies localThreadMetadata two scenario can occur:

  1. client is still in RUNNING: localThreadMetadata contains a stand-by task -> assertion not satisfied
  2. client is not in RUNNING: exception is thrown because localThreadMetadata can only be accessed when the client is in RUNNING.

Both scenarios would let the test fail which would be OK, but not really clean. However, the test was sometimes green, which I currently do not understand. My guess would be race condition.

My approach checks 'localThreadMetadata' when the state changes to RUNNING, which should be safe in the error case. In the good case the test could run into the timeout if start-up is slow, though.

@cadonna
Copy link
Copy Markdown
Member Author

cadonna commented Jun 28, 2019

Retest this, please

cadonna added 4 commits July 1, 2019 10:41
- Verifies that all stream threads of a client do not have assigned
  stand-by tasks. Before only the first stream thread returned by the iterator
  was verified. Now the test is safe to be run with multiple stream threads
  per client.
- Renamed test class to conform with format of `StandbyTask` class
@cadonna
Copy link
Copy Markdown
Member Author

cadonna commented Jul 2, 2019

failures unrelated

Retest this, please

@cadonna
Copy link
Copy Markdown
Member Author

cadonna commented Jul 3, 2019

Retest this, please

client1.setStateListener((newState, oldState) -> {
if (newState == State.RUNNING &&
client1.localThreadsMetadata().stream().allMatch(thread -> thread.standbyTasks().isEmpty())) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra line

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left this extra line on purpose so that it is immediately clear that client1IsOk = true; is not part of the if condition. Since indentation of line 113 and 115 is both four characters, both lines would appear at first sight as if they belonged to same code block even though they do not.

TestUtils.waitForCondition(
() -> client1IsOk && client2IsOk,
30 * 1000,
"At least one client did not reach state RUNNING without any stand-by tasks");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could actually show the client state by:
"Some clients didn't reach state RUNNING without any stand-by tasks. Eventual status: [Client1: {}, Client2: {}]", client1IsOk, client2IsOk

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack


@Test
public void shouldCreateStandbyTask() {
final MockProcessor mockProcessor = new MockProcessor();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could reuse initialization code for L1068 - L1070

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

internalTopologyBuilder.addStateStore(new MockKeyValueStoreBuilder("myStore", true), "processor1");
final StreamThread.StandbyTaskCreator standbyTaskCreator = createStandbyTaskCreator(internalTopologyBuilder);

final StandbyTask standbyTask = standbyTaskCreator.createTask(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The standby task could also be reused IIUC

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

@cadonna
Copy link
Copy Markdown
Member Author

cadonna commented Jul 4, 2019

Retest this, please

1 similar comment
@cadonna
Copy link
Copy Markdown
Member Author

cadonna commented Jul 5, 2019

Retest this, please

@cadonna
Copy link
Copy Markdown
Member Author

cadonna commented Jul 5, 2019

failures unrelated

Retest this, please

Copy link
Copy Markdown
Contributor

@vvcephei vvcephei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @cadonna , thanks for the fix! Just one quick question...

final ProcessorTopology topology = builder.build(taskId.topicGroupId);

if (!topology.stateStores().isEmpty()) {
if (!topology.stateStores().isEmpty() && !topology.storeToChangelogTopic().isEmpty()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this still work for optimized source tables, which read from the input topic instead?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Thinking about this, StandBys for source-KTables might have been broker for a long time already... (maybe since 0.10.0.0???)

Maybe @cadonna can verify? If that is the case, we should split out a separate ticket and PR to fix StandBys for source-KTables independently.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good point! Added an integration test to verify materialized and optimized source tables.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cannot follow. You test seem to use the PAPI, and the PAPI does not provide the KTable optimization. You would need to use StreamBuilder#table() to test the changelog optimization.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See StandbyTaskCreationIntegrationTest line 128.

@cadonna
Copy link
Copy Markdown
Member Author

cadonna commented Jul 9, 2019

Retest this, please

@cadonna
Copy link
Copy Markdown
Member Author

cadonna commented Jul 10, 2019

Failures unrelated
Retest this, please

Copy link
Copy Markdown
Contributor

@vvcephei vvcephei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @cadonna !

final Properties streamsConfiguration1 = streamsConfiguration();
streamsConfiguration1.put(StreamsConfig.TOPOLOGY_OPTIMIZATION, StreamsConfig.OPTIMIZE);
final Properties streamsConfiguration2 = streamsConfiguration();
streamsConfiguration2.put(StreamsConfig.TOPOLOGY_OPTIMIZATION, StreamsConfig.OPTIMIZE);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not that important, but are these two Properties objects exactly the same?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, they are not because of streamsConfiguration.put(StreamsConfig.STATE_DIR_CONFIG, TestUtils.tempDirectory(applicationId).getPath());. Each instance of the Properties has a different state directory, which is good because the test does not work otherwise. I tried it out.

Copy link
Copy Markdown
Member

@bbejeck bbejeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix @cadonna LGTM

@bbejeck
Copy link
Copy Markdown
Member

bbejeck commented Jul 10, 2019

Retest this, please

@cadonna cadonna closed this Jul 11, 2019
@cadonna
Copy link
Copy Markdown
Member Author

cadonna commented Jul 11, 2019

Reopen to trigger build

@cadonna cadonna reopened this Jul 11, 2019
@cadonna
Copy link
Copy Markdown
Member Author

cadonna commented Jul 11, 2019

Retest this, please

@bbejeck
Copy link
Copy Markdown
Member

bbejeck commented Jul 11, 2019

retest this please

@cadonna cadonna closed this Jul 12, 2019
@cadonna cadonna reopened this Jul 12, 2019
@bbejeck
Copy link
Copy Markdown
Member

bbejeck commented Jul 15, 2019

3 green builds, failing 4th build is from disabling the previous GitHub PR builder in favor of a new one. There were some errors with the new PR builder, so the previous GitHub PR builder has been re-enabled, so merging this PR.

@bbejeck bbejeck merged commit 528e5c0 into apache:trunk Jul 15, 2019
@bbejeck
Copy link
Copy Markdown
Member

bbejeck commented Jul 15, 2019

Merged #7008 into trunk

bbejeck added a commit that referenced this pull request Jul 16, 2019
This PR is from the original work by @cadonna in #7008. Due to incompatible changes in trunk that should not get cherry-picked back, a separate PR is required for this bug fix

Reviewers:  Matthias J. Sax <mjsax@apache.org>
bbejeck added a commit that referenced this pull request Jul 18, 2019
This PR is from the original work by @cadonna in #7008. Due to incompatible changes in trunk that should not get cherry-picked back, a separate PR is required for this bug fix

Reviewers:  Matthias J. Sax <mjsax@apache.org>
bbejeck added a commit that referenced this pull request Jul 18, 2019
This PR is from the original work by @cadonna in #7008. Due to incompatible changes in trunk that should not get cherry-picked back, a separate PR is required for this bug fix

Reviewers:  Matthias J. Sax <mjsax@apache.org>
cadonna pushed a commit to cadonna/kafka that referenced this pull request Aug 1, 2019
Backports bugfix in standby task creation from PR apache#7008.
A separate PR is needed because some tests in the original PR
use topology optimizations and mocks that were introduced afterwards.
cadonna added a commit to cadonna/kafka that referenced this pull request Aug 1, 2019
Backports bugfix in standby task creation from PR apache#7008.
A separate PR is needed because some tests in the original PR
use topology optimizations and mocks that were introduced afterwards.
cadonna added a commit to cadonna/kafka that referenced this pull request Aug 1, 2019
Backports bugfix in standby task creation from PR apache#7008.
A separate PR is needed because some tests in the original PR
use topology optimizations and mocks that were introduced afterwards.
cadonna added a commit to cadonna/kafka that referenced this pull request Aug 1, 2019
Backports bugfix in standby task creation from PR apache#7008.
A separate PR is needed because some tests in the original PR
use topology optimizations and mocks that were introduced afterwards.
@bbejeck
Copy link
Copy Markdown
Member

bbejeck commented Aug 1, 2019

cherry-picked to 2.3 via #7092
and 7092 was cherry-picked to 2.2 and 2.1

guozhangwang pushed a commit that referenced this pull request Aug 7, 2019
Backports bugfix in standby task creation from PR #7008.
A separate PR is needed because some tests in the original PR
use topology optimizations and mocks that were introduced afterwards.

Reviewers: Bill Bejeck <bill@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
xiowu0 pushed a commit to linkedin/kafka that referenced this pull request Aug 22, 2019
…ache#7092)

TICKET = KAFKA-8602
LI_DESCRIPTION =
EXIT_CRITERIA = HASH [011ec51]
ORIGINAL_DESCRIPTION =

This PR is from the original work by @cadonna in apache#7008. Due to incompatible changes in trunk that should not get cherry-picked back, a separate PR is required for this bug fix

Reviewers:  Matthias J. Sax <mjsax@apache.org>
(cherry picked from commit 011ec51)
@cadonna cadonna deleted the AK8602-bug_standby_task_creation branch October 21, 2019 11:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants