KAFKA-7944: Improve Suppress test coverage#6382
guozhangwang merged 6 commits into apache:trunk from vvcephei:KAFKA-7944-suppress-window-test
Conversation
* add a normal windowed suppress with short windows and a short grace period
* improve the smoke test so that it actually verifies the intended conditions
vvcephei
left a comment
@guozhangwang @mjsax @bbejeck @ableegoldman ,
In response to a mailing list message reporting that they still saw a problem with Suppress after the fix, I bumped up the priority of KAFKA-7944. (The topology under test here is similar to the reporter's.)
I also took the time to double-check that we weren't getting a false negative (the test passing without the intended verification), and learned that we could actually get false negatives for a number of reasons:
- we could process the entire data set before the first bounce.
- sometimes, the node that we bounce hadn't actually processed anything (it was all assigned to the other node).
I also wanted to verify the smoke test app with EOS.
Since I took the time to refactor the smoke_test system test to achieve these goals, and there was a large overlap with the bounce test, I also unified the tests by adding the crash parameter to the smoke test.
I requested a review from you all so that we can get as much scrutiny on this system test as possible. Trying not to create a new flaky test here...
final ArrayList<SmokeTestClient> clients = new ArrayList<>();
CLUSTER.createTopics(SmokeTestDriver.topics());
IntegrationTestUtils.cleanStateBeforeTest(CLUSTER, SmokeTestDriver.topics());
More stable and future-proof test setup code. I needed this because I added a second test here, which I've since removed.
Leaving this change, though, so we won't have to waste time debugging again next time we try to add a test.
public static Collection<Object[]> parameters() {
    return Arrays.asList(new Object[] {false}, new Object[] {true});
}
final KafkaStreams streamsClient = new KafkaStreams(build, getStreamsConfig(props));
streamsClient.setStateListener((newState, oldState) -> {
-    System.out.printf("%s: %s -> %s%n", name, oldState, newState);
+    System.out.printf("%s %s: %s -> %s%n", name, Instant.now(), oldState, newState);
I added this while debugging the smoke test. Printing the time allows us to correlate events across test nodes and with the test logs.
streamify(smallWindowSum, "sws-raw");
streamify(smallWindowSum.suppress(untilWindowCloses(BufferConfig.unbounded())), "sws-suppressed");
Added a "small window sum" stream to the topology, which does a "normal" windowed computation. The exact data production timing is non-deterministic, so we can't verify the results themselves, but we can verify that there is exactly one result per windowed key in the suppressed stream.
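For illustration, the "exactly one result per windowed key" property could be checked like this (a hypothetical Python helper, not the actual Java verification code in SmokeTestDriver):

```python
from collections import Counter

def verify_suppressed_stream(records):
    # records: (windowed_key, value) pairs read from the suppressed
    # output topic. Suppression with untilWindowCloses should emit
    # exactly one final result per windowed key, so any key seen
    # more than once indicates a bug. Returns the offending keys.
    counts = Counter(key for key, _ in records)
    return {key: n for key, n in counts.items() if n > 1}
```

An empty result means the check passed; duplicated windowed keys (and their counts) are returned for the failure message.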
public class SmokeTestDriver extends SmokeTestUtil {
-    private static final String[] TOPICS = new String[] {
+    private static final String[] TOPICS = {
Apparently this is unnecessary now?
processor2.start()
monitor3.wait_until('REBALANCING -> RUNNING',
                    timeout_sec=120,
                    err_msg="Never saw 'REBALANCING -> RUNNING' message " + str(processor2.node.account)
Again, wait for the node to start. This isn't strictly necessary, since we're just going to wait for "processed" next, but I thought it made the tests more readable and debuggable. For example, when you get a test failure, you'll know whether the node failed to start or whether it started, but didn't process anything.
Not a comment: I actually think it is indeed necessary to avoid flakiness, remember @bbejeck once talked about it in an older PR?
I think that was a different situation, like maybe we stopped and then re-started a node? In that case, you need to grab a new monitor after you stop and before you re-start, to be sure your grep won't match the "started" message from the first time it was running.
Since we're querying the same monitor here in both blocks, we're grepping over the same range of the file. The monitor pins the start point for grepping when you create it. For example, since we create this monitor before we start Streams, it uses "byte 1" as the starting point for all greps, both for REBALANCING -> RUNNING and for "processed".
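A minimal sketch of that pinning behaviour (a hypothetical in-memory stand-in for the ducktape log monitor, which actually greps a remote file over ssh):

```python
import io

class LogMonitor:
    # Hypothetical stand-in for the system-test log monitor: the
    # start point is pinned when the monitor is created, so every
    # later search scans the file from that same offset forward.
    def __init__(self, log):
        self.log = log
        self.start = log.tell()  # pin the grep start point now

    def saw(self, pattern):
        pos = self.log.tell()
        self.log.seek(self.start)
        found = any(pattern in line for line in self.log)
        self.log.seek(pos)  # restore the writer's position
        return found

# A monitor created *before* the app starts pins offset 0 and sees
# both messages; one created afterwards sees neither.
log = io.StringIO()
early = LogMonitor(log)
log.write("REBALANCING -> RUNNING\n")
log.write("processed 100 records\n")
late = LogMonitor(log)
assert early.saw("REBALANCING -> RUNNING") and early.saw("processed")
assert not late.saw("REBALANCING -> RUNNING")
```

This is why the same monitor can safely serve both the rebalance check and the later "processed" check: both greps cover the same pinned range.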
)
# make sure we're not already done processing (which would invalidate the test)
self.driver.node.account.ssh("! grep 'Result Verification' %s" % self.driver.STDOUT_FILE, allow_fail=False)
Again, before bouncing the node, make sure the verification isn't already done, which would be pointless.
-    self.processor2.stop()
-    self.processor3.stop()
-    self.processor4.stop()
+    processor3.stop()
The test is already over. We can stop the app gracefully.
-    node = self.driver.node
-    node.account.ssh("grep SUCCESS %s" % self.driver.STDOUT_FILE, allow_fail=False)
+    if crash and not eos:
+        self.driver.node.account.ssh("grep -E 'SUCCESS|PROCESSED-MORE-THAN-GENERATED' %s" % self.driver.STDOUT_FILE, allow_fail=False)
If we crash without EOS, we might process some duplicates (this is the whole point of EOS), so Streams is operating properly if it either does exactly the right thing or processes too many records.
if crash and not eos:
    self.driver.node.account.ssh("grep -E 'SUCCESS|PROCESSED-MORE-THAN-GENERATED' %s" % self.driver.STDOUT_FILE, allow_fail=False)
else:
    self.driver.node.account.ssh("grep SUCCESS %s" % self.driver.STDOUT_FILE, allow_fail=False)
In all other cases, we expect to get exactly the right result.
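The pass criteria across the two branches could be summarized like this (a hypothetical helper, not part of the actual test code):

```python
def acceptable_outcomes(crash, eos):
    # A crash without EOS runs under at-least-once semantics, so
    # reprocessing some duplicates is still correct behaviour; every
    # other combination must match the generated data exactly.
    if crash and not eos:
        return ["SUCCESS", "PROCESSED-MORE-THAN-GENERATED"]
    return ["SUCCESS"]
```

The grep -E pattern in the test is just the `|`-joined form of this list.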
System tests:
\cc @ableegoldman for review
public static Map<String, Set<Integer>> generate(final String kafka,
                                                 final int numKeys,
                                                 final int maxRecordsPerKey,
                                                 final Duration timeToSpend,
class StreamsSmokeTestEOSJobRunnerService(StreamsSmokeTestBaseService):
    def __init__(self, test_context, kafka):
        super(StreamsSmokeTestEOSJobRunnerService, self).__init__(test_context, kafka, "process-eos")
I thought there were already existing EOS tests that use the StreamsSmokeTest?
processor2.stop_nodes(not crash)
processor3.start()
Don't we want to at least verify that processor3 starts successfully here?
We could wait on that; it might make the test more readable. But it's not strictly necessary, since if it doesn't start, then we won't finish processing.
I'll add it for readability.
Done, it occurred to me during the change that it also has the side benefit of making sure that the third processor does process some events.
@SuppressWarnings("DynamicRegexReplaceableByCompiledPattern")
Ah, oops. I'll remove it.
public static Map<String, Set<Integer>> generate(final String kafka,
                                                 final int numKeys,
                                                 final int maxRecordsPerKey,
                                                 final Duration timeToSpend,
hey @vvcephei I'm wondering if we can simplify the code further: with timeToSpend now, can we just remove autoTerminate?

- SmokeTestDriverIntegrationTest.java: we are sending 1000 * 10 = 10K records only, so likely the driver will stop even before the second instance can be started, similar to the issue you observed in the system test. I'd suggest we just set timeToSpend to 10 seconds so we have enough time to start up to 10 / 1 (sleep time) = 10 instances.
- As for StreamsSmokeTest itself, we only disableAutoTerminate in three of the StreamsUpgradeTest cases, since we do not need to verify the number of records sent / received but only check that the upgrade completed. We can also use 30 seconds (not sure if it is sufficient, but we can get from our nightlies on average how much time those three tests take), and even if the test completes before that, it will call driver.stop to force-stop the driver anyway.
Of course we can then remove the whole disableAutoTerminate thing from StreamsSmokeTest.
import time
class StreamsBounceTest(KafkaTest):
// this starts the stream processing app
new SmokeTestClient(UUID.randomUUID().toString()).start(streamsProperties);
break;
case "process-eos":
Can we use StreamsEosTest instead? See my other comment below.
To my other comment above: what's the difference between running StreamsSmokeTestBaseService#process-eos vs. StreamsEosTestBaseService#process? The former uses the StreamsSmokeTest client while the latter uses the StreamsEosTest client. Can we just consolidate them?
Both Java 8 and Java 11 failed, but the build results were already cleaned up. Retest this, please.
Jenkins threw some crazy exception during the system tests. Re-running.
System tests passed: http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2019-03-07--001.1552001126--vvcephei--KAFKA-7944-suppress-window-test--48dabcd/report.html
Reported unrelated failure: https://issues.apache.org/jira/browse/KAFKA-7965
Retest this, please.
Test results are unavailable. Retest this, please.
Thanks for the response, @guozhangwang. I've created https://issues.apache.org/jira/browse/KAFKA-8080 to follow up.
Test results are unavailable. Retest this, please.
Thanks, @guozhangwang!
* warn-apache-kafka/trunk: (41 commits)
  MINOR: Avoid double null check in KStream#transform() (apache#6429)
  KAFKA-7944: Improve Suppress test coverage (apache#6382)
  KAFKA-3522: add missing guards for TimestampedXxxStore (apache#6356)
  MINOR: Change Trogdor agent's cleanup executor to a cached thread pool (apache#6309)
  KAFKA-7976; Update config before notifying controller of unclean leader update (apache#6426)
  KAFKA-7801: TopicCommand should not be able to alter transaction topic partition count
  KAFKA-8091; Wait for processor shutdown before testing removed listeners (apache#6425)
  MINOR: Update delete topics zk path in assertion error messages
  KAFKA-7939: Fix timing issue in KafkaAdminClientTest.testCreateTopicsRetryBackoff
  KAFKA-7922: Return authorized operations in Metadata request response (KIP-430 Part-2)
  MINOR: Print usage when parse fails during console producer
  MINOR: fix Scala compiler warning (apache#6417)
  KAFKA-7288; Fix check in SelectorTest to wait for no buffered bytes (apache#6415)
  KAFKA-8065: restore original input record timestamp in forward() (apache#6393)
  MINOR: cleanup deprectaion annotations (apache#6290)
  KAFKA-3522: Add TimestampedWindowStore builder/runtime classes (apache#6173)
  KAFKA-8069; Fix early expiration of offsets due to invalid loading of expire timestamp (apache#6401)
  KAFKA-8070: Increase consumer startup timeout in system tests (apache#6405)
  KAFKA-8040: Streams handle initTransactions timeout (apache#6372)
  KAFKA-7980 - Fix timing issue in SocketServerTest.testConnectionRateLimit (apache#6391)
  ...
* add a normal windowed suppress with short windows and a short grace period
* improve the smoke test so that it actually verifies the intended conditions

See https://issues.apache.org/jira/browse/KAFKA-7944

Reviewers: Bill Bejeck <bill@confluent.io>, Guozhang Wang <guozhang@confluent.io>
After apache#6382, the system test streams_eos_test.py is redundant. As in apache#20718, the verification logic has already been migrated, so we only need to delete the related system tests.

Reviewers: Matthias J. Sax <matthias@confluent.io>
Committer Checklist (excluded from commit message)