KAFKA-9832: extend Kafka Streams EOS system test #8440
Conversation
This is actually a bug fix: StandbyTasks did not set the eos flag to true for eos-beta and thus did not wipe out their stores in case of failure.
This is actually a bug fix: consumedOffsetsAndMetadataPerTask could be empty if only standby tasks (but no active tasks) are assigned to a thread.
Could you elaborate more on why committing an empty map will fail?
If we only have StandbyTasks assigned, the RecordCollector would not be initialized and thus the KafkaProducer would not initialize transactions; hence, the offset commit would fail because we cannot begin a new transaction.
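For illustration, a minimal sketch of the resulting guard, assuming hypothetical helper and parameter names (this is not the actual Streams code):

```java
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerGroupMetadata;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.common.TopicPartition;

// Hypothetical helper, not the actual Streams code: commit consumed offsets
// as part of the current transaction, but skip entirely for a standby-only
// thread whose producer never called initTransactions().
final class TxOffsetCommit {
    static <K, V> void maybeCommitTransaction(final Producer<K, V> producer,
                                              final ConsumerGroupMetadata groupMetadata,
                                              final Map<TopicPartition, OffsetAndMetadata> offsets) {
        if (offsets.isEmpty()) {
            // nothing was consumed by an active task; starting a transaction
            // now would fail because transactions were never initialized
            return;
        }
        producer.sendOffsetsToTransaction(offsets, groupMetadata);
        producer.commitTransaction();
    }
}
```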
We now set the processing guarantee externally, via the system test properties file.
Could you elaborate a bit? The comment doesn't seem readable.
Comment says:
// increase commit interval to make sure a client is killed having an open transaction
If we commit with a small commit interval, the probability is high that there is no pending transaction when we kill the instance. We delay starting a new transaction until the first send() and would commit quickly afterwards. If we "stall" in between waiting for new data (which is not uncommon in this test), there will be no open transaction for some time.
During debugging I did some segment dumps and could not find a single aborted transaction.
Does this make sense?
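For illustration, a hedged sketch of the relevant test properties (the config keys are real Streams configs; the interval value is an assumption, not taken from the actual test):

```java
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

// Illustrative only: a larger commit interval makes it likely that a killed
// client still has an open transaction in flight. The interval value is an
// assumption, not the value used by the actual test.
final Properties props = new Properties();
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_BETA);
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 60_000L);
```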
kk, I think I get the motivation, but "a client is killed having an open transaction" still does not read as a full sentence here. Maybe we should say "a killed client has an open transaction"?
The previous shutdown hook did not wait until the "main loop" breaks and exits. Hence, the code after the loop was never executed, making debugging harder. We introduce the terminated flag to delay the termination of the JVM until the method has finished.
This is the "main loop" mentioned above.
This is the code after the "main loop" that was never executed.
We use a try-catch to set the flag, to make sure the shutdown hook can exit quickly even in case of failure.
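A minimal sketch of the shutdown-hook pattern described above, with illustrative names (not the actual test client code):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the pattern: the hook waits until the main method
// has finished so the post-loop debugging code actually runs.
public final class EosTestClient {
    private static final AtomicBoolean IS_RUNNING = new AtomicBoolean(true);
    private static final CountDownLatch TERMINATED = new CountDownLatch(1);

    public static void main(final String[] args) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            IS_RUNNING.set(false);
            try {
                // delay JVM termination until the main loop has exited
                // and the code after the loop has executed
                TERMINATED.await();
            } catch (final InterruptedException swallow) {
                Thread.currentThread().interrupt();
            }
        }));

        try {
            while (IS_RUNNING.get()) {
                // "main loop": poll and process records, etc.
            }
            // code after the "main loop": flush, close, log final state
        } finally {
            // set the flag even on failure so the hook can exit quickly
            TERMINATED.countDown();
        }
    }
}
```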
To make the verification step work, we first need to check that all transactions are finished. With eos-alpha we never had pending transactions that would eventually be aborted by the tx-coordinator: while we crash some instances in between, the final shutdown phase is always clean, and hence all pending transactions would be aborted by initTransactions() calls.
For eos-beta, however, threads that are killed leave open transactions that will eventually be expired by the tx-coordinator, because we generate a new transactional.id (also on restart of a thread).
Having no pending transactions is a requirement for the following code to do a correct verification of the result.
Just increasing the wait time as a small side improvement, to spin less and reduce the output for debugging.
Because we now do the verification for pending transactions first, we have one additional record that is not part of the result and that we need to exclude (similarly below).
The logic looks fragile when partitionRecords is empty. For all the -1 cases, we could add one more dummy record to the array being checked, or just remove the last element from the derived array, so that we can keep the same verification.
I'm wondering if we have to produce another record in verifyAllTransactionFinished; for example, could we just check the end value of the newly added offsets map maintained by the producer? If yes, then we can remove this extra logic here and below.
The logic looks fragile when the partitionRecords is empty

partitionRecords would never be empty; it would at least contain the "dummy" record that we wrote previously. (We could of course strip the dummy record somewhere else; I picked this solution because it required the fewest lines of code to change.)

for example, could we just check the end value of the newly added offsets map maintained by producer?

Well, we can only maintain this new offsets map if we write those dummy records. If we don't write anything, the producer would not put anything into the map and it would stay empty. (Or do you refer to the offsets map from the producer that writes the input data? That would not help, because the generate() and verify() methods are not executed in the same JVM -- also, we are interested in pending transactions of the repartition and output topics; for input topics, there are no transactions.)

However, maybe we could use two consumers (and get rid of the producer): one in read_uncommitted mode to get the endOffset, and a second one in read_committed mode that also gets the end offsets in a loop. Only if the "read committed" consumer returns the same end offset as the "read uncommitted" consumer do we know that there is no pending transaction.
Thoughts?
Yes, I was thinking about the added offsets map from the generate() function -- you're right, it would not be shared with the other JVM. Bummer.
I think using two consumers is slightly better than using a producer to write a dummy record.
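A hedged sketch of the two-consumer idea, using only public consumer APIs (the class and method names are made up):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

// Hypothetical sketch: there is no pending transaction once a read_committed
// consumer can reach the end offsets that a read_uncommitted consumer reports.
final class PendingTxCheck {
    static boolean allTransactionsFinished(final Properties baseConfig,
                                           final List<TopicPartition> partitions) {
        final Properties uncommitted = new Properties();
        uncommitted.putAll(baseConfig);
        uncommitted.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_uncommitted");

        final Properties committed = new Properties();
        committed.putAll(baseConfig);
        committed.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");

        try (KafkaConsumer<byte[], byte[]> readUncommitted =
                 new KafkaConsumer<>(uncommitted, new ByteArrayDeserializer(), new ByteArrayDeserializer());
             KafkaConsumer<byte[], byte[]> readCommitted =
                 new KafkaConsumer<>(committed, new ByteArrayDeserializer(), new ByteArrayDeserializer())) {

            // log-end offsets, including data of still-open transactions
            final Map<TopicPartition, Long> logEnd = readUncommitted.endOffsets(partitions);

            readCommitted.assign(partitions);
            readCommitted.seekToEnd(partitions);

            // in read_committed mode, position() is bounded by the last stable
            // offset, so it only reaches `logEnd` once no transaction is pending
            return partitions.stream()
                             .allMatch(tp -> readCommitted.position(tp) >= logEnd.get(tp));
        }
    }
}
```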
As there might be pending transactions, we need to improve the way we verify that all transactions are finished. For this, we need to remember the offsets of our "topic end marker" messages.
See my question above: could we just rely on the last values of the maintained offsets map?
Instead of looking for the end-marker message by content (ie, comparing key and value), we now use the offset (which we now know) to check whether we can reach endOffset() as expected in "read_committed" mode.
seekToEnd() will only reach the end marker in read_committed mode if there is no pending transaction.
Strictly, position should be exactly endMarkerOffset + 1 -- it seems ok to just check for >
Why do we want to relax this check here?
No reason. I can make it strict, too.
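For reference, a small sketch of the strict variant (the helper and variable names are illustrative, not the actual test code):

```java
import java.util.Collections;

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.common.TopicPartition;

// Illustrative helper: with a read_committed consumer, seekToEnd() can only
// pass the end marker once no transaction is pending, so the strict
// expectation is exactly endMarkerOffset + 1.
final class EndMarkerCheck {
    static void verifyEndMarkerReached(final Consumer<?, ?> consumer,
                                       final TopicPartition partition,
                                       final long endMarkerOffset) {
        consumer.seekToEnd(Collections.singletonList(partition));
        final long position = consumer.position(partition);
        if (position != endMarkerOffset + 1) {
            throw new IllegalStateException("pending transaction on " + partition
                + ": position " + position + " != expected " + (endMarkerOffset + 1));
        }
    }
}
```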
For debugging purposes, we now also track the smallest and largest processed offset. This helps to understand which task processed which part of the data during which phase.
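A tiny sketch of what such tracking could look like (all names are assumptions, not the actual test code):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.TopicPartition;

// Illustrative sketch: remember the smallest and largest processed offset
// per partition for the debugging output.
final class OffsetRangeTracker {
    private final Map<TopicPartition, long[]> range = new HashMap<>();

    void track(final ConsumerRecord<?, ?> record) {
        final TopicPartition tp = new TopicPartition(record.topic(), record.partition());
        final long[] minMax =
            range.computeIfAbsent(tp, ignored -> new long[] {Long.MAX_VALUE, Long.MIN_VALUE});
        minMax[0] = Math.min(minMax[0], record.offset());
        minMax[1] = Math.max(minMax[1], record.offset());
    }
}
```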
We now set the processing guarantee via the properties file (this allows us to easily parametrize the test).
Build failed. Seems to be fixed via #8447. Can retest after initial reviews.
Force-pushed from 98c1db0 to 8ed9648.
Retriggered a single system test run as "sanity check" for now: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/3898/
Sanity run passed. Triggered another 20: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/3899/
Unit tests are failing.
    .stream()
    .filter(t -> !e.corruptedTaskWithChangelogs().containsKey(t.id()))
    .collect(Collectors.toSet())
);
If we hit a TaskCorruptedException, we know that only a task in restore mode could be affected, and those don't have anything to commit (their commitNeeded flag should be set to false). Hence, we just commit all non-corrupted tasks; afterwards we can safely call handleCorruption(). (If we don't commit first, we might incorrectly abort a pending transaction for eos-beta within handleCorruption().)
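A hedged sketch of the control flow described above (runOnce() and the taskManager calls are modeled on the comment and the diff, not copied from StreamThread):

```java
import java.util.stream.Collectors;

import org.apache.kafka.streams.errors.TaskCorruptedException;

// Hypothetical sketch of the described control flow, not the actual code:
try {
    runOnce();
} catch (final TaskCorruptedException e) {
    // only restoring tasks can be corrupted and they have nothing to commit,
    // so first commit all non-corrupted tasks to preserve their progress ...
    taskManager.commit(
        taskManager.tasks()
                   .values()
                   .stream()
                   .filter(t -> !e.corruptedTaskWithChangelogs().containsKey(t.id()))
                   .collect(Collectors.toSet())
    );
    // ... and only then wipe the corrupted state; otherwise, under eos-beta,
    // handleCorruption() could abort a transaction holding healthy tasks' data
    taskManager.handleCorruption(e.corruptedTaskWithChangelogs());
}
```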
Triggered another 20 runs: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/3903/
}

-   private int commitInternal(final Collection<Task> tasks) {
+   int commitInternal(final Collection<Task> tasks) {
Should we consider adding a unit test here, since this call is externalized?
Good call.
I also need to add a unit test verifying that we actually commit all other tasks if a TaskCorruptedException is thrown. I just wanted to get the suggested fix reviewed first (also tested via a system test run) before I close the unit test gaps.
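For illustration, a sketch of what a minimal test of the filtering logic could look like (the types are stand-ins for the actual Task/TaskId classes, not the real TaskManagerTest code):

```java
import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

import org.junit.Test;

// Hypothetical test sketch: non-corrupted tasks must survive the filter
// and therefore be committed when a TaskCorruptedException is handled.
public class CommitFilterTest {

    @Test
    public void shouldExcludeCorruptedTasksFromCommit() {
        final Set<String> corruptedIds = Collections.singleton("0_1");
        final Set<String> allTaskIds = new HashSet<>(Arrays.asList("0_0", "0_1", "0_2"));

        final Set<String> toCommit = allTaskIds.stream()
                                               .filter(id -> !corruptedIds.contains(id))
                                               .collect(Collectors.toSet());

        assertEquals(new HashSet<>(Arrays.asList("0_0", "0_2")), toCommit);
    }
}
```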
I think under eos-beta, if one task failed fatally, then we do have to close all tasks, since they share a producer. TaskCorrupted is just a special case, since we know it could only be thrown from a restoring task, which hence has nothing to commit.
Thinking about this a bit more, I think the general rule would be: if the failing task is in the RUNNING state, then it is possible that it has already used the shared producer to send some data, which needs to be aborted; hence we have no other choice but to abort the other tasks as collateral damage. If the failing task is in another state (e.g., only RESTORING tasks can throw TaskCorruptedException today), then we know that task has not used the shared producer, and hence we can skip aborting the txn. Does that make sense? @mjsax
@guozhangwang Both your statements make sense. I just have the impression that the logic to follow those patterns is scattered throughout the code base atm. Hence, I would suggest to put this logic into a single place. I hope that the current fix is "good enough" for now to move forward with this PR.
LGTM!
System test branch builder is buggy atm -- just pushed a fix that should make it work -- we might need to revert that fix before merging. Next try for system tests (20 runs): https://jenkins.confluent.io/job/system-test-kafka-branch-builder/3905/
This reverts commit d506697.
Java 8 passed. Retest this please. |
Call for review @abbccdda @guozhangwang
System test run passed: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/3886/