KAFKA-8755: Fix state restore for standby tasks with optimized topology #7238
guozhangwang merged 6 commits into apache:trunk from
Conversation
If this all looks good from a design / implementation perspective I'd like to get a few additional unit tests in on the changes too. I'll optimistically start working on those, but I'm happy to dump/rework them if we need to change the implementation.

I've added additional test cases and comments. I'm not anticipating any more changes to this code unless I get review feedback requesting changes.

The RebalanceSourceConnectorsIntegrationTest seems to be flaky - I can sometimes get it to fail with or without this patch locally. The Kafka test failures are below this change; not sure what happened there. The failure of the standby task test in one instance is concerning, though. I want to spend some time digging through the logs to see whether it is related to the Kafka issues, inherent to the test, or, worse, the code. For now, optimistically running another pass to see if the Kafka issues get sorted out while I investigate: retest this please

The failure for JDK 8 / Scala 2.11 looks legit. For some reason we're still not getting the standby task to checkpoint. Possibly another bug lurking that I have not seen locally. Will do more investigation.
|
It turns out there is no standby. Both partitions are getting assigned to the same instance. When the other task joins between PARTITIONS_ASSIGNED and RUNNING the partitions are not reassigned. Something is up in this phase and consequently the standby code is not even being used. |
cadonna
left a comment
Thanks for the PR @cpettitt-confluent.
According to the comment on line 158, the consumer API does not guarantee that endOffset is not null. The javadoc of KafkaConsumer#endOffsets() says that there will be a non-null endOffset for each given partition (at least in my interpretation). @guozhangwang, WDYT?
Please remove comment once this issue is solved.
EndOffsets here should not be null, since if consumer.position timed out it would return early and not arrive here.
Please see my other comment about the awkwardness of remembering offset limit during restoration phase, which I think is avoidable.
The following two tests have a lot of code in common. Please try to extract the common parts into methods.
Is the question:
- Why AtomicInteger instead of a larger type? I'm not too worried about rollover here, but happy to make it long if you prefer.
- Why AtomicInteger instead of AtomicBoolean? I want to make sure we don't make a second consumer.committed() call.
- Why AtomicInteger instead of a primitive type? I want to be able to set the value from a closure. I haven't done a lot of Java coding in the last year or two, so if you know a way to do it, let me know and I will be happy to adopt it.
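The last bullet is the usual Java constraint: local variables captured by a lambda must be effectively final, so a mutable holder such as AtomicInteger is needed to set a value from inside a closure. A minimal, self-contained sketch (all names here are illustrative, not from the PR):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ClosureCounter {
    public static void main(final String[] args) {
        // A plain `int` could not be mutated from inside the lambda below,
        // because captured locals must be effectively final.
        final AtomicInteger committedCalls = new AtomicInteger(0);

        // Stand-in for a callback that should invoke consumer.committed()
        // at most a bounded number of times.
        final Runnable fetchCommitted = committedCalls::incrementAndGet;

        fetchCommitted.run();
        fetchCommitted.run();

        System.out.println("calls = " + committedCalls.get());

        // Alternative without AtomicInteger: a one-element array,
        // e.g. `final int[] calls = {0}; calls[0]++;` -- it also works,
        // but is generally considered less readable.
    }
}
```

This is also why AtomicInteger (rather than AtomicBoolean) lets the code both guard against a second `consumer.committed()` call and count how many calls actually happened.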
Thanks for the review @cadonna! Since my other PR went in it looks like I will need to rebase this one. I will also adopt the changes above where mentioned - I think there was at least one comment I was not sure how to interpret. I also found at least one more failure case for standby state restore that I want to get into a follow-up patch to this PR.
guozhangwang
left a comment
Made a pass over the non-testing code.
I think we only need to refresh the offset limit if !remainingRecords.isEmpty(), meaning that some records could not be applied, to replace the first two conditions.
Also, since this should only happen when the partition is part of sourceChangelogPartitions (otherwise the offset limit should always be MAX_VALUE and the remaining records should always be empty), instead of including it as a condition, we can just check if that is not the case and throw, indicating a bug.
I think we only need to refresh the offset limit if !remainingRecords.isEmpty(), meaning that some records could not be applied, to replace the first two conditions.
@guozhangwang Can do. This says something a little different than the current code. This says if there were any records that could not be played then refresh the consumer committed offset. The current code says refresh the consumer committed offset only if we could not play any of the records. In other words, we may refresh more frequently. Sound good?
Also, since this should only happen when the partition is part of sourceChangelogPartitions (otherwise the offset limit should always be MAX_VALUE and the remaining records should always be empty), instead of including it as a condition, we can just check if that is not the case and throw, indicating a bug.
Makes sense, I will treat this case as a bug and throw.
I ended up not using remainingRecords.isEmpty(), but am open to switching to that. Let me explain why I didn't pick this up. IIUC, the behavior you're looking for is to make the consumer.committed query if any records cannot be applied because the offset limit is too low. The new code does this. However, if I use remainingRecords.isEmpty() I have to do up to two full passes on the records: one to determine that all records are blocked and a second to replay what is no longer blocked. Assuming records are ordered by offset, we can achieve this in one pass as in the updated code. Please let me know if my assumption is incorrect, I misunderstood your goal, or you'd still prefer the two-pass approach.
To simplify this, should we just check if records.isEmpty() in the very beginning and return immediately?
I also think this code may still do two passes, as we have no guarantee that get() will not need to traverse the list anyway. To get a single pass, I would rather piggyback updating the offset limit onto the for loop below (we would change the loop to use a proper Iterator that we advance manually: if we hit the else branch, we try to update the offset limit and retry the current record without advancing the iterator).
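The manually-advanced-iterator idea can be sketched in isolation. This is a hedged toy model, not the Streams code: records are modeled as a list of offsets, and `refreshLimit` stands in for re-reading the consumer committed offset. The limit is refreshed at most once, and the blocked record is retried without advancing the iterator:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.LongSupplier;

public class SinglePassApply {
    // Apply records below the offset limit in a single pass; on the first
    // blocked record, refresh the limit once and retry that same record.
    static List<Long> applyRecords(final List<Long> recordOffsets,
                                   long offsetLimit,
                                   final LongSupplier refreshLimit) {
        final List<Long> applied = new ArrayList<>();
        final Iterator<Long> iter = recordOffsets.iterator();
        Long current = iter.hasNext() ? iter.next() : null;
        boolean refreshed = false;
        while (current != null) {
            if (current < offsetLimit) {
                applied.add(current);                    // below the limit: apply it
                current = iter.hasNext() ? iter.next() : null;
            } else if (!refreshed) {
                offsetLimit = refreshLimit.getAsLong();  // blocked: refresh the limit once
                refreshed = true;                        // ... and retry the same record
            } else {
                break;                                   // still blocked after refresh: stop
            }
        }
        return applied;
    }

    public static void main(final String[] args) {
        // Offsets 0 and 1 apply under limit 2; offset 2 blocks, the limit is
        // refreshed to 3, offset 2 applies, and offset 3 remains blocked.
        final List<Long> applied = applyRecords(List.of(0L, 1L, 2L, 3L), 2L, () -> 3L);
        System.out.println(applied);
    }
}
```

Because records arrive ordered by offset, everything after the first blocked record is also blocked, which is what makes the single pass safe.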
Why do we need this for active stream task? Only standby task need to maintain the offset limits right?
With the other PR merged in, this seems redundant.
Right, I can rebase this away now. I needed it to get the test to pass previously.
This is a response to your previous question (can't comment there for some reason):
Why do we need this for active stream task? Only standby task need to maintain the offset limits right?
This limit was the easiest way I could find to prevent the changelog restorer from restoring beyond the committed offset. IIRC, without this change we end up playing all of the input topic messages to load up the store before even starting to process new records, which gave us that KTable out-of-order error. I believe we were playing all messages again after failover, but I would have to revert to verify.
If there is a better way to prevent the above, I would be happy to adopt it. It was definitely my intention to get rid of end offsets from stream task altogether.
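The capping described above reduces to a one-liner. A hedged sketch, with illustrative names (`changelogEndOffset`, `committedOffsetLimit` are not the actual Streams identifiers): for a plain changelog the limit is `Long.MAX_VALUE`, so the cap is a no-op; for a source topic reused as a changelog under an optimized topology, the committed offset wins:

```java
public class RestoreBound {
    // Restore end point = min(changelog end, consumer committed offset limit).
    static long restoreEndOffset(final long changelogEndOffset,
                                 final long committedOffsetLimit) {
        return Math.min(changelogEndOffset, committedOffsetLimit);
    }

    public static void main(final String[] args) {
        // Plain changelog: no effective limit, restore to the end offset.
        System.out.println(restoreEndOffset(500L, Long.MAX_VALUE));
        // Optimized topology, source topic as changelog: stop at committed offset.
        System.out.println(restoreEndOffset(500L, 100L));
    }
}
```

This is why replaying past the committed offset (bug 3 in the PR description) only bites when the changelog is also an input topic.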
I think this is part of the reason why I'd advocate for removing the standby task's restoration phase altogether, because:
- only standby task needs to maintain offset limit, for stream task the offset limit should always be MAX_VALUE.
- since we have a restoration phase before the running phase of the standby task, we need to keep this offset limit as part of the changelog-reader as well to make sure that we do not restore beyond the limit (in fact, the checkpoint offset should never be larger than the offset limit, at most be equal to it).
@guozhangwang this is actually on the active task (StreamTask). This is probably the most minor of the list of issues, but this branch fixes calling restore at all on the active task if the standby was all caught up immediately before it became active.
As we discussed earlier, the standby doesn't do restore at all. Unfortunately, a lot of these hoops seem necessary given the current design. I do have some thoughts on how to simplify it if you have time to discuss at some point.
BTW, to be clear, I mean the restore at task startup, which is what we discussed yesterday. We seem to have overloaded the term "restore" as we also call it "restore" when we replay consumed events during normal operation...
I think this logic is still buggy that we need to fix: we should check if the partition is part of sourceChangelogPartitions here, and:
- if not, then we do not need to read the limit at all since there should be none.
- if yes, then we should initialize the limit to zero to block any updates since there should always be some value, and if the committed values are not read yet then initializing it to MAX_VALUE has a risk of going beyond the limit still.
@guozhangwang I pre-initialized the source topics that are changelogs in initializeStateStores. Since the other partitions are not registered, we get back MAX_VALUE (the default for a missing value in stateMgr). I think the only cases where we end up using the offset limit right now are during the standby -> active restore and to prevent the standby from getting ahead of the active consumer. If that's also your understanding, then my inclination would be to move the offset limit entirely into StandbyTask and to explicitly handle the standby -> active transition in StreamTask. The benefits are less inter-class collaboration, more explicit handling of the various limits, and avoiding overloading the offset limit for yet something else :). WDYT?
I have a meta comment: I think remembering the offset limit during the restoration phase is a bit awkward (not introduced by your PR, but just how it's written today). I'd still suggest we skip the restoration phase of the standby task as a whole (we can do the refactoring later if you like, and for now just let StandbyTask#initializeStateStores return false so that the task moves to running immediately). Doing this, we can actually remove the offset limit from the restorer altogether, since it is only needed for standby tasks during restoration, and if we enforce that restoration never happens then it is useless.
ping @mjsax to have another look as well.
Thanks @guozhangwang! Per our discussion I will drop restore from standby task as it is only complicating things. I may also break this up into a few different commits to make it easier to consume. Lastly, there is still a way to get the standby task wedged with the current state of this commit. I have already identified the issue and I'm working through the fix. For now it is probably best to hold off on further reviews as things are about to shift a fair amount.
Force-pushed bbdc598 to 82b378a.
I've pushed a different version of this patch to hopefully reduce confusion. I'm attempting to push offset limits as low as I can to make their use in both tasks clearer. Things that could make this cleaner:

I have a few block comments in this commit. I received feedback last time that "we don't often do comments." Seeing that this is complicated enough that it has required a lot of back and forth, I am ready to debate the worthiness of comments in the code :).
JDK 8 / Scala 2.11 timed out, which is concerning. JDK 11 / Scala 2.12 failed with an error which seems unrelated to my change. I'll check the flaky tests list and try to repro locally. On the positive side, my new tests seem more robust - they passed on JDK 8 / Scala 2.11. The new design may be useful elsewhere.
Confirmed SaslOAuthBearerSslEndToEndAuthorizationTest is flaky (the ticket was marked as a duplicate of KAFKA-8800). Retest this please.
Prior to this change, there were a few bugs:
1. Records were never applied to the standby store if the topology was optimized.
2. Even if they had been, during failover the standby store would checkpoint with offset 0, effectively requiring a full rebuild of the state store.
3. After failover, the now-active task would replay all records from the source topic, which could leave it ahead of the consumer committed offset.
All of the above are addressed in this patch. In an attempt to keep things simpler for review, this patch does not address another issue uncovered during investigation: when the standby transitions to active, the restore operation is started even when the store is fully caught up. For example, if the last consumer committed offset was 100 and the store was at 100, we still see a restore operation with start and end at 100.
Force-pushed 80d5c2e to 3d92400.
Only seems to be trying to run tests on one of the Jenkins boxes? Retest this please.
} else {
    log.info("End offset cannot be found form the returned metadata; removing this partition from the current run loop");
    final StateRestorer restorer = stateRestorers.get(topicPartition);
    if (restorer.checkpoint() >= endOffset) {
Thinking about this, the checkpoint should never be larger than the endOffset? Ie, is restorer.checkpoint() > endOffset actually an error condition and we should wipe out the state and recreate from scratch?
Totally agree from a correctness perspective. Do you think we have enough test coverage to ensure this change is safe?
Looking at this one a bit more, I think it might be a good idea to defer this change. It's pretty orthogonal to the changes I'm doing (this code was just moved in this patch). I can create a ticket if you like?
Maybe. Would be good to get input from @guozhangwang about it -- maybe >= is actually correct and I am just missing something...
    restorer.setRestoredOffset(restorer.checkpoint());
    iter.remove();
    completedRestorers.add(topicPartition);
} else if (restorer.offsetLimit() == 0 || endOffset == 0) {
This condition seems to be redundant as we compute Math.min() above anyway. I think we can simplify to endOffset == 0.
Actually, I am wondering if we can unify the first and second branch:
final long currentOffset = Math.max(restorer.checkpoint(), 0);
if (currentOffset == endOffset) {
    restorer.setRestoredOffset(currentOffset);
    iter.remove();
    completedRestorers.add(topicPartition);
} else {
    restorer.setEndingOffset(endOffset);
}
Sounds great!
Thanks @mjsax!!! Lots of great feedback. I'll get the patch back up with fixes today.
guozhangwang
left a comment
Made another pass on the PR, left some comments (one of them seems affecting correctness).
@cpettitt-confluent could you also trigger a system test to double check?
// stream task to only play records up to the last consumer committed offset. Here we find
// partitions of topics that are both sources and changelogs and set the consumer committed
// offset via stateMgr as there is not a more direct route.
final Set<String> changelogTopicNames = new HashSet<>(topology.storeToChangelogTopic().values());
This code should be functionally equivalent to the previous code. Previously we initialized the offset limits in registerStateStores, which is called by initializeStateStores here. The biggest difference is that we no longer do this immediately for standby tasks - we defer to the first time we need a new offset limit to apply a record.
The only exception that is non-fatal should be TimeoutException?
Yeah this is not a comment for this PR but more or less an FYI for folks who are working on KIP-447 :) I also left some thoughts on the voting thread regarding this purpose.
@guozhangwang, which comment do you think affects correctness? My guess is your question about whether we default to Long.MAX_VALUE for the offset limit the first time. Hopefully my response there convinces you it works correctly. If something else, please let me know. W.r.t. the system test, is that something beyond what we're doing on Jenkins? If so, is it basically following this doc: https://github.com/apache/kafka/blob/trunk/tests/README.md? And is there a subset you want me to run, or the whole suite?
long committedOffsetForPartition(final TopicPartition partition) {
    try {
        final OffsetAndMetadata metadata = consumer.committed(partition);
@guozhangwang With regard to your comment about exceptions for EOS -- it seems only relevant to catch and swallow a TimeoutException here? Or should the caller be responsible for deciding if a timeout should be swallowed? Or do you have anything else in mind?
I think TimeoutException is the one, yes, but more importantly we should see if our timeout values are set reasonably in Streams :)
Yes that's the one and yours makes sense!
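The agreed behavior can be sketched without Kafka on the classpath. This is an illustrative model only: `KafkaTimeoutException` below stands in for Kafka's unchecked `org.apache.kafka.common.errors.TimeoutException`, and the supplier stands in for the `consumer.committed()` lookup. Only a timeout is swallowed (falling back to an unbounded limit); everything else propagates:

```java
import java.util.function.LongSupplier;

public class CommittedOffsetFetch {
    // Stand-in for Kafka's unchecked TimeoutException.
    static class KafkaTimeoutException extends RuntimeException { }

    static long committedOffsetOrMax(final LongSupplier committedLookup) {
        try {
            return committedLookup.getAsLong();
        } catch (final KafkaTimeoutException e) {
            // Timed out talking to the broker: keep an unbounded limit for now.
            return Long.MAX_VALUE;
        }
        // Anything else (authorization failures, fatal errors, ...) propagates.
    }

    public static void main(final String[] args) {
        System.out.println(committedOffsetOrMax(() -> 42L));
        System.out.println(committedOffsetOrMax(() -> {
            throw new KafkaTimeoutException();
        }) == Long.MAX_VALUE);
    }
}
```

Whether the fallback should be "no limit" or "retain the previous limit" is exactly the correctness question debated elsewhere in this thread, so treat the MAX_VALUE fallback as one possible policy, not the PR's final answer.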
You can trigger a system test out of any branch at https://jenkins.confluent.io/job/system-test-kafka-branch-builder/ where you can set the repo and branch name to yours, and then we can just run all the system tests by having
mjsax
left a comment
I was wondering if we could add an integration test that verifies that a standby does not "overtake" the active task, i.e., is blocked by the committed offsets -- not sure how such a test could be written, though, atm.
    .toStream()
    .peek((k, v) -> semaphore.release());

final KafkaStreams client1 = createClient(builder, streamsConfiguration());
nit: client1 -> kafkaStreams1
nit createClient -> createKafkaStreams
(similar for all other variables eg clientStates etc)
It's technically a client, but consumer and producer are too, and we also use consumer/producer as variable names there -- it makes things easier to read -- otherwise it's unclear which client it is.
Will do. Thanks!
Any objection to calling it an instance?
While I think it would be fine, it would be "new" -- if we think instance is better, we might want to migrate all other code lazily to use instance, too. It's always best to have the same naming conventions throughout the whole codebase IMHO.
+1
I liked the name instance because it matched the documentation and was unambiguous. On the other hand, we're actually creating an object of type KafkaStreams, so that is a pretty strong argument to use the convention you originally suggested.
I'll switch instanceXXX to kafkaStreamsXXX and submit a follow up patch.
produceValueRange(key, 0, 10);

assertThat("all messages in the first batch were processed in a timely manner",
Should the error message not point out what went wrong, ie, "messages in the first batch were [not] processed in a timely manner" -- same below
I can change that. It reads a little funny "assert that messages in the first batch were not processed in a timely manner", but this should follow project convention.
Looking at this again, it is going to confuse the heck out of me if I try to reason about these messages done in the negated form. I'm inclined to pull the messages and perhaps just put a comment above them instead.
In most cases we don't have any message, so it should be fine to remove. I see your point about "assert that bla" -- however, I think if the assertion hits, the error message reads differently (i.e., with reversed logic) and hence rephrasing would make it easier to read the error message if it fails (please correct me if I am wrong).
I figured I would test it for my own knowledge.
Test case:

public class QuickTest {
    @Test
    public void testAssertThatTrueIsEqualToFalse() {
        boolean result = false;
        MatcherAssert.assertThat("the result is true", result, is(equalTo(true)));
    }
}

Result:

java.lang.AssertionError: the result is true
Expected: is <true>
     but: was <false>
I think this actually works out. It is saying the assertion did not hold (AssertionError) and then provides the assertion in the positive form. It also spells out the expected and actual values.
That said, I'm not going to go change all of those back now :).
final int totalNumMessages = batch1NumMessages + batch2NumMessages;

produceValueRange(key, batch1NumMessages, totalNumMessages);
Before we produce new messages, should we verify that the store was migrated correctly and contains store.get(key), is(equalTo(batch1NumMessages - 1)) ?
Yes, we can do that with a time limited loop that queries for that state.
@mjsax You may have uncovered yet another bug (BUG 6?) in optimized graph failover! I cannot seem to get any value for the key after failover. So, looks like this may involve yet more significant investigation. Thanks for the idea!
This test fails with non-optimized graphs too. I hope I'm doing something wrong...
whew... Apparently 10 seconds was not long enough to wait for the store to become available after failure. If I wait a bit longer it finally becomes available. Seems long though? This is the case for both non-optimized and optimized...
In case of failure, we detect the failure only after session.timeout.ms (default 10 seconds) hits -- to speed up the test, we could decrease the session timeout via StreamsConfig.
I had the same exact thought. My only reservation was that tweaking config too much might start to deviate too much from expected use. In other words, I'm totally for this if we have a reasonable value in mind. I was thinking about 1s?
I'm going to try with 1s. If you have a better value please let me know as soon as you get a chance :)
We can try that -- but we need to reduce HEARTBEAT_INTERVAL_MS_CONFIG, too, for this case -- its default is 3 seconds and it must be smaller than session timeout.
Checking some existing tests, we have some that set them to 500ms / 1sec. Some other test reduce session timeout to 2 or 5 seconds...
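The constraint discussed above (heartbeat must stay well below the session timeout, roughly a third of it by the usual rule of thumb) can be written out as a small config sketch. The property keys are the standard consumer config names; the concrete values (1 s / 333 ms) are one possible choice for a fast-failing test, not a recommendation from this PR:

```java
import java.util.Properties;

public class TestTimeouts {
    public static void main(final String[] args) {
        final Properties props = new Properties();
        // session.timeout.ms: how long before a silent group member is evicted.
        props.put("session.timeout.ms", "1000");
        // heartbeat.interval.ms: must be well below the session timeout,
        // otherwise a single missed heartbeat can trigger a rebalance.
        props.put("heartbeat.interval.ms", "333");
        System.out.println(props.getProperty("session.timeout.ms"));
        System.out.println(props.getProperty("heartbeat.interval.ms"));
    }
}
```

In a Streams test these would typically be passed through to the consumer (e.g. via the consumer-prefixed config), and lowering them trades detection latency against the risk of spurious rebalances under load.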
// task.commitNeeded and task.commit and instead just had task.commitIfNeeded. Currently
// we only call commit if commitNeeded is true, which means that we need a way to indicate
// that we are eligible for updating the offset limit outside of commit.
running.forEach((id, task) -> task.allowUpdateOfOffsetLimit());
Seems we should add AssignedStandbyTasksTest.java to unit test the new commit() behavior.
We could add a test here, but it is super implementation specific. This is not really the best home for the code, just the most convenient at the moment. All we would test, I think, is that we called allowUpdateOfOffsetLimit? The integration test covers this in a more comprehensive way and doesn't break when we move this code. What do you think?
All we would test, I think, is that we called allowUpdateOfOffsetLimit?
Yes, the test should be shouldUpdateOffsetLimitOnCommit().
The integration test covers this in a more comprehensive way and doesn't break when we move this code.
"and doesn't break" -- I guess it should break? The main purpose of the unit test is, that is would be hard to debug the integration test and figure out the root cause why it fails.
I added back the StandbyTask tests that will cover this with a much more narrow test - specifically to cover when to update the offset limit. If this gets moved into StandbyTask that test should not fail, provided it was done right. If it's broken in any way that test will flag it and it should be easy to look at the test failure and the changed code to determine the cause.
Do you think StandbyTaskTest sufficiently addresses your concern?
The PR lgtm now, thanks @cpettitt-confluent! Leaving it to @mjsax to merge when he feels comfortable and the system test has passed.
Force-pushed 56d71ac to 7a18866.
@mjsax I believe the latest patch addresses all of your feedback except for the one open question about testing that was left unresolved. Whenever you have time, this patch is ready for review. Thanks!

The system test had one failure. This doesn't seem related to my change, as it is in core. I did find a closed flaky test ticket for this test: https://issues.apache.org/jira/browse/KAFKA-8044. I will see if I can re-run just this test, since the whole suite took 16 hours!

I'm seeing the same failure for other people's system tests on Jenkins, suggesting that the ReassignPartitionsTest in Kafka core is in fact flaky again.

retest this please.

@mjsax new patch is up

System test for

Well that blew up spectacularly :). Log download is slow so I'll probably get to reviewing it tomorrow morning. If it is related to the change, my guess would be that it's the lowered session and heartbeat timeouts.

I'm going to give this one more "retest this please" and if it blows up as badly I will revert the shorter timeouts (session / heartbeat). I didn't see anything about this test in the log I looked at.

Test timed out on JDK 11 / Scala 2.12. We're green on JDK 11 / 2.13 and JDK 8 / 2.11. I looked at the three failures earlier and they were all timeouts. Going to try this one more time... retest this please!

Sigh... It seems timeouts are a thing now. Same timeout on 11/2.12. retest this please

I realized I forgot to explicitly mention that the other two Jenkins tests, 11/2.13 and 8/2.11, came back green on the last run. It was just 11/2.12 that timed out.

Per request, the following passed for all 20 runs:

LGTM (the failed test is due to the 240-min timeout again, while the other two take 230 min and 190 min respectively).
Key changes include:
- ... we need an updated offset limit to make progress.
- ... we cannot apply any of the records.
- ... checkpoint is greater than or equal to the offset limit (consumer committed offset). This needs special attention please. Code is in StoreChangelogReader.
- ... provides a way to prevent playing too many records from the changelog (also the input topic with optimized topology).
NOTE: this PR depends on KAFKA-8816, which is under review separately. Fortunately the changes involved are few. You can focus just on the KAFKA-8755 commit if you prefer.
@guozhangwang @mjsax @cadonna