[KAFKA-7994] Improve Stream time accuracy for restarts and rebalances #6694
mjsax merged 80 commits into apache:trunk from ConcurrencyPractitioner:kafka-7994
Conversation
ping @mjsax @guozhangwang @vvcephei for review
@SuppressWarnings("unchecked")
public boolean process() {
    // if condition put here in case of restarts and rebalances to check for correct timestamp
    if (recordInfo.queue() != null && partitionGroup.getPartitionTimestamp(recordInfo.partition()) == -1) {
nit: use RecordQueue.UNKNOWN instead of -1
No problem, could fix that.
Thanks for the PR! I agree this seems like a straightforward patch, but I'm wondering if we shouldn't think through the eos case a bit more. Or is there really no way to safely cover it as well?
Hi @ableegoldman Thanks for reviewing! I was planning on attacking the eos case, but as you can guess from the code, retrieving the committed metadata is not as simple as in the non-eos case. I was hoping for some input on that, so some small amount of advice is greatly appreciated. :)
Oh, just found out something. Regardless of whether it is the eos case or not, calling
That sounds reasonable. Looks like the build failed on checkstyle; can you try running it? +1 on adding test cases
// confirm that timestamp was correctly committed
assertTrue(Long.parseLong(task.consumer.committed(partition1).metadata())
    == task.getPartitionTime(partition1));
Can we add separate unit tests to confirm this produces the expected behavior? I think the JIRA had some examples highlighting why this is a problem, it would be good to convert those into tests to make sure we're really fixing the problem at hand :)
No problem. Added a new test case to confirm behavior.
Alright, done. @mjsax Added a test case as well. Would be good if you could take a look. :)
pinging @mjsax and @guozhangwang for review
Oh sorry, my bad. I underestimated the scope of the PR. Sorry for pinging you guys. Will dig some more.
Retest this please.
pinging @mjsax @ableegoldman @abbccdda @guozhangwang for final review.
cadonna left a comment
Thank you for the PR @ConcurrencyPractitioner!
Here are my comments:
import static org.apache.kafka.streams.integration.utils.IntegrationTestUtils.cleanStateBeforeTest;
import static org.apache.kafka.streams.integration.utils.IntegrationTestUtils.getStartedStreams;
import static org.hamcrest.CoreMatchers.is;
import static org.hamcrest.MatcherAssert.assertThat;
The order of the imports in Kafka Streams is usually as follows:
1. Kafka imports and 3rd-party imports in one block
2. a block of java.* imports
3. import static
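A minimal sketch of that ordering (the class and the specific imports are hypothetical, chosen only to illustrate the three blocks):

```java
// Block 1: Kafka / 3rd-party imports (org.w3c.dom used here as a JDK-bundled stand-in)
import org.w3c.dom.Document;

// Block 2: java.* imports
import java.util.Map;

// Block 3: static imports last
import static java.util.Collections.emptyMap;

public class ImportOrderExample {
    public static void main(final String[] args) {
        final Map<String, Document> docs = emptyMap();
        System.out.println(docs.size()); // prints 0
    }
}
```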
task.commit();
assertEquals(DEFAULT_TIMESTAMP, task.decodeTimestamp(consumer.committed(partition1).metadata()));
// reset times here to artificially represent a restart
task.resetTimes();
Wouldn't creating a new task be better? AFAIK, that is what happens during a restart. No need to simulate anything. Furthermore, it avoids introducing a new method just for testing.
// extract the committed metadata from MockProducer
final List<Map<String, Map<TopicPartition, OffsetAndMetadata>>> metadataList =
    producer.consumerGroupOffsetsHistory();
final String storedMetadata = metadataList.get(0).get("stream-task-test").get(partition1).metadata();
Would be good to extract "stream-task-test" to a member field of the test and use it in createConfig() and here.
consumer.commitSync(offsetMap);

// reset times here to artificially represent a restart
task.resetTimes();
}

// visible for testing
String encodeTimestamp(final long partitionTime) {
I would put the methods to write and read record metadata in their own classes. Those classes would be a kind of SerDes for metadata. Such SerDes would make the code more testable and separate the concerns of a task from reading and writing metadata, which are completely independent. It does not need to be done in this PR; I just wanted to mention it.
Yeah, that would probably be a good idea in the future.
EasyMock.expect(t2.partitions()).andReturn(t2partitions);
EasyMock.expect(t2.changelogPartitions()).andReturn(Collections.emptyList());

t1.initializeTaskTime();
Please remove empty line before this line.
assertThrows(errMessage, NullPointerException.class, () -> {
    group.setPartitionTime(randomPartition, 0L);
});
}
This test misses verifying whether streamTime is set or not.
Furthermore, I would write two (or three) distinct tests:
1. partitionTimestamp is set (could be further split for whether streamTime is set or not)
2. NullPointerException is thrown
task.addRecords(partition1, singletonList(getConsumerRecord(partition1, DEFAULT_TIMESTAMP)));

task.process();
task.commit();
The code block from the beginning of the method until here can be extracted and re-used in this and the previous test methods.
}

@Test
public void testSetPartitionTimestamp() {
I think we use should... names for newly added test methods.
@cadonna Alright, done.
@mjsax @abbccdda @guozhangwang @ableegoldman
This issue is the cause of critical bugs we recently faced in our applications that rely on the
@mjsax do you think this fix can be included as part of 2.3.1?
@mjsax pinging.
@marcospassos I don't think that we will include it in 2.3.1 -- it's not really a bug fix but an improvement. @ConcurrencyPractitioner I will try to review again in the next days.
cadonna left a comment
@ConcurrencyPractitioner Sorry for the delay.
if (partitionQueues.get(partition) == null) {
    throw new NullPointerException("Partition " + partition + " not found.");
}
return partitionQueues.get(partition).partitionTime();
Here it would be better to call partitionQueues.get(partition) only once and store its result in a variable. Then check the variable for null and call partitionTime() on the variable.
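The suggested refactor can be sketched as follows. Note that RecordQueue here is a minimal hypothetical stand-in, not the real Kafka Streams class, and the partition key is simplified to a String:

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionGroupSketch {
    // Minimal stand-in for RecordQueue, just enough to show the lookup pattern
    static class RecordQueue {
        long partitionTime = -1L;
        long partitionTime() { return partitionTime; }
    }

    final Map<String, RecordQueue> partitionQueues = new HashMap<>();

    // Single map lookup stored in a local variable, then null-checked,
    // instead of calling partitionQueues.get(partition) twice
    long getPartitionTimestamp(final String partition) {
        final RecordQueue queue = partitionQueues.get(partition);
        if (queue == null) {
            throw new NullPointerException("Partition " + partition + " not found.");
        }
        return queue.partitionTime();
    }

    public static void main(final String[] args) {
        final PartitionGroupSketch group = new PartitionGroupSketch();
        final RecordQueue queue = new RecordQueue();
        queue.partitionTime = 5L;
        group.partitionQueues.put("p0", queue);
        System.out.println(group.getPartitionTimestamp("p0")); // prints 5
    }
}
```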
if (streamTime < partitionTime) {
    streamTime = partitionTime;
}
partitionQueues.get(partition).setPartitionTime(partitionTime);
final ByteBuffer buffer = ByteBuffer.allocate(9);
buffer.put(LATEST_MAGIC_BYTE);
buffer.putLong(partitionTime);
return Base64.getEncoder().encodeToString(buffer.array());
I am wondering whether we can do better here. Encoding partition time in Base64 seems to me a bit of a waste of space. As far as I can see, an 8-byte value is encoded in 11 bytes with Base64. It would be great if we could store partition time in 8 bytes.
I am also wondering why metadata in OffsetAndMetadata is a String and not something more byte-friendly.
@cadonna Yeah, it is still unclear at this point whether the metadata field in OffsetAndMetadata could be used in this manner. @guozhangwang or @hachikuji knows this matter better. Anyhow, OffsetAndMetadata right now is the only medium through which we can checkpoint partition time, so we might be stuck with using the metadata field.
I don't have the full context on the history, but it would not be easy to change the API... I talked to Jason about it, and it seems we can just move forward with this PR as-is, and could do a KIP later that allows us to store metadata as byte[] type if we really need to change it. Atm, the metadata is just a few bytes and the overhead does not really matter IMHO.
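For reference, here is a self-contained sketch of the magic-byte-plus-Base64 scheme under discussion. The class name, the magic-byte value, and the -1 fallback for unrecognized versions are assumptions for illustration, not the actual StreamTask code:

```java
import java.nio.ByteBuffer;
import java.util.Base64;

public class TimestampMetadata {
    static final byte LATEST_MAGIC_BYTE = 1; // assumed version byte

    // Encode a partition time as a versioned Base64 string suitable for
    // the String metadata field of OffsetAndMetadata
    static String encodeTimestamp(final long partitionTime) {
        final ByteBuffer buffer = ByteBuffer.allocate(9); // 1 magic byte + 8-byte long
        buffer.put(LATEST_MAGIC_BYTE);
        buffer.putLong(partitionTime);
        return Base64.getEncoder().encodeToString(buffer.array());
    }

    // Decode, falling back to -1 (i.e. "unknown") on unrecognized versions
    static long decodeTimestamp(final String encoded) {
        final ByteBuffer buffer = ByteBuffer.wrap(Base64.getDecoder().decode(encoded));
        final byte version = buffer.get();
        if (version == LATEST_MAGIC_BYTE) {
            return buffer.getLong();
        }
        return -1L; // unknown format
    }

    public static void main(final String[] args) {
        final String encoded = encodeTimestamp(1234567890L);
        System.out.println(encoded.length()); // 9 raw bytes -> 12 Base64 characters
        System.out.println(decodeTimestamp(encoded)); // prints 1234567890
    }
}
```

This also illustrates the space concern raised above: the 9 raw bytes grow to 12 characters once Base64-encoded.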
Retest this please.
*/
// visible for testing
void commit(final boolean startNewTransaction) {
void commit(final boolean startNewTransaction, final Map<TopicPartition, Long> partitionTimes) {

I know that I recommended to add this parameter, but now, after more refactoring of the code, I am not sure any longer why we need it. It seems that this method is called twice and both calls pass in the result of extractPartitionTimes() as parameter -- hence, it seems we can remove the parameter and do the call to extractPartitionTimes() within the method itself?
}

private void initializeCommittedTimestamp(final TopicPartition partition) {
    final OffsetAndMetadata metadata = consumer.committed(partition);
This is a blocking call, and @guozhangwang just proposed KIP-520 to make it more efficient by allowing multiple partitions to be passed in at once. Should we wait for KIP-520 to be implemented? If not, we should make sure to update this code after KIP-520 is merged.
I am also wondering how we should handle TimeoutException for this call? Maybe not, but it might be worth clarifying.
\cc @guozhangwang
In my PR (#7304) I've refactored this part in StreamTask. I'd suggest we merge that one before this.
Just realized I need to do another rebase on my PR. So if this PR is closer to being merged, I'd suggest @RichardYuSTUG @mjsax you guys just move forward and I will rebase mine later.
@mjsax Cool, sounds good. In that case, we could get this one merged, since it is about complete.
@ConcurrencyPractitioner All builds reported SpotBugs issues.
Yeah, got it fixed.
The following test failures seem related:
@cadonna Oh, just realized that @mjsax's comment caused a regression. If you look earlier in the conversation, you will find a segment where a call to close() resets all partition times to negative one. Therefore, we need to store the partition times in a map before they are reset, and then pass them into the commit() method. The extra parameter is needed after all due to the order of operations in close(). We will need to roll back some changes.
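The ordering constraint described here can be sketched with hypothetical stand-ins for the StreamTask internals (none of these names are from the actual code):

```java
import java.util.HashMap;
import java.util.Map;

public class SuspendOrderingSketch {
    static final Map<String, Long> partitionTime = new HashMap<>();
    static Map<String, Long> lastCommitted;

    // Stand-in for closeTopology(): resets all partition times to -1
    static void closeTopology() {
        partitionTime.replaceAll((partition, time) -> -1L);
    }

    // Stand-in for commit(): must receive times captured BEFORE the reset
    static void commit(final Map<String, Long> partitionTimes) {
        lastCommitted = partitionTimes;
    }

    public static void main(final String[] args) {
        partitionTime.put("p0", 42L);
        // capture before closeTopology(), otherwise -1 would be committed
        final Map<String, Long> captured = new HashMap<>(partitionTime);
        closeTopology();
        commit(captured);
        System.out.println(lastCommitted.get("p0")); // prints 42
    }
}
```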
// visible for testing
void suspend(final boolean clean,
             final boolean isZombie) {
    final Map<TopicPartition, Long> partitionTimes = extractPartitionTimes();
I remember now -- can we add a comment to explain that we need to get partitionTimes before we closeTopology()? (sorry for my previous comment -- forgot about that)
Cool, got it done.
cadonna left a comment
LGTM, thank you @ConcurrencyPractitioner
Thanks for the hard work @ConcurrencyPractitioner!
The issue for this PR can be found here:
https://issues.apache.org/jira/browse/KAFKA-7994?jql=project%20%3D%20KAFKA%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22)
As noted in the JIRA description, stream time is incorrectly set to -1 after rebalances and restarts. To help resolve this issue, one approach is to commit the individual partition time along with the last message processed for each RecordQueue. Hence, after a restart, we can set the partition time to the last committed partition time.
We would also have to forward timestamps downstream after the head of the DAG receives records that update stream time and global time.
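A minimal sketch of the overall idea, with a plain map standing in for the committed offset metadata and all names hypothetical (the real implementation goes through OffsetAndMetadata and the consumer):

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionTimeRestoreSketch {
    static final long UNKNOWN = -1L; // mirrors RecordQueue.UNKNOWN

    // Simulated committed offsets: partition -> metadata string carrying partition time
    static final Map<Integer, String> committedMetadata = new HashMap<>();

    // On commit, piggy-back the partition time in the metadata string
    static void commit(final int partition, final long partitionTime) {
        committedMetadata.put(partition, Long.toString(partitionTime));
    }

    // On restart, initialize partition time from committed metadata instead of -1
    static long initializeCommittedTimestamp(final int partition) {
        final String metadata = committedMetadata.get(partition);
        return metadata == null ? UNKNOWN : Long.parseLong(metadata);
    }

    public static void main(final String[] args) {
        commit(0, 1000L);
        // "restart": partition time recovered from metadata rather than reset to -1
        System.out.println(initializeCommittedTimestamp(0)); // prints 1000
        System.out.println(initializeCommittedTimestamp(1)); // prints -1
    }
}
```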
Committer Checklist (excluded from commit message)