KAFKA-7895: Fix stream-time reckoning for suppress (#6278)
mjsax merged 4 commits into apache:trunk from vvcephei:KAFKA-7895-fix-suppress
Conversation
* Add suppress to system tests
* Move stream-time reckoning from Task into Processor

Even within a Task, different Processors have different perceptions of time, due to record caching on stores and in suppression itself, and in general, due to any processor logic that may hold onto records arbitrarily and emit them later. Thanks to this, we can't rely on the whole task existing in the same "instant" of stream-time. The solution is for each processor node that cares about stream-time to track it independently.
@ableegoldman @mjsax @bbejeck @guozhangwang, please take a look at this fix for suppression. As I mentioned in the description, it was incorrect for a processor to use the task's view of stream time, because a processor within a task may have an arbitrarily delayed local view of time when it has an upstream record cache or suppression. This was actually always true (since I introduced stream time into the context), but it didn't manifest as a bug until suppression, which depends very strictly on the stream time. The correct thing to do is track observed stream time locally in each processor node. Since not every processor needs stream time, and since it was never in the ProcessorContext public API, I opted to remove it from the ProcessorContext and just explicitly implement it in the nodes that need it. If we wish to expose "stream time" to processors in general, we can instead track stream time in ProcessorNode and then wrap the ProcessorContext to supply the local stream time when injecting the context into each processor. Ideally, we can merge this fix into 2.2 as well.
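The per-node tracking described above boils down to a monotonic max over the timestamps a node actually observes. A minimal sketch of the pattern (illustrative only; `LocalStreamTime` and `observe` are hypothetical names, not the actual Kafka Streams internals):

```java
// Illustrative sketch of per-processor stream-time tracking; this is NOT
// the real Kafka Streams implementation, just the pattern it follows.
public class LocalStreamTime {
    // Sentinel for "no record observed yet" (Kafka uses
    // ConsumerRecord.NO_TIMESTAMP, which is -1L).
    private long observedStreamTime = -1L;

    // Called for each record that actually flows through this node.
    // Stream time is the max timestamp observed so far, so it never goes
    // backwards, even when late records arrive.
    public long observe(final long recordTimestamp) {
        if (recordTimestamp > observedStreamTime) {
            observedStreamTime = recordTimestamp;
        }
        return observedStreamTime;
    }
}
```

Because each node calls this only for records it receives, a node downstream of a cache or suppression sees time advance at its own (possibly delayed) pace, rather than borrowing the task's "future" clock.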
I guess it should be noted that the same condition could cause segments to be prematurely expired and records to be prematurely rejected from windowed aggregations due to grace-period restrictions, if the windowed aggregation were downstream of a cache or suppression. It just happens that suppression was the first thing that people noticed (probably due to the long default retention and grace period).

Java 11 failure was a Kafka admin test. Apparently, Gradle crashed in the Java 8 build! Retest this, please.

https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/2402/ Java 8 timed out (cf. https://issues.apache.org/jira/browse/KAFKA-7933).
  private StreamsMetricsImpl metrics;
  private InternalProcessorContext internalProcessorContext;
  private Sensor lateRecordDropSensor;
+ private long observedStreamTime = -1L;
nit: use ConsumerRecord.NO_TIMESTAMP instead of -1L (same below)
Hmm, should we use Long.MIN_VALUE instead, to account for negative timestamps?
We don't need a sentinel value; we just need something that's not larger than a real record's timestamp.
So far, we don't support negative timestamps, so it seems cleanest to use NO_TIMESTAMP for now -- if we ever introduce negative timestamps, we need to update the code anyway.
Sounds good. It's updated.
      }
  };

+ private long streamTime = RecordQueue.UNKNOWN;
Can you fix RecordQueue.UNKNOWN as a side thing, too?
You mean, set it to ConsumerRecord.NO_TIMESTAMP, right?
  @Override
- public S getOrCreateSegmentIfLive(final long segmentId, final InternalProcessorContext context) {
-     final long minLiveTimestamp = context.streamTime() - retentionPeriod;
+ public S getOrCreateSegmentIfLive(final long segmentId, final InternalProcessorContext context, final long streamTime) {
nit: break parameters into single lines
- segment.write(batch);
+ if (segment.open) {
+     segment.write(batch);
+ }
Can you elaborate? Seems to be orthogonal to the timestamp fix.
The segmented store expires segments based on stream time.
Previously, the stream time it used was coming from the ProcessorContext, which doesn't get updated while we're processing this batch write, so the whole method execution happened in one "instant". That instant would be at least the latest time in the batch. So when it loaded all the segments in building batchWriteMap, it was already able to tell which records were for expired segments and drop them.
Now, though, it tracks stream time based on the sequence of received writes, so at the beginning of this method call, the local stream time is maybe at the beginning of the batch. This means that in the beginning of this method, it creates/gets segments for everything in the batch (not updating the stream time at all), so all the segments are still live until we enter the loop over writeBatchMap.
When we loop over the batch map, we're actually processing the writes, which advances the local view of stream time, possibly expiring some segments. Since stream-time is actually advancing during the loop, and segments are being expired as the loop executes, we can no longer expect all our segments to be open here.
Thanks for the details. Should we instead do a "clean up" within getWriteBatches() and only return open segments? Seems to be a better separation of concerns.
Hah. Ok, that's what I get for commenting from memory. It turns out that we actually do advance time within getWriteBatches. The problem is actually that in that method, we're advancing time while we're also creating the segments and populating the result list. So, by the end of the iteration, some of the segments at the beginning of the result list may already be expired.
We can fix it by either pre-iterating over all the records to advance time to the largest timestamp in the batch or post-iterating over the results to remove the expired segments.
I opted for the pre-iteration so that we can potentially avoid creating and opening a segment that would just need to be closed and cleaned up later in the same batch.
The code is cleaned up now. Thanks for the catch!
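The pre-iteration idea discussed in this thread can be sketched as follows. This is a hypothetical distillation, not the real RocksDBSegmentedBytesStore#getWriteBatches code; the class and method names here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of "pre-iterate to advance stream time first": advance the local
// stream time over the whole batch BEFORE deciding which writes belong to
// still-live segments, so we never open a segment that the same batch
// would immediately expire. Names are illustrative, not Kafka's API.
public class BatchStreamTimeSketch {
    private long observedStreamTime = -1L;
    private final long retentionPeriod;

    public BatchStreamTimeSketch(final long retentionPeriod) {
        this.retentionPeriod = retentionPeriod;
    }

    public List<Long> liveTimestamps(final List<Long> batch) {
        // First pass: advance stream time to the batch maximum.
        for (final long ts : batch) {
            observedStreamTime = Math.max(observedStreamTime, ts);
        }
        // Second pass: with the final stream time known, only writes whose
        // timestamps are within the retention window are kept.
        final long minLiveTimestamp = observedStreamTime - retentionPeriod;
        final List<Long> live = new ArrayList<>();
        for (final long ts : batch) {
            if (ts >= minLiveTimestamp) {
                live.add(ts); // the real store would get-or-create a live segment here
            }
        }
        return live;
    }
}
```

With the post-iteration alternative, the first loop would create segments for every write and a final sweep would close and remove the expired ones; the pre-iteration version avoids ever opening them.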
  context.setStreamTime(Math.max(retention, segmentInterval) * 2);
  bytesStore.put(serializeKey(new Windowed<>("a", windows[0])), serializeValue(5));
  bytesStore.put(serializeKey(new Windowed<>("dummy", nextSegmentWindow)), serializeValue(0));
Is this to "set" a specific observedStreamTime? Might need a comment.
yes, it's to set the stream time to a high enough value that it expires the previous segment... I'll add a comment.
  LogCaptureAppender.setClassLoggerToDebug(RocksDBSegmentedBytesStore.class);
  final LogCaptureAppender appender = LogCaptureAppender.createAndRegister();

  context.setStreamTime(Math.max(retention, segmentInterval) * 2);
I can't map this to the nextSegmentWindow bounds above. Can you elaborate?
The new code is a tighter bound. Previously we used 2 * max(retention, interval), but all we really need is retention + interval. I'm not sure why I wrote the previous code; 2 * max(retention, interval) is guaranteed to be at least retention + interval, but it's pretty esoteric...
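To make the bound concrete (the numbers below are made up for illustration, not taken from the actual test): a record at time 0 lands in the segment covering [0, segmentInterval), and that segment expires once streamTime - retention reaches the segment's end. So retention + segmentInterval is the tightest advance that forces expiry, while the old 2 * max(retention, segmentInterval) is always at least that large:

```java
// Illustrative arithmetic only; values are hypothetical.
public class ExpiryBoundSketch {
    public static void main(final String[] args) {
        final long retention = 1000L;
        final long segmentInterval = 300L;

        // Old test value: guaranteed big enough, but looser than needed.
        final long oldStreamTime = Math.max(retention, segmentInterval) * 2; // 2000
        // Tighter bound: exactly when the segment [0, segmentInterval) expires.
        final long newStreamTime = retention + segmentInterval;              // 1300

        // The segment holding a record at time 0 is expired once
        // streamTime - retention >= segmentInterval.
        System.out.println(newStreamTime - retention >= segmentInterval); // true
        System.out.println(oldStreamTime >= newStreamTime);               // true
    }
}
```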
Thanks for the review, @mjsax. I think I've addressed your comments.
bbejeck left a comment
Thanks for the patch, @vvcephei. I've made one pass and I have one question regarding the test. The smoke test covers the scenario of a time window for a 24-hour period with a grace period of one minute, then sends records way beyond the time bounds to force a flush.
That makes sense, but can we also test with tighter time bounds, with windows of, say, 30 seconds, and use system time without adding any extra time for verification?
I think that's a good idea, @bbejeck. I'm thinking this could be an integration test, to avoid adding too much onto the smoke test; do you agree? Actually, after wrestling with this for a bit, it's more complicated than I thought. What do you think about tackling that in a separate PR? I went ahead and added it to the acceptance criteria of KAFKA-7895. There's also a task there to add system test coverage with caching disabled, so this PR wouldn't have resolved the ticket anyway.

Failures unrelated: https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/2414/ Retest this, please.

Java 11 had one unrelated failure:

That plaintext consumer test was the only failure. I'm not sure if there's value in repeating the tests at this point.

https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/2424/ Retest this, please.

Hey, it passed that time! Thanks, @mjsax.
Yes, I actually was thinking of an integration test and just wasn't being clear about it.
That's fine.
bbejeck left a comment
Thanks for the patch, @vvcephei. Overall it looks good; I just wanted to ask a clarifying question.
Would it be fair to say the new approach "smooths" out the representation of time per processor, in that it can only ever increase or, at a minimum, stay the same?
The process of only updating the observedStreamTime when the internalProcessorContext#timestamp() is greater than the last observed time is more or less the same approach taken in RecordQueue#maybeUpdateTimestamp(), correct?
Looks good, thanks @vvcephei

Thanks, @bbejeck. Yes, for the processors that track stream time, the definition of time is essentially the same as it is for the record queue: it's always the maximum record timestamp ever observed, so, as you say, it will always increase or stay the same.

It used to be true that (when there's one or more upstream caches or suppressions in the same task) we'd always see records arriving from the past. E.g., the clock would say 10:00, but I'd be getting records from 09:55. Now, we've made stream time tighter, so the clock will say 09:55 when we're getting records for 09:55.

Another change is that, for the processors tracking stream time, time only advances for records that actually flow through the processor, so we won't see the clock continuing to "tick" as a result of records that actually get filtered out before they get to us. I think this is good for predictability. That is, it's a simpler mental model, so it's easier to predict how the windowed aggs and suppression will behave wrt stream time. I think it makes sense to say that this smooths out the representation of stream time.
Thanks @mbragg02!

@vvcephei This PR has conflicts with

Merged to

Thanks, @mjsax! I will do so now.
* ak/trunk: (45 commits)
  KAFKA-7487: DumpLogSegments misreports offset mismatches (apache#5756)
  MINOR: improve JavaDocs about auto-repartitioning in Streams DSL (apache#6269)
  KAFKA-7935: UNSUPPORTED_COMPRESSION_TYPE if ReplicaManager.getLogConfig returns None (apache#6274)
  KAFKA-7895: Fix stream-time reckoning for suppress (apache#6278)
  KAFKA-6569: Move OffsetIndex/TimeIndex logger to companion object (apache#4586)
  MINOR: add log indicating the suppression time (apache#6260)
  MINOR: Make info logs for KafkaConsumer a bit more verbose (apache#6279)
  KAFKA-7758: Reuse KGroupedStream/KGroupedTable with named repartition topics (apache#6265)
  KAFKA-7884; Docs for message.format.version should display valid values (apache#6209)
  MINOR: Save failed test output to build output directory
  MINOR: add test for StreamsSmokeTestDriver (apache#6231)
  MINOR: Fix bugs identified by compiler warnings (apache#6258)
  KAFKA-6474: Rewrite tests to use new public TopologyTestDriver [part 4] (apache#5433)
  MINOR: fix bypasses in ChangeLogging stores (apache#6266)
  MINOR: Make MockClient#poll() more thread-safe (apache#5942)
  MINOR: drop dbAccessor reference on close (apache#6254)
  KAFKA-7811: Avoid unnecessary lock acquire when KafkaConsumer commits offsets (apache#6119)
  KAFKA-7916: Unify store wrapping code for clarity (apache#6255)
  MINOR: Add missing Alter Operation to Topic supported operations list in AclCommand
  KAFKA-7921: log at error level for missing source topic (apache#6262)
  ...
Reviewers: Bill Bejeck <bill@confluent.io>, Matthias J. Sax <matthias@confluent.io>
Even within a Task, different Processors have different perceptions
of time, due to record caching on stores and in suppression itself,
and in general, due to any processor logic that may hold onto
records arbitrarily and emit them later. Thanks to this, we can't rely
on the whole task existing in the same "instant" of stream-time. The
solution is for each processor node that cares about stream-time to
track it independently.