KAFKA-6455: Session Aggregation should use window-end-time as record timestamp #6645
Conversation
Replacing Stores.sessionStoreBuilder with a call to new SessionStoreBuilder to be able to set this newly introduced flag to true. It's not super clean, but I did not have a better idea to tackle this (more details below).
IIUC, you do not want to expose the flag in the public API. To make it super clean, you would need an internal factory method somewhere that is called here. Also the public factory method in Stores would then need to call the internal factory method with the flag set to false.
It's not exposed. Stores does set the flag hardcoded to false and SessionStoreBuilder is not part of public API. Still, it seems not clean for internal code. It basically leaks DSL into PAPI, but stores should be DSL agnostic...
FWIW, I think calling a constructor is just as clean as calling a static factory method. The flag itself is a little suspicious (although it may be unavoidable), but (IMO) calling the constructor is uncontroversial.
I think @mjsax's original comment is about whether passing this flag to the store builder, and hence to the caching store, is the most elegant way.
I'm actually wondering if it is really necessary -- see my other comment below.
By default, the new flag is set to false for backward compatibility.
We switch to the new semantics only if the SessionStore is used by the DSL, to preserve backward compatibility.
Update the test to also compare the result record timestamp
If we merge two sessions, we use the session-end-timestamp on delete for the smaller session now.
Hmm... Does this say that k1@0/0 was both created and deleted at time 0?
Maybe -- I was not really sure about this one. We never discussed how deletes should be handled. If we don't use the session-window end-timestamp, it seems we might reintroduce non-determinism -- not sure.
This is a very tricky subject, but it seems like the deletes should "happen" at the same time as the update. This would be the window-end time of the final merged window. In that case, we should actually not do any mutation while we're merging, but instead collect all the stuff to delete, and delete it at the end, while we also issue the update. (Since we wouldn't know the timestamp to use until after the merging is done).
Actually, this would also let us fix https://issues.apache.org/jira/browse/KAFKA-8318 , and opens up the possibility of doing a bulk/batch update to the state store.
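The "collect all the stuff to delete, and delete it at the end" idea could look like the following sketch. This is an assumed design, not the actual Kafka Streams code; the `Window` and `Emitted` types are simplified stand-ins. The point is that tombstones are only emitted after merging finishes, so they can all carry the merged window's end timestamp, which is unknown until then.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "collect, then delete" idea (assumed design, not the actual
// Kafka Streams implementation): while merging overlapping sessions we only
// collect the windows to remove; the tombstones and the final update are
// emitted afterwards, stamped with the merged window's end time.
public class DeferredSessionMerge {
    record Window(long start, long end) {}
    record Emitted(String type, Window window, long timestamp) {}

    static List<Emitted> mergeAll(List<Window> overlapping) {
        List<Window> toDelete = new ArrayList<>();
        long start = Long.MAX_VALUE, end = Long.MIN_VALUE;
        for (Window w : overlapping) {          // no mutation while merging
            toDelete.add(w);
            start = Math.min(start, w.start());
            end = Math.max(end, w.end());
        }
        List<Emitted> out = new ArrayList<>();
        for (Window w : toDelete) {             // tombstones use the final end time
            out.add(new Emitted("delete", w, end));
        }
        out.add(new Emitted("update", new Window(start, end), end));
        return out;
    }

    public static void main(String[] args) {
        for (Emitted e : mergeAll(List.of(new Window(0, 3), new Window(2, 7)))) {
            System.out.println(e);
        }
    }
}
```

Batching the emissions this way is also what would make a bulk state-store update possible.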
Not sure if I understand the connection to 8318? Why do we need the merged-window end-timestamp to fix it? Also not sure how bulk/batch updates relate?
Also curious what others think. \cc @guozhangwang @bbejeck @ableegoldman @cadonna @abbccdda
Hmm, I'm going to be a PITA and change my mind on this. Sorry, @bbejeck ...
After some further reflection, here's what I'm thinking:
Session windows have this lifecycle: creation, multiple updates, and deletion. Creation has always taken place effectively at window-end time, since when a window is created, start==end==record.timestamp.
Previously, updates to a window were just assigned the timestamp of the event that caused the update. Let's say you have a count aggregation, you have some window [0,5] with a count of 2, and you get some more input events at times 5, 3, 4; then your sequence of result updates is 3 at time 5, 4 at time 3, and 5 at time 4. This is semantically problematic, because the timestamps tell you these results are "out of order" and that the "most recent" count is actually 3 :( .
What @mjsax is proposing here is to "pin" all updates to a window to window-end time. Then, the result updates for our example is just 3 at time 5, 4 at time 5, 5 at time 5. This is totally fine, since, in the case of identical timestamps, offset order is the tie-breaker. Therefore, the "most recent" count is (correctly) 5.
It's worth noting that in general, we get an equally correct sequence of window updates as long as the update times don't go backwards. Just save this thought for a minute.
Now, we come to deletes. Arguably, the delete is just another kind of update. I think this is where Matthias's head was originally at. The same logic applies, if the delete timestamp is equal to the update timestamps and the create timestamp, then offset order is the tie-breaker. Since we delete the window after the creation and updates, the final state of the window is (correctly) "deleted".
Unlike the creates/updates, though, the delete is "caused" by an event with a timestamp after the window-end time. This is what was tripping me up. It seems fine to report a window update time as the "high watermark" time of all the window updates so far (in the case of disordered events), but it seems weird to report a window update (in this case, the delete) time as "in the past", from the perspective of the event that caused it. That's why I was thinking that we should use the later timestamp, to indicate that the delete was caused at that later time. This is also correct from a time-semantics POV, because it preserves the correct order, that the delete comes after the creates/updates.
So, this is the punchline: both approaches result in correct time semantics. The only difference is that using the causing-event timestamp for deletes reflects the provenance of the delete more accurately, but I don't think this fact is actually useful for anything. Given that we're already pinning the create and update timestamps to the window-end time, I'm thinking we should stick with Matthias's original proposal and use the same timestamp for the "final" update (aka, the delete).
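The "pinning" behavior from the counting example above can be sketched as follows. This is an illustrative model, not Kafka Streams code; the `Update` record and `apply` helper are hypothetical. Each emitted update carries the window-end timestamp, so out-of-order input still produces a monotonic update stream.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not Kafka Streams code): emit session-window count
// updates with the timestamp pinned to the window-end time, so that
// out-of-order input still yields a non-decreasing timestamp sequence.
public class PinnedUpdates {
    // One emitted update: the new count and the timestamp attached to it.
    record Update(long count, long timestamp) {}

    // Window with the given end time and an existing count; apply events in
    // arrival order, pinning every update's timestamp to the window end.
    static List<Update> apply(long windowEnd, long startCount, long[] eventTimes) {
        List<Update> updates = new ArrayList<>();
        long count = startCount;
        long end = windowEnd;
        for (long ts : eventTimes) {
            end = Math.max(end, ts);             // the window end only grows
            count++;
            updates.add(new Update(count, end)); // pin to window-end time
        }
        return updates;
    }

    public static void main(String[] args) {
        // Window [0,5] with count 2; events arrive at times 5, 3, 4.
        for (Update u : apply(5L, 2L, new long[]{5, 3, 4})) {
            System.out.println(u.count() + " @ " + u.timestamp());
        }
    }
}
```

With identical timestamps, offset order breaks the tie, so the last update (count 5) is correctly the "most recent" one.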
@vvcephei Thanks for the detailed analysis! I think I agree with you and @mjsax on the approach, just to clarify my understanding further:
Today, even if we are not merging two session windows, a single session window's update is treated as a delete followed by an update. I think this is what https://issues.apache.org/jira/browse/KAFKA-8318 is reporting.
Now the logic would become that we use max(ts, window_end_time) for updates, hence:
1. With an update whose ts is smaller than the current window-end AND larger than the current window-start, the window start/end times would not change in the update record or in the changelog. In this case, we can consider optimizing it by not doing the delete followed by an update (i.e. KAFKA-8318).
   - but practically, with rare out-of-order data this would probably give a very small perf boost, right?
2. With an update whose ts is larger than the current window-end time, OR smaller than the window-start time, we would apply it as a delete of the original record (hence the tombstone ts == the old end-time) followed by a put of the new record with the new start/end-time == this record's ts.
3. Merging two windows is actually equal to updating the smaller window with a larger end time and updating the larger window with a smaller start time (of course, with a single record).
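The three cases above can be captured in a small sketch. This is an assumed model for illustration, not the actual store implementation; `Window`, `update`, `needsDeleteAndInsert`, and `merge` are hypothetical names.

```java
// Illustrative sketch (not the actual Kafka implementation) of the three
// cases discussed above for updating a session window [start, end] with a
// record at time ts, and for merging two overlapping sessions.
public class SessionWindowUpdate {
    record Window(long start, long end) {}

    // Updating one window: the bounds only change if ts falls outside them.
    static Window update(Window w, long ts) {
        return new Window(Math.min(w.start(), ts), Math.max(w.end(), ts));
    }

    // A delete+insert pair is only needed when the bounds actually change;
    // the in-bounds case is where KAFKA-8318's redundant tombstone arises.
    static boolean needsDeleteAndInsert(Window w, long ts) {
        return ts < w.start() || ts > w.end();
    }

    // Merging two sessions: two deletes, one insert spanning both windows.
    static Window merge(Window a, Window b) {
        return new Window(Math.min(a.start(), b.start()), Math.max(a.end(), b.end()));
    }

    public static void main(String[] args) {
        Window w = new Window(3, 8);
        System.out.println(update(w, 5));                 // bounds unchanged
        System.out.println(needsDeleteAndInsert(w, 5));   // false
        System.out.println(merge(w, new Window(7, 12)));  // spans [3, 12]
    }
}
```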
Is my understanding correct?
Hey @guozhangwang , just getting back to this thread...
For 1, yes, I think this is the situation, and I agree with your conclusion (under the assumption that out-of-order data is actually rare)
For 2, almost... for "followed by a put of the new record with the new start / end-time == this record's ts", do you mean the new end-time of the window? That's what we will use as the update timestamp. Note, the new end-time of the window might be the same as the old one, which brings us back to KAFKA-8318.
For 3, I'm afraid I don't follow. Looking at the sequence of updates, it's equivalent to deleting both the original windows and then adding a new one that spans both (and is semantically the merge result). Is this what you meant?
Sounds correct. Only (3 - merging two windows) should be two deletes (of the old windows) and one insert of the new merged window with [first-window-start-ts, second-window-end-ts].
> but practically, with rare out-of-order data this would probably give a very small perf boost, right?

It's not about performance, but just annoying to see unnecessary tombstones IMHO.
Yeah, for 3) I mean the same as you guys, i.e. we are updating two windows by deleting the old records, but with only a single new record. So logically it is like "new window replaces old window1" and also "new window replaces old window2".
For 2), the end/start-time of the window will only be the same as the old one if the update record's ts < end-time and ts > start-time. I think we are on the same page @vvcephei
Call for review @guozhangwang @bbejeck @vvcephei @ableegoldman @cadonna @abbccdda I am not 100% sure if the test coverage is sufficient. Please let me know if you think that more tests are required.
Created ticket for failing test. Java 11 passed. Retest this please.
If you set the flag to true here, you only test the DSL mode of the store, right? The processor API mode is not tested. I think you should test both modes. You could verify the expected timestamps for each mode in two separate test methods with each method instantiating the respective store. For all other expected results you could use the same test methods and run them twice, once with a store in DSL mode and once with the store in Processor API mode.
Could we use if/else if/else for this part of the logic? It could be more reader-friendly.
Not sure what you mean? Btw this code is generated by IntelliJ
I think it improves the style. It's internal-only anyway.
Potentially better to put this after L59, because dslUsage comes after cacheFunction in the parameter list, and after segmentInterval in the argument list.
Shall we reuse entry.entry().context() on L94 here, instead of calling it twice for timestamp()?
Updated this.
(force-pushed c707130 to e676ed0)
Rebased this to resolve merge conflicts.
This is unsafe with protected, non-final fields. Can you just throw an UnsupportedOperationException instead, since you really just need equals for testing?
Needed to add this back, because we use it for testing now... Thoughts?
We use To in hash-collections in our tests? I couldn't find where that happens.
We don't use it explicitly, but it's required for mocking. Eg. https://github.com/apache/kafka/pull/6645/files#diff-2eb683696aa96820098ed11941833ee3R36
Thanks for the context. I think that mock actually only depends on Equals. At least, I replaced this with throw new UnsupportedOperationException();, and the test still passes for me.
Note that it would be equally safe (and maybe a little more intuitive) to just make the fields final. This results in a bigger change, though, because it needs changes in ProcessorContextImpl. That might not be a bad thing.
It seems like the To class is mutable primarily so that ProcessorContextImpl can maintain an immutable reference to it. But there's really no increase in safety between an immutable reference to a mutable object vs a mutable reference to an immutable object. Arguably, the latter is a little better because To is a data container (so immutability is best), whereas ProcessorContextImpl is a full-blown behavioral object that already has a bunch of mutable references to things.
Ack about throwing an exception.
I am open to the other refactoring, but I won't do it in this PR. Feel free to do a MINOR PR directly or create a ticket. I just don't want to convolute this PR too much. Is this ok with you?
Absolutely, I was just offering an alternative to the exception.
dslUsage doesn't seem like the right name for this. It seems like "what it does" is more important than "what it's for". In this case, it controls whether the window-end timestamp is used when flushing, instead of the last update timestamp. Maybe forwardWindowEndTimestamp?
should we also set the context timestamp to bytesKey.window().end()? It's a little confusing, since the flushed data seems to be partly influenced by the context, and partly influenced by the arguments to flushListener.
I see what you are saying. Atm, it does not make a difference, because the FlushListener calls:
context.forward(key, new Change<>(newValue, oldValue), To.all().withTimestamp(timestamp));
and this will set the context timestamp to the window-end timestamp before calling downstream processors.
As a matter of fact, I have the suspicion that we can actually remove the timestamp argument on the FlushListener -- it was added as part of KIP-258, but after some refactoring of our stores, I think we can change it back (my idea was to hold back until KIP-258 is finished). For this case, the problem resolves naturally. Would it be ok if we clean this up as a follow up (to be sure we can remove the timestamp parameter again)?
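The behavior under discussion can be illustrated with a simplified, self-contained sketch of a session-store flush listener that pins the forwarded timestamp to the window end. All types here (`SessionKey`, `Change`, `Forwarded`) are simplified stand-ins for the Kafka Streams internals, not the real classes.

```java
// Simplified sketch (assumed model, not Kafka Streams code): on cache
// flush, the change for a session window is forwarded with the window-end
// timestamp, regardless of the timestamp of the record that caused it.
public class SessionFlushSketch {
    record SessionKey(String key, long windowStart, long windowEnd) {}
    record Change<V>(V newValue, V oldValue) {}
    record Forwarded<V>(SessionKey key, Change<V> change, long timestamp) {}

    // On flush, pin the forwarded record's timestamp to the window end.
    static <V> Forwarded<V> onFlush(SessionKey key, V newValue, V oldValue) {
        return new Forwarded<>(key, new Change<>(newValue, oldValue), key.windowEnd());
    }

    public static void main(String[] args) {
        SessionKey k = new SessionKey("k1", 0L, 5L);
        Forwarded<Long> f = onFlush(k, 5L, 4L);
        System.out.println(f.timestamp());  // the window-end time, 5
    }
}
```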
> Would it be ok if we clean this up as a follow up (to be sure we can remove the timestamp parameter again)?
Of course! Thanks for looking at it.
Updated this.
Java 11 failed. Tracked in Jira. Java 8 passed. Retest this please.
guozhangwang
left a comment
I've had a meta question about the boolean flag, otherwise, lgtm.
I'm wondering if we have to pass in this boolean flag to CachingSessionStore or not; my understanding is that for caching session stores, whenever it has a flush listener and that listener is called, we should always use the window end timestamp, right?
For the DSL yes, but not for the general case. PAPI users should be allowed to define their own semantics IMHO.
If a PAPI user 1) adds a store, and then 2) casts that store to a CachedStateStore and calls setFlushListener, then when that listener is called, the timestamp is passed in and is still not controllable by the user, right? Or how could users specify which timestamp to pass in today?
Thinking about this once more, I actually believe we don't need this flag and we can push all the logic into the flush-listener. Let me update this PR and we can discuss afterwards.
This class must be immutable -- otherwise, using context.forward(..., To.) in combination with suppress() breaks, because suppress() buffers a reference to the context and assumes it's immutable.
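The immutability requirement can be demonstrated with a minimal example (not Kafka code): if a processor like suppress() buffers a reference to a mutable "To"-like object, a later in-place mutation silently rewrites the buffered value. Both classes here are hypothetical stand-ins.

```java
// Minimal demonstration (assumed model, not Kafka Streams code) of why a
// buffered-context pattern requires immutability.
public class MutableAliasing {
    // Mutable variant: withTimestamp() mutates in place and returns this.
    static class MutableTo {
        long timestamp;
        MutableTo withTimestamp(long ts) { this.timestamp = ts; return this; }
    }

    // Immutable variant: withTimestamp() returns a fresh instance.
    record ImmutableTo(long timestamp) {
        ImmutableTo withTimestamp(long ts) { return new ImmutableTo(ts); }
    }

    public static void main(String[] args) {
        MutableTo shared = new MutableTo().withTimestamp(5);
        MutableTo buffered = shared;            // suppress() keeps a reference
        shared.withTimestamp(42);               // a later forward() mutates it
        System.out.println(buffered.timestamp); // 42 -- the buffered value changed!

        ImmutableTo safe = new ImmutableTo(5);
        ImmutableTo kept = safe;                // buffering a reference is safe
        safe = safe.withTimestamp(42);          // creates a new instance instead
        System.out.println(kept.timestamp());   // still 5
    }
}
```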
Add a test for out-of-order data to make sure we set the correct timestamp
Add test for out-of-order data to make sure we set the correct timestamp
This class did not have any testing.
Ayayay. Thanks for adding it!
Updating this test to also check the timestamp -- also add out-of-order record to the test case.
Extend this test for caching and non-caching (to cover the SessionTupleForwarder and SessionCacheFlushListener code paths).
Also update this test to check the result timestamps
Update this test to check result timestamp, too.
Refactored this, and removed the "annoying" flag from the builder classes. Also updated a couple of test cases -- one test case exposes a bug about mutable RecordContext -- I think we don't need to backport this fix, because it's only a problem if …
vvcephei
left a comment
Thanks for the update. Nice work on getting rid of that flag!
This class is unsafe for hashing, because Headers is mutable. Do you need to store it in a hash-collection?
Ah. I forgot about headers. Don't think we need it. Will revert it.
With the new flusher/forwarder, do we need this still, or can we just roll this part of the diff back completely?
Bumping this conversation...
bbejeck
left a comment
Chimed in on the question of delete timestamps, otherwise LGTM.
I'm inclined to agree that it seems the delete should happen at the same time as the update, meaning that we use the timestamp when the delete action occurs, but I could be wrong.
Failures seem related.
Ack. Race condition... forgot to increase the …
Updated this.
@vvcephei I put this fix in to make ProcessorRecordContext immutable -- however, after your comment about headers being mutable, I am wondering if this is an issue we need to address here or not. Maybe not, because it's a general issue and it's the user's responsibility to deep-copy headers if they are modified. Just wanted to double check and point it out.
Ok. If a test does not pass, this method is actually called. Hence, if we don't implement it, we don't get a "yellow" test failure with a proper error message, but the test crashes "red". \cc @vvcephei
Updated this with some minor cleanups.
(force-pushed e01a98e to aebb1cd)
Similar to #6667 -- we should obey sendOldValues.
Some test cleanup to get rid of the ResultCollector processor and reuse MockProcessor instead.
guozhangwang
left a comment
Made another pass on the latest three commits, and it LGTM.
bbejeck
left a comment
Took a look at latest commits, LGTM
(force-pushed aebb1cd to 4387329)
Rebased to resolve merge conflicts. Removed unused classes.
Merged to trunk. Thanks @mjsax
KAFKA-6455: Session Aggregation should use window-end-time as record timestamp (apache#6645)
For session-windows, the result record should have the window-end timestamp as record timestamp. Rebased to resolve merge conflicts. Removed unused classes TupleForwarder and ForwardingCacheFlushListener (replaced with TimestampedTupleForwarder, SessionTupleForwarder, TimestampedCacheFlushListener, and SessionCacheFlushListener).
Reviewers: John Roesler <john@confluent.io>, Bruno Cadonna <bruno@confluent.io>, Boyang Chen <boyang@confluent.io>, Bill Bejeck <bill@confluent.io>, Guozhang Wang <wangguoz@gmail.com>