KAFKA-7652: Part I; Fix SessionStore's findSession(single-key) #6134
guozhangwang merged 11 commits into apache:trunk
Conversation
final KeyValueIterator<Windowed<String>, Long> values = sessionStore.findSessions(key, 0, 1000L);
assertEquals(expected, toList(values));
final List<KeyValue<Windowed<String>, Long>> expected2 = Collections.singletonList(KeyValue.pair(a2, 2L));
This added test covers the discovered bug; it will fail if fix 1) is not applied.
final byte[] maxSuffix = ByteBuffer.allocate(SUFFIX_SIZE)
    .putLong(to)
    // start can at most be equal to end
    // the end timestamp can be as large as possible as long as it's larger than start time
This is the fix for 2).
@Override
public KeyValueIterator<Windowed<K>, AGG> findSessions(final K key, final long earliestSessionEndTime, final long latestSessionStartTime) {
    return findSessions(key, key, earliestSessionEndTime, latestSessionStartTime);
final KeyValueIterator<Bytes, byte[]> bytesIterator = bytesStore.fetch(
This is the fix for 1).
Here is the full story.

So let's take an example: let's say you have the following sessions for the same key: … And we query … The key idea here is that, in order to support …, note the ones with … Note this is a conservative range, but still pretty efficient. And this is exactly what we define here: …

3.a) … adds the key-range query as well (i.e. https://github.com/apache/kafka/pull/3027/files#diff-8d2f47dd4b3fa8aa87565fa1f5b01d3fR93). Note the line I highlighted: it just delegates …
3.b) https://github.com/apache/kafka/pull/2972/files fixed a bug in …

And then later we realized that for multi-key range queries it is actually more complicated than we thought (we've discussed this before; details here: …). The end result is that we need a much, much more conservative range for multi-key ranges, and hence this: … In fact, even this range turns out not to be conservative enough, as I've investigated (my PR has also fixed it).

So this actually results in the performance degradation mentioned in https://issues.apache.org/jira/browse/KAFKA-7652, and even worse: because of the not-conservative-enough range for multi-key queries, it also introduces a correctness bug, as reported on the mailing list some time ago. Search for …
cc @twbecker who reported this issue before as well.
Thanks for digging into this @guozhangwang, and for the detailed explanation. One thing I'm not following: the performance issue in https://issues.apache.org/jira/browse/KAFKA-7652 showed us spending significantly more time iterating entries in the NamedCache. If I'm understanding this fix correctly, it should reduce the number of entries scanned through the storeIterator, not the number of entries in the cacheIterator. Is there something I'm missing? It would be great to test this performance improvement.
@cwildman This is a question I was asking myself as well: looking at the code changes from 0.10.2.2 to 0.11.0.0, none of the PRs listed above has a direct impact on the session store caching layer, and their impact should not be on the cache access frequency directly (e.g. https://github.com/apache/kafka/pull/2972/files or KIP-155). So the only thing I can conjecture is that, because of the bug hidden in the underlying store which causes findSessions to miss some key entries, the aggregate processor is less effective at merging sessions into a single one, and hence produces more scattered sessions instead of fewer larger sessions. Over time this would result in more fetches in the caching layer as well (imagine you have lots of shorter sessions vs. a few longer sessions; the …). Without the real workload to reproduce 7652 it is hard to tell, so my plan is to get a cherry-pick PR for 0.11.0 as well and let @jonathanpdx try it out.
Thanks for the fix @guozhangwang!
@guozhangwang I just realized that there is a bug in … This should be …
vvcephei left a comment
@guozhangwang, this is a great find!
I'm still reading all the related code, but I've left one comment explaining what I think is going on with the (1) fix. Maybe you can correct me if I've gotten it wrong.
@Override
public KeyValueIterator<Windowed<K>, AGG> findSessions(final K key, final long earliestSessionEndTime, final long latestSessionStartTime) {
    return findSessions(key, key, earliestSessionEndTime, latestSessionStartTime);
final KeyValueIterator<Bytes, byte[]> bytesIterator = bytesStore.fetch(
Maybe saving someone else from having to dig down and figure out why it's different...
Because of the way the keys are formatted, when we know we're looking for this exact key, we can search essentially from the key [my-key , start time] through [my-key , end time], whereas the other method we were delegating to has to scan from [start-key] to [end-key], which (because of the way the keys are structured) can't also bound the start and end time, but has to iterate over all the rocks keys and filter out the ones that fall outside the desired time bound.
Therefore, this change can significantly tighten the range that actually gets scanned in Rocks.
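To make the key-ordering argument above concrete, here is a minimal sketch. Note that `SessionKeySketch`, `toBinary`, and the exact byte layout are illustrative assumptions, not Kafka's real `SessionKeySchema`/`OrderedBytes` code; it assumes a `[key bytes][8-byte end][8-byte start]` layout under the unsigned lexicographic byte order RocksDB uses, and shows why a single-key query can bound both the key and the timestamps:

```java
import java.nio.ByteBuffer;

public class SessionKeySketch {

    // Illustrative serialization: [key bytes][8-byte end][8-byte start].
    static byte[] toBinary(byte[] key, long end, long start) {
        return ByteBuffer.allocate(key.length + 16)
                .put(key)
                .putLong(end)
                .putLong(start)
                .array();
    }

    // Unsigned lexicographic compare, the order RocksDB applies to byte keys.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] key = {0x01};

        // For a known single key, both bounds share the key as a prefix, so the
        // scan range is tight: [key | earliestEndTime | 0] .. [key | MAX | MAX].
        byte[] lower = toBinary(key, 100L, 0L);
        byte[] upper = toBinary(key, Long.MAX_VALUE, Long.MAX_VALUE);

        byte[] session = toBinary(key, 500L, 100L);                 // same key, end >= 100
        byte[] otherKey = toBinary(new byte[]{0x02}, 500L, 100L);   // different key

        System.out.println(compare(lower, session) < 0 && compare(session, upper) < 0); // true
        System.out.println(compare(otherKey, upper) < 0);                                // false
    }
}
```

Because the fixed key is a prefix of both bounds, every serialized session for that key falls between them, while sessions for any other key compare outside the range entirely.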
bbejeck left a comment
Thanks @guozhangwang, I've made an initial pass; overall this looks good to me. I've left just one comment/question.
What do you think about adding the description at the start of this PR to the documentation, either as javadoc or in our docs?
.putLong(to)
// start can at most be equal to end
// the end timestamp can be as large as possible as long as it's larger than start time
.putLong(Long.MAX_VALUE)
@guozhangwang as I'm reading this PR and getting my head around how it operates, for my own understanding I'm going to rephrase the issue that this line fixes.
The core issue is that setting the upper range to (to, to) caused us to look for an "endTime" earlier than we should, due to the fact that we store session windows as (endTime, startTime); hence we ended up missing sessions during retrieval, which caused more issues down the line.
Is that a fair statement?
That is right. Note that the names here are a bit confusing, since the semantics of findSessions are earliestEndTime (named from here) and latestStartTime (named to here). And going back to my single-key range example in #6134 (comment), but extending it a bit to a multi-key range: the current code uses [latestStartTime, latestStartTime] as the max-suffix to apply with key-to. This is an issue if latestStartTime is smaller than earliestEndTime: note that in our aggregation this would never be the case, but it is okay for other users to call, e.g., findSessions(100, 1), which would use [1, 1] as the suffix with key-to and hence miss entries.
I'll adjust the existing unit test to expose this bug as well.
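The missed-entry scenario above can be sketched minimally (illustrative class and method names, and a simplified `[end, start]` suffix layout rather than Kafka's real `OrderedBytes` handling):

```java
import java.nio.ByteBuffer;

// Illustrates the missed-entry scenario described above, using the simplified
// layout [8-byte end][8-byte start] for the timestamp suffix of a session key.
public class SuffixBoundSketch {

    static byte[] suffix(long end, long start) {
        return ByteBuffer.allocate(16).putLong(end).putLong(start).array();
    }

    // Unsigned lexicographic compare, as RocksDB orders byte keys.
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // findSessions(earliestSessionEndTime = 100, latestSessionStartTime = 1):
        // a session with start = 0 and end = 120 satisfies both conditions.
        byte[] session = suffix(120L, 0L);

        byte[] oldUpper = suffix(1L, 1L);               // buggy bound: [to, to] with to = 1
        byte[] fixedUpper = suffix(Long.MAX_VALUE, 1L); // conservative bound: end unbounded

        System.out.println(compare(session, oldUpper) <= 0);   // false: the session is missed
        System.out.println(compare(session, fixedUpper) <= 0); // true: the session is included
    }
}
```

With the buggy [to, to] suffix the matching session's serialized key sorts above the upper bound and is skipped; with the end timestamp set to Long.MAX_VALUE it falls inside the range.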
.putLong(to)
// start can at most be equal to end
// the end timestamp can be as large as possible as long as it's larger than start time
.putLong(Long.MAX_VALUE)
Since OrderedBytes#upperRange chops off the key at the first key byte that's smaller than the first suffix byte, it seems like setting the "end timestamp" to max long would be counterproductive, since it'll result in chopping off the key at the first non-max-value byte. Maybe it would be better to use the first successor of the "to key"? Actually, we can consider just using https://github.com/facebook/rocksdb/blob/master/java/src/main/java/org/rocksdb/util/BytewiseComparator.java#L73

For example, if our search is:

key: 0xf000 0001, to: 0x0000 0000 0000 0000
=>
key: 0xf000 0001, suffix: 0x7fff ffff ffff ffff 0000 0000 0000 0000

Then, upperRange will trim our query to:

0xf0 7fff ffff ffff ffff 0000 0000 0000 0000

which is way more permissive than the first successor to the key:

0xf000 0002

(It permits max_int - 2 = 2,147,483,645 more keys.)

Please scrutinize my math, I might have made an error.
If there's no error, though, I think it might pay off to inline upperRange here, since the interactions between the methods are so important in determining what key we actually get.
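To help scrutinize the math above, here is a minimal re-implementation of just the truncation rule as described (copy key bytes while each is >= the first suffix byte, then append the suffix). This is a sketch under that stated rule, not Kafka's actual `OrderedBytes#upperRange`, which has additional handling:

```java
import java.nio.ByteBuffer;

// Minimal sketch of the truncation behavior described above: keep key bytes
// while each is >= the first suffix byte, then append the whole suffix.
public class UpperRangeSketch {

    static byte[] upperRange(byte[] key, byte[] maxSuffix) {
        int i = 0;
        while (i < key.length && (key[i] & 0xFF) >= (maxSuffix[0] & 0xFF)) {
            i++;
        }
        byte[] result = new byte[i + maxSuffix.length];
        System.arraycopy(key, 0, result, 0, i);
        System.arraycopy(maxSuffix, 0, result, i, maxSuffix.length);
        return result;
    }

    static String hex(byte[] b) {
        StringBuilder sb = new StringBuilder();
        for (byte x : b) {
            sb.append(String.format("%02x", x));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] key = {(byte) 0xf0, 0x00, 0x00, 0x01};
        // suffix [end = Long.MAX_VALUE, start = 0] = 0x7fffffffffffffff 0000000000000000
        byte[] suffix = new byte[16];
        ByteBuffer.wrap(suffix).putLong(Long.MAX_VALUE).putLong(0L);

        // 0xf0 >= 0x7f, but the next key byte 0x00 < 0x7f, so the key is
        // chopped after its first byte, yielding the permissive bound above:
        System.out.println(hex(upperRange(key, suffix)));
        // f07fffffffffffffff0000000000000000
    }
}
```

Running it reproduces the trimmed bound from the example: the key 0xf000 0001 is cut down to its first byte 0xf0 followed by the max suffix.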
final List<Long> rangeResults = new ArrayList<>();
while (rangeIterator.hasNext()) {
    rangeResults.add(rangeIterator.next().value);
try (final KeyValueIterator<Windowed<String>, Long> iterator =
This is the added unit test that does a findSessions call where the first parameter is larger than the second. Without fix 2) it will fail.
The rest of the changes just wrap Closeable and have no logical changes.
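The semantics that test exercises can be sketched in plain Java (the `Session` record and list-backed store are illustrative stand-ins, not Kafka classes):

```java
import java.util.ArrayList;
import java.util.List;

// Reference semantics for findSessions(earliestSessionEndTime, latestSessionStartTime):
// a session matches if it ends no earlier than earliestEndTime and starts no
// later than latestStartTime. A correct store must return matches even when
// the first argument is larger than the second.
public class FindSessionsSemantics {

    record Session(long start, long end, long value) {}

    static List<Session> findSessions(List<Session> store, long earliestEndTime, long latestStartTime) {
        List<Session> out = new ArrayList<>();
        for (Session s : store) {
            if (s.end() >= earliestEndTime && s.start() <= latestStartTime) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Session> store = List.of(new Session(0L, 120L, 2L));
        // earliestEndTime (100) > latestStartTime (1), yet the session matches:
        System.out.println(findSessions(store, 100L, 1L).size()); // 1
    }
}
```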
assertThat(upper, equalTo(Bytes.wrap(SessionKeySchema.toBinary(
    new Windowed<>(Bytes.wrap(new byte[]{0xA, 0xB, 0xC}), new SessionWindow(0, 0))))
    new Windowed<>(Bytes.wrap(new byte[]{0xA}), new SessionWindow(0, Long.MAX_VALUE))))
This test needs to be modified because of change 2).
cc @ableegoldman as well.
@vvcephei I've thought about the fix in 2) again, but cannot come up with a better upper bound than this very conservative measure. LMK if you have better ideas.
dguy left a comment
Thanks @guozhangwang, LGTM subject to other people's comments.
verifyWindowedKeyValue(all.next(), new Windowed<>(keyAA, new SessionWindow(0, 0)), "1");
verifyWindowedKeyValue(all.next(), new Windowed<>(keyB, new SessionWindow(0, 0)), "1");
assertFalse(all.hasNext());
bbejeck left a comment
Thanks for the fix @guozhangwang LGTM
Let findSessions(final K key) call the underlying bytes store directly, using the more restricted range. Fix the conservative upper range for multi-key ranges in the session schema. Minor: removed the unnecessary private WrappedSessionStoreBytesIterator class, as it is only used in unit tests. Minor: removed the unnecessary schema#init function by using the direct bytes-to-binary function. Please read the original PR for a more detailed explanation of the root cause of the bug.

Reviewers: Bill Bejeck <bill@confluent.io>, Damian Guy <damian@confluent.io>, John Roesler <john@confluent.io>
Cherry-picked to 2.1 as well.
* ak/trunk: MINOR: fix race condition in KafkaStreamsTest (apache#6185) KAFKA-4850: Enable bloomfilters (apache#6012) MINOR: ducker-ak: add down -f, avoid using a terminal in ducker test KAFKA-5117: Stop resolving externalized configs in Connect REST API MINOR: Cleanup handling of mixed transactional/idempotent records (apache#6172) KAFKA-7844: Use regular subproject for generator to fix *All targets (apache#6182) Fix Documentation for cleanup.policy is out of date (apache#6181) MINOR: increase timeouts for KafkaStreamsTest (apache#6178) MINOR: Rejoin split ssl principal mapping rules (apache#6099) MINOR: Handle case where connector status endpoints returns 404 (apache#6176) MINOR: Remove unused imports, exceptions, and values (apache#6117) KAFKA-3522: Add internal RecordConverter interface (apache#6150) Fix Javadoc of KafkaConsumer (apache#6155) KAFKA-6455: Extend CacheFlushListener to forward timestamp (apache#6147) MINOR: Log partition info when creating new request batch in controller (apache#6145) KAFKA-7652: Part I; Fix SessionStore's findSession(single-key) (apache#6134) MINOR: Remove the InvalidTopicException handling in InternalTopicManager (apache#6167) [KAFKA-7024] Rocksdb state directory should be created before opening the DB (apache#6138) MINOR:: Fix typos (apache#6079)