[SPARK-17529][core] Implement BitSet.clearUntil and use it during merge joins #15084
test this please
Test build #3263 has finished for PR 15084 at commit
   */
  def clearUntil(bitIndex: Int) {
    val wordIndex = bitIndex >> 6 // divide by 64
    var i = 0
java.util.Arrays.fill can do this in one line. It won't be faster but could be a tiny bit cleaner. Up to your taste whether you want to change this and other occurrences in the file.
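To make the suggestion concrete, here is a minimal sketch comparing the two forms. The excerpt above cuts off after `var i = 0`, so the loop body shown here is an assumption about what the first revision looked like, and `FillSketch` and both method names are hypothetical.

```scala
import java.util.Arrays

object FillSketch {
  // Loop form (assumed shape of the first revision; only `var i = 0`
  // is visible in the excerpt under review).
  def zeroPrefixLoop(words: Array[Long], wordIndex: Int): Unit = {
    var i = 0
    while (i < wordIndex) {
      words(i) = 0L
      i += 1
    }
  }

  // One-line form: zeroes words in the range [0, wordIndex), end exclusive.
  def zeroPrefixFill(words: Array[Long], wordIndex: Int): Unit =
    Arrays.fill(words, 0, wordIndex, 0L)
}
```

Both methods leave every word at or above `wordIndex` untouched, so they should be interchangeable for this purpose.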
LGTM
Test build #3267 has finished for PR 15084 at commit
Test build #3268 has finished for PR 15084 at commit
Test build #3269 has finished for PR 15084 at commit
Test build #3273 has finished for PR 15084 at commit
  def clearUntil(bitIndex: Int): Unit = {
    val wordIndex = bitIndex >> 6 // divide by 64
    Arrays.fill(words, 0, wordIndex, 0)
    if(wordIndex < words.length) {
Ah, this should say "Clear the remaining bits" but I can fix that on merge.
Well, that was dumb. Sorry and thanks. I can update the PR in about seven hours if you don't merge it today 🤷
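For readers following the excerpt, here is a self-contained sketch of how the whole method might fit together. The excerpt is cut off after the `if`, so the masking step is an assumption. `SimpleBitSet` is a hypothetical stand-in, not Spark's `BitSet` class.

```scala
import java.util.Arrays

// Hypothetical minimal bitset backed by 64-bit words (not Spark's class).
final class SimpleBitSet(numBits: Int) {
  private val words = new Array[Long]((numBits + 63) >> 6)

  def capacity: Int = words.length * 64

  def set(index: Int): Unit =
    words(index >> 6) |= 1L << (index & 0x3f)

  def get(index: Int): Boolean =
    (words(index >> 6) & (1L << (index & 0x3f))) != 0

  def clear(): Unit = Arrays.fill(words, 0L)

  // Clears all bits with index < bitIndex; higher bits are left as they are.
  def clearUntil(bitIndex: Int): Unit = {
    val wordIndex = bitIndex >> 6 // divide by 64
    Arrays.fill(words, 0, wordIndex, 0L)
    if (wordIndex < words.length) {
      // Clear the remaining bits: keep only positions >= (bitIndex mod 64)
      // in the partially covered word. This step is assumed, not quoted.
      words(wordIndex) &= -1L << (bitIndex & 0x3f)
    }
  }
}
```

When `bitIndex` is a multiple of 64, the mask is all ones, so the last word is read and written back unchanged.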
  if (leftMatches.size <= leftMatched.capacity) {
    leftMatched.clear()
    leftMatched.clearUntil(leftMatches.size)
I suspect this is a correct change as you describe it, but can you help me be 100% sure by describing why we know the bits above this point will be 0 or don't matter? I'm trying to think of a case where a leftover 1 bit set by a previous computation causes a problem. Is that definitely not possible?
I was worried about off-by-one errors myself, which is why I added the tests, so I understand the concern.
The simplest argument is that only leftMatches.size bits are ever used (set or checked); otherwise the re-allocation side of the if would have needed to be larger. The default bitset is allocated with "1" as its capacity, so at pretty much any time the join will have to either clear that one bit or allocate more. leftMatches.size is the capacity used to re-allocate, so that's all that needs to be cleared. Same for rightMatches.
You can verify that assumption by seeing how this is used in scanNextInBuffered(): leftMatched.get(leftIndex), where leftIndex (on that line) is always less than leftMatches.size. And you can see that leftMatches is only updated in that same findMatchingRows() method that clears the bitset. Also, leftMatches is private, so no funny business in subclasses. As far as guaranteeing correctness goes, you can arrange for the code to misbehave if either the leftKeyGenerator or the leftIter object passed into the constructor calls advanceNext(), but such code would currently eventually throw an out-of-bounds exception in the cases where the bitset is not yet large enough.
Let me know if that argument isn't coherent, it's still a wee early over here :)
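The argument above can be sketched as follows, using `java.util.BitSet` as a stand-in because Spark's own `BitSet` is not reproduced here. `resetMatched` and `countMatched` are hypothetical helpers, and `clear(0, n)` plays the role of `clearUntil(n)`. Unlike the scenario described above, `java.util.BitSet` does not throw on reads past its size.

```scala
import java.util.{BitSet => JBitSet}

object MatchedReuseSketch {
  // Hypothetical helper: returns a bitset whose first n bits are all 0,
  // reusing the existing one when it is already large enough.
  def resetMatched(matched: JBitSet, n: Int): JBitSet =
    if (n <= matched.size()) {
      matched.clear(0, n) // stand-in for clearUntil(n)
      matched
    } else {
      new JBitSet(n) // freshly allocated, so all bits start at 0
    }

  // Reads only indices below n, mirroring the claim that every
  // get(leftIndex) call uses leftIndex < leftMatches.size.
  def countMatched(matched: JBitSet, n: Int): Int =
    (0 until n).count(i => matched.get(i))
}
```

Stale 1 bits at index `n` or higher can survive a reuse, but they are harmless under the stated invariant because no reader looks at them before the next reset.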
OK seems reasonable, and tests pass.
LGTM
Merged to master
…ge joins
Author: David Navas <davidn@clearstorydata.com>
Closes apache#15084 from davidnavas/bitSet.
What changes were proposed in this pull request?
Add a clearUntil() method on BitSet (adapted from the pre-existing setUntil() method).
Use this method to clear the subset of the BitSet which needs to be used during merge joins.
How was this patch tested?
dev/run-tests, as well as performance tests on skewed data as described in JIRA.
I expect there to be a small local performance hit using BitSet.clearUntil rather than BitSet.clear for normally shaped (unskewed) joins (an additional read of the last long). This is expected to be de minimis and was not specifically tested.
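The trade-off described here can be illustrated with a rough word-count model. This is a sketch under the assumption that `clearUntil` has the shape shown in the review excerpt (a prefix fill, then a read-modify-write of one more word when in bounds). `ClearCostSketch` and its methods are hypothetical, and no timings are implied.

```scala
object ClearCostSketch {
  // 64-bit words written by a full clear() of a bitset with this capacity.
  def wordsForClear(capacityBits: Int): Int = (capacityBits + 63) >> 6

  // Words touched by clearUntil(n): n >> 6 words are zeroed outright, and
  // the next word (if it exists) is read, masked, and written back.
  def wordsForClearUntil(n: Int, capacityBits: Int): Int = {
    val fullWords = n >> 6
    if (fullWords < wordsForClear(capacityBits)) fullWords + 1 else fullWords
  }
}
```

Under this model, a bitset grown to 1,000,000 bits by one skewed key costs 15625 word writes per `clear()`, while `clearUntil(10)` touches a single word. For a small unskewed join both touch one word, and the only difference is the extra read.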