Skip to content

Conversation

@davidnavas
Copy link

What changes were proposed in this pull request?

Add a clearUntil() method on BitSet (adapted from the pre-existing setUntil() method).
Use this method to clear the subset of the BitSet which needs to be used during merge joins.

How was this patch tested?

dev/run-tests, as well as performance tests on skewed data as described in jira.

I expect there to be a small local performance hit using BitSet.clearUntil rather than BitSet.clear for normally shaped (unskewed) joins (additional read on the last long). This is expected to be de-minimis and was not specifically tested.

@markhamstra
Copy link
Contributor

test this please

@SparkQA
Copy link

SparkQA commented Sep 13, 2016

Test build #3263 has finished for PR 15084 at commit a88af25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@markhamstra
Copy link
Contributor

@rxin @jegonzal looking for a third-party review

*/
def clearUntil(bitIndex: Int) {
val wordIndex = bitIndex >> 6 // divide by 64
var i = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

java.util.Arrays.fill can do this in one line. It won't be faster but could be a tiny bit cleaner. Up to your taste whether you want to change this and other occurrences in the file.

@srowen
Copy link
Member

srowen commented Sep 15, 2016

LGTM

@SparkQA
Copy link

SparkQA commented Sep 15, 2016

Test build #3267 has finished for PR 15084 at commit b3c7127.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 15, 2016

Test build #3268 has finished for PR 15084 at commit b3c7127.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 15, 2016

Test build #3269 has finished for PR 15084 at commit b3c7127.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 15, 2016

Test build #3273 has finished for PR 15084 at commit b3c7127.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def clearUntil(bitIndex: Int): Unit = {
val wordIndex = bitIndex >> 6 // divide by 64
Arrays.fill(words, 0, wordIndex, 0)
if(wordIndex < words.length) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this should say "Clear the remaining bits" but I can fix that on merge.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that was dumb. Sorry and thanks. I can update the PR in about seven hours if you don't merge it today 🤷


if (leftMatches.size <= leftMatched.capacity) {
leftMatched.clear()
leftMatched.clearUntil(leftMatches.size)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect this is a correct change as you describe it, but can you help me be 100% sure by describing why we know the bits above this point will be 0 or don't matter? I'm trying to think of a case where a leftover set 1 bit from previous computation causes a problem. Is that definitely not possible?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was worried about off-by-one errors myself, which is why I added the tests, so I understand the concern.
The simplest argument is that only leftMatches.size bits are ever used (set or checked), otherwise the re-allocation side of the if would have needed to be larger. The default bitset is allocated with "1" as its capacity, so at pretty much anytime the join will have to either clear that one bit, or allocate more. leftMatches.size is the capacity used to re-allocate, so that's all that need to be cleared Same for rightMatches.
You can verify that assumption by seeing how this is used in scanNextInBuffered() -- leftMatched.get(leftIndex) where leftIndex (on that line) is always less than leftMatches.size. And you can see that leftMatches is only updated in that same findMatchingRows() method that clears the bitset. Also, leftMatches is private, so no funny business in subclasses. As far as guaranteeing correctness, you can arrange for the code to misbehave if either the leftKeyGenerator or the leftIter objects passed into the constructor calls advanceNext(), but such code would currently eventually throw an out-of-bounds exception in the cases where the bitset is not yet large enough.

Let me know if that argument isn't coherent, it's still a wee early over here :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK seems reasonable, and tests pass.

@viirya
Copy link
Member

viirya commented Sep 17, 2016

LGTM

@srowen
Copy link
Member

srowen commented Sep 17, 2016

Merged to master

@asfgit asfgit closed this in 9dbd4b8 Sep 17, 2016
wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016
…ge joins

## What changes were proposed in this pull request?

Add a clearUntil() method on BitSet (adapted from the pre-existing setUntil() method).
Use this method to clear the subset of the BitSet which needs to be used during merge joins.

## How was this patch tested?

dev/run-tests, as well as performance tests on skewed data as described in jira.

I expect there to be a small local performance hit using BitSet.clearUntil rather than BitSet.clear for normally shaped (unskewed) joins (additional read on the last long).  This is expected to be de-minimis and was not specifically tested.

Author: David Navas <davidn@clearstorydata.com>

Closes apache#15084 from davidnavas/bitSet.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants