ARROW-7072: [Java] Support concating validity bits efficiently #5782

liyafan82 · 2019-11-06T10:56:46Z

For scenarios where we need to concatenate vectors (such as the one in ARROW-7048, and delta dictionaries), we need a way to concatenate validity bits.

Currently, we only have a bit-level API that reads or writes one validity bit at a time. This is inefficient, so we need a way to copy many bits at once.
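
To make the cost concrete, here is a minimal sketch of bit-at-a-time concatenation. It is an illustration only: the class and method names are hypothetical, and a plain byte[] stands in for ArrowBuf. It assumes the usual Arrow layout in which bit i lives in byte i / 8 at position i % 8, least-significant bit first.

```java
final class BitwiseConcatSketch {
  // Hypothetical helper: read bit `index` (least-significant bit first within each byte).
  static int getBit(byte[] buf, int index) {
    return (buf[index >> 3] >> (index & 7)) & 1;
  }

  // Hypothetical helper: set or clear bit `index`.
  static void setBit(byte[] buf, int index, int value) {
    if (value != 0) {
      buf[index >> 3] |= (byte) (1 << (index & 7));
    } else {
      buf[index >> 3] &= (byte) ~(1 << (index & 7));
    }
  }

  // One read plus one read-modify-write per bit: correct, but slow for long buffers.
  static void concatBitByBit(byte[] in1, int numBits1, byte[] in2, int numBits2, byte[] out) {
    for (int i = 0; i < numBits1; i++) {
      setBit(out, i, getBit(in1, i));
    }
    for (int i = 0; i < numBits2; i++) {
      setBit(out, numBits1 + i, getBit(in2, i));
    }
  }
}
```

Every bit costs a load, a shift, a mask and a store, which is the overhead a byte-oriented copy is meant to avoid.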

github-actions · 2019-11-06T11:04:00Z

https://issues.apache.org/jira/browse/ARROW-7072

emkornfield · 2019-11-12T07:12:11Z

java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java

+    }
+    output.setZero(numBytes1, numBytesOut);
+
+    if (numBits1 % 8 == 0) {


Use a mask instead? I think we have a utility method for this.

Good point. We have the bitIndex utility method.
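
For readers unfamiliar with the idiom, this is roughly what mask-based index helpers look like. The name bitIndex comes from the reply above; byteIndex and both method bodies are my assumption about what such helpers do, not a copy of the Arrow source.

```java
final class BitIndexSketch {
  // Assumed behaviour: index of the byte that holds the given bit (same as n / 8 for n >= 0).
  static int byteIndex(int absoluteBitIndex) {
    return absoluteBitIndex >> 3;
  }

  // Assumed behaviour: position of the bit inside its byte (same as n % 8 for n >= 0).
  static int bitIndex(int absoluteBitIndex) {
    return absoluteBitIndex & 7;
  }
}
```

For non-negative inputs `n & 7` and `n % 8` agree, so `numBits1 % 8 == 0` can be written as `bitIndex(numBits1) == 0`. They differ for negative inputs (`-1 % 8` is -1, while `-1 & 7` is 7), which does not matter for bit counts.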

emkornfield · 2019-11-12T07:12:43Z

java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java

+    }
+    output.setZero(numBytes1, numBytesOut);
+
+    if (numBits1 % 8 == 0) {


please document what this case represents.

Good point. Documentation added.

emkornfield · 2019-11-12T07:13:18Z

java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java

+    }
+
+    // the number of bits to fill a full byte after the first input is processed
+    int numBitsToFill = 8 - (numBits1 % 8);


mask instead of mod, don't we have a utility method for this?

Revised. Thanks.

emkornfield · 2019-11-12T07:15:00Z

java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java

+      byte curByte = input2.getByte(i);
+
+      // first fill the bits to a full byte
+      int byteToFill = (curByte & 0xff) << (8 - numBitsToFill);


is the mask here necessary, isn't it already a byte?

We must make sure the 24 high bits of the int are zero.
Without the mask, those high bits will all be 1 if the highest bit of curByte happens to be 1.
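
A small standalone sketch of the sign-extension effect described here. The class and method names are hypothetical and exist only for illustration.

```java
final class SignExtensionSketch {
  // Unmasked: the byte is sign-extended, so a set top bit fills the high 24 bits with 1s.
  static int widenUnmasked(byte b) {
    return b;
  }

  // Masked: only the low 8 bits survive, the high 24 bits are zero.
  static int widenMasked(byte b) {
    return b & 0xff;
  }

  // Unsigned right shift on the sign-extended value drags 1s into the low byte.
  static byte highBitsUnmasked(byte b, int shift) {
    return (byte) (b >>> shift);
  }

  // Masking first gives the expected high bits of the original byte.
  static byte highBitsMasked(byte b, int shift) {
    return (byte) ((b & 0xff) >>> shift);
  }
}
```

With `b = (byte) 0x80`, `widenUnmasked` yields 0xFFFFFF80 while `widenMasked` yields 0x80. The difference survives truncation back to a byte in the right-shift case: `highBitsUnmasked(b, 4)` is 0xF8, whereas `highBitsMasked(b, 4)` is 0x08.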

emkornfield · 2019-11-12T07:17:00Z

java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java

+      byte curByte = input2.getByte(i);
+
+      // first fill the bits to a full byte
+      int byteToFill = (curByte & 0xff) << (8 - numBitsToFill);


is upcast to an int necessary?

According to the Java Language Specification, a byte is automatically promoted to an int, and all computations are performed on ints. So we widen it to int once and keep the intermediate values as int, to avoid unnecessary casts.

For details, please see:
https://stackoverflow.com/questions/27582233/why-byte-and-short-values-are-promoted-to-int-when-an-expression-is-evaluated
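
The promotion rule can be observed directly. The following is a hypothetical demonstration class, not code from the patch.

```java
final class PromotionSketch {
  // The expression `b << 1` has type int, so it autoboxes to Integer rather than Byte.
  static String shiftResultType(byte b) {
    Object boxed = b << 1;
    return boxed.getClass().getSimpleName();
  }

  // Storing the result back into a byte requires an explicit narrowing cast.
  static byte doubled(byte b) {
    return (byte) (b << 1);
  }
}
```

Keeping intermediates such as byteToFill in int variables therefore adds no conversion; the only narrowing happens when the value is finally written to the buffer.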

emkornfield · 2019-11-12T07:17:27Z

java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java

+
+      // fill remaining bits in the current byte
+      int remByte = (curByte & 0xff) >>> numBitsToFill;
+      output.setByte(numBytes1 + i, remByte);


it seems like you could avoid one memory access by keeping this byte cached, instead of writing it here and then reading it back again at the beginning of the loop.

Good point. Revised accordingly. Thank you.

it still seems like this is setting memory twice per loop?

Sorry, I misunderstood your point at first. I have revised the code accordingly. Thanks for the great suggestion.
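
The idea in this thread can be sketched as follows: carry the leftover high bits of each input byte in a local variable, so that every output byte is written once inside the loop and never read back. This is a simplified illustration on plain byte[] arrays under my own assumptions (hypothetical class name, least-significant-bit-first layout, output preallocated and not aliasing the second input); it is not the code that was merged.

```java
final class ConcatBitsSketch {
  static void concatBits(byte[] in1, int numBits1, byte[] in2, int numBits2, byte[] out) {
    int numBytes1 = (numBits1 + 7) >> 3;
    int numBytes2 = (numBits2 + 7) >> 3;
    int numBytesOut = (numBits1 + numBits2 + 7) >> 3;
    if (in1 != out) {
      System.arraycopy(in1, 0, out, 0, numBytes1);
    }
    int used = numBits1 & 7; // valid bits in the last byte of the first input
    if (used == 0) {
      // The first input ends on a byte boundary: the second input is a plain byte copy.
      System.arraycopy(in2, 0, out, numBytes1, numBytes2);
    } else {
      int numBitsToFill = 8 - used;
      // Cached partial byte: the valid low bits of the last byte of the first input.
      int carry = out[numBytes1 - 1] & ((1 << used) - 1);
      int outIdx = numBytes1 - 1;
      for (int i = 0; i < numBytes2; i++) {
        int cur = in2[i] & 0xff;
        // Low bits of cur complete the pending byte; each output byte is stored once here.
        out[outIdx++] = (byte) (carry | (cur << used));
        // High bits of cur become the low bits of the next output byte.
        carry = cur >>> numBitsToFill;
      }
      if (outIdx < numBytesOut) {
        out[outIdx] = (byte) carry;
      }
    }
    // Clear any bits past the logical end in the last output byte.
    int tail = (numBits1 + numBits2) & 7;
    if (tail != 0) {
      out[numBytesOut - 1] &= (byte) ((1 << tail) - 1);
    }
  }
}
```

For example, concatenating the 3 bits of 0b101 with 9 set bits produces the two bytes 0xFD and 0x0F.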

emkornfield · 2019-11-12T07:17:58Z

java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java

+    int numBitsToFill = 8 - (numBits1 % 8);
+
+    // the number of extra bits for the second input, relative to full bytes
+    int numRemainingBits = numBits2 % 8;


move this closer to where it is used.

Good point. Thanks.

emkornfield · 2019-11-12T07:19:41Z

java/vector/src/test/java/org/apache/arrow/vector/TestBitVectorHelper.java

+        }
+      }
+
+      try (ArrowBuf buf1 = allocator.buffer(1024);


please have a separate test case for each block.

Sure. I have revised the test case, so this problem no longer exists.

emkornfield · 2019-11-12T07:20:58Z

java/vector/src/test/java/org/apache/arrow/vector/TestBitVectorHelper.java

+        buf1.setZero(0, buf1.capacity());
+        buf2.setZero(0, buf2.capacity());
+
+        final int count1 = 100;


these numbers seem arbitrary; it might be more obvious if you set bit patterns using hex, or if you commented on the expected bit patterns, instead of having loops.

Good suggestion. I have revised the test cases and added comments that state the purpose of each case explicitly.

emkornfield · 2019-11-12T07:22:48Z

java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java

+    if (input1 != output) {
+      PlatformDependent.copyMemory(input1.memoryAddress(), output.memoryAddress(), numBytes1);
+    }
+    output.setZero(numBytes1, numBytesOut);


this seems redundant except for the last byte if there is a remainder.

Good point. This is removed, and we pay special attention to set the last byte.

emkornfield · 2019-11-14T07:18:53Z

java/vector/src/test/java/org/apache/arrow/vector/TestBitVectorHelper.java

+    BitVectorHelper.concatBits(buf1, count1, buf2, count2, output);
+    for (int i = 0; i < count1 + count2; i++) {
+      int result = BitVectorHelper.get(output, i);
+      if (i < count1) {


I think it would be clearer and less reliant on the input setup to write this as:

int outputIdx = 0;
for (int i = 0; i < count1; i++, outputIdx++) {
  assertEquals(BitVectorHelper.get(output, outputIdx), BitVectorHelper.get(buf1, i));
}
for (int i = 0; i < count2; i++, outputIdx++) {
  assertEquals(BitVectorHelper.get(output, outputIdx), BitVectorHelper.get(buf2, i));
}

with this setup, for instance you could "fuzz" by generating random bit strings in addition to the set pattern here.

Good suggestion. Revised accordingly. Thank you.
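
A rough sketch of what such a fuzz check could look like outside the Arrow test suite. Everything here is hypothetical: the BitConcat interface, the class name, and the use of byte[] in place of ArrowBuf are assumptions made so the example is self-contained.

```java
import java.util.Random;

final class ConcatFuzzSketch {
  // Hypothetical shape of the function under test.
  interface BitConcat {
    void concat(byte[] in1, int numBits1, byte[] in2, int numBits2, byte[] out);
  }

  // Read bit `index`, least-significant bit first within each byte.
  static int getBit(byte[] buf, int index) {
    return (buf[index >> 3] >> (index & 7)) & 1;
  }

  // Returns true if every output bit matches the corresponding input bit for all random rounds.
  static boolean holdsForRandomInputs(BitConcat impl, long seed, int rounds) {
    Random random = new Random(seed);
    for (int r = 0; r < rounds; r++) {
      int count1 = random.nextInt(200);
      int count2 = random.nextInt(200);
      byte[] buf1 = new byte[(count1 + 7) >> 3];
      byte[] buf2 = new byte[(count2 + 7) >> 3];
      random.nextBytes(buf1);
      random.nextBytes(buf2);
      byte[] output = new byte[(count1 + count2 + 7) >> 3];
      impl.concat(buf1, count1, buf2, count2, output);

      int outputIdx = 0;
      for (int i = 0; i < count1; i++, outputIdx++) {
        if (getBit(output, outputIdx) != getBit(buf1, i)) {
          return false;
        }
      }
      for (int i = 0; i < count2; i++, outputIdx++) {
        if (getBit(output, outputIdx) != getBit(buf2, i)) {
          return false;
        }
      }
    }
    return true;
  }
}
```

Because the check compares output bits against input bits directly, it does not depend on how the inputs were populated, which is what makes random inputs usable.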

emkornfield · 2019-11-21T05:48:27Z

java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java

+   * @param numBits1 the number of bits in the first validity buffer.
+   * @param input2 the second validity buffer.
+   * @param numBits2 the number of bits in the second validity buffer.
+   * @param output the ouput validity buffer. It can be the same one as the first input.


this needs to be preallocated?

Sure. I have stated this explicitly in the JavaDoc.

emkornfield · 2019-11-21T05:52:52Z

java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java

+      prevByte = curByte >>> numBitsToFill;
+    }
+
+    // clear high bits for the previous byte, as it may be the last byte


This comment doesn't seem to make sense anymore?

Nice catch. Thank you.

emkornfield · 2019-11-21T06:02:07Z

java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java

+    int lastOutputByte = prevByte;
+
+    // the number of extra bits for the second input, relative to full bytes
+    int numRemainingBits = bitIndex(numBits2);


numTrailingBits I think would be a better name.

It looks better to me too. Thank you.

emkornfield · 2019-11-21T06:04:06Z

java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java

+      int byteToFill = remByte << (8 - numBitsToFill);
+      lastOutputByte |= byteToFill;
+
+      if (numRemainingBits > numBitsToFill) {


I think the logic would be clearer if you moved this if statement so that it is the last piece of logic, after output.setByte on line 397. Otherwise it seems strange to be setting the non-ending byte as the last byte.

I agree with you that it is weird that the last two bytes are written in reverse order.

I have revised the code to solve this problem. Please take a look. Thanks a lot.

emkornfield

Mostly looks good. A few more comments to talk through, then I think this can be merged.

emkornfield · 2019-11-22T05:51:35Z

+1 thank you.

For scenarios when we need to concate vectors (like the scenario in ARROW-7048, and delta dictionary), we need a way to concat validity bits.

Currently, we have bit level API to read/write individual validity bit. However, it is not efficient, and we need a way to copy more bits at a time.

Closes apache#5782 from liyafan82/fly_1106_bits and squashes the following commits:

70c1173 <liyafan82> Resolve comments
d7bcfaa <liyafan82> Further improve the performance
4f1b9af <liyafan82> Resolve comments
591cf74 <liyafan82> Support concating validity bits efficiently

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>

[ARROW-7072][Java] Support concating validity bits efficiently

591cf74

emkornfield added the Component: Java label Nov 8, 2019

emkornfield requested changes Nov 12, 2019

emkornfield reviewed Nov 12, 2019

[ARROW-7072][Java] Resolve comments

4f1b9af

emkornfield reviewed Nov 14, 2019

[ARROW-7072][Java] Further improve the performance

d7bcfaa

emkornfield reviewed Nov 21, 2019

emkornfield requested changes Nov 21, 2019

[ARROW-7072][Java] Resolve comments

70c1173

emkornfield closed this in 74fa956 Nov 22, 2019

asfimport mentioned this pull request Nov 22, 2019

[Java] Support concating validity bits efficiently #23381

Closed
