-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-633/634: [Java] Add FixedSizeBinary support in Java and integration tests (Updated) #1492
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| public abstract class BaseFixedWidthVector extends BaseValueVector | ||
| implements FixedWidthVector, FieldVector, VectorDefinitionSetter { | ||
| private final byte typeWidth; | ||
| private final int typeWidth; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not changed the derived vector classes to use integer type. Should I?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it matters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TYPE_WIDTH const in the subclass is a static variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for other types, they can be static, for fixed size binary, it has to be non-static
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe if you can update BitVector since it currently casts down to a byte
| if (isSet(index) == 0) { | ||
| return null; | ||
| } else { | ||
| return get(index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Expand this function to avoid duplicate check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expanded
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I didn't see the change. Did you push the change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvm. This looks good now.
| if (byteWidth == 0) { | ||
| byteWidth = value.length; | ||
| } else if (byteWidth != value.length) { | ||
| throw new IOException("mismatch byte width (" + value.length + ") at index " + i + ", expecting " + byteWidth); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean the input is malformatted? If this is, then it probably shouldn't be IOException..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to RuntimeException
| } | ||
| } | ||
| if (count > 0 && byteWidth == 0) { | ||
| throw new IOException("could not determine the byte width of the vector because all elements are null"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to RuntimeException
| values.add(value); | ||
| if (value.length > 0) { | ||
| if (byteWidth == 0) { | ||
| byteWidth = value.length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you read the byteWidth from the schema rather than interpret it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
byteWidth info is now collected from the vector and passed in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I am not sure I understand.. Why don't we use the byteWidth in schema to inform the expected length here?
|
|
||
| ArrowBuf buf = allocator.buffer(byteWidth * count); | ||
| for (byte[] value : values) { | ||
| buf.writeBytes(value.length == 0? new byte[byteWidth] : value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can there be empty values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For nullable vectors, when the value is null, the corresponding JSON field is empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see
| * @param index position of the element to set | ||
| * @param holder holder that carries data buffer. | ||
| */ | ||
| public void set(int index, NullableFixedSizeBinaryHolder holder) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we call this FixedSizeBinaryHolder instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are both Holders and NullableHolders.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see
| to.copyFromSafe(fromIndex, toIndex, FixedSizeBinaryVector.this); | ||
| } | ||
| } | ||
| } No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need a newline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
|
High level looks good to me. Need to come back and look in more detail. |
|
Anybody knows what should I do with the check failure: The command "sudo -E apt-get -yq --no-install-suggests --no-install-recommends --force-yes install gcc-4.9 g++-4.9 gdb ccache valgrind libboost-dev libboost-filesystem-dev libboost-system-dev libjemalloc-dev gtk-doc-tools autoconf-archive libgirepository1.0-dev" failed and exited with 100 during . Also, more comments on this PR? |
|
@siddharthteotia or @BryanCutler Could one of you please also take a look? Thanks! |
|
@alphalfalfa we've been experiencing a lot of apt flakiness in Travis CI. Someone should escalate it with the support system to see what's going on |
|
I just restarted the failing job |
| BufferReader INT8 = new BufferReader() { | ||
| @Override | ||
| protected ArrowBuf read(BufferAllocator allocator, int count) throws IOException { | ||
| protected ArrowBuf read(BufferAllocator allocator, int count, int byteWidth) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... I am not sure about this. Having byteWidth in every reader seems weird because most don't need it. Can you put make a member variable of fixed size binary reader?
integration/integration_test.py
Outdated
| def _get_buffers(self): | ||
| data = [] | ||
| for i, v in enumerate(self.values): | ||
| data.append(self._encode_value(v if self.is_valid[i] else "")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems strange. Why are we using empty string to represent null value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can probably just use any value for null instead of empty string. It should make the logic cleaner. (less if statements)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh... I put empty strings as null. sure, i can put some arbitrary strings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made the changes and simplified the logic inside JsonFileReader
| if (bufferType.equals(DATA) && (vector.getMinorType() == Types.MinorType.VARCHAR || | ||
| vector.getMinorType() == Types.MinorType.VARBINARY)) { | ||
| writeValueToGenerator(bufferType, vectorBuffer, vectorBuffers.get(v-1), vector, i, scale); | ||
| writeValueToGenerator(bufferType, vectorBuffer, vectorBuffers.get(v-1), vector, i, scale, byteWidth); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method doesn't need to take scale and byteWidth. It can figure this out from vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, modified.
|
@siddharthteotia @jacques-n could we get some more eyes on this? Thanks! |
BryanCutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @alphalfalfa , looks pretty good! I just had some minor suggestions
integration/integration_test.py
Outdated
| class FixedSizeBinaryType(PrimitiveType): | ||
|
|
||
| def __init__(self, name, byte_width, nullable=True): | ||
| PrimitiveType.__init__(self, name, nullable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: I think super(..).__init__(..) is a little better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| public abstract class BaseFixedWidthVector extends BaseValueVector | ||
| implements FixedWidthVector, FieldVector, VectorDefinitionSetter { | ||
| private final byte typeWidth; | ||
| private final int typeWidth; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe if you can update BitVector since it currently casts down to a byte
|
|
||
| @Override | ||
| public TypeLayout visit(FixedSizeBinary type) { | ||
| return newFixedWidthTypeLayout(BufferLayout.dataBuffer(type.getByteWidth() * 8)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you just do?
return newFixedWidthTypeLayout(new BufferLayout(BufferType.DATA, type.getByteWidth() * 8)
Then you don't need to BufferLayout dataBuffer(int typeBitWidth)
and then keep the
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| return VALUES_128; | ||
| default: | ||
| throw new IllegalArgumentException("only 8, 16, 32, or 64 bits supported"); | ||
| return new BufferLayout(BufferType.DATA, typeBitWidth); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think function is really "get data buffer layout for types that support certain bit widths" - so it might be beneficial to keep the exception, although it looks like it should be updated to include 128. I'm not sure how exactly the exception would be thrown, but if somehow a buffer layout for an Int with 30 bit width was created, this would catch it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverted and added 128 bits to the exception message
| values.add(value); | ||
| } | ||
|
|
||
| int byteWidth = count > 0? values.get(0).length : 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add space before ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
…ng dataBuffer() method and some other changes based on PR comments
|
@siddharthteotia Do you think we can merge this? LGTM but want to double check with you. |
|
It seems like we are missing support in the ComplexWriter and FieldReader interfaces. I think that should be included in introduction of these things. The patch here looks fine but I think we should include that before releasing this feature (either in this patch or a follow-up). |
|
Per my comment, I'm +1 on the spirit on these changes but think the patch (or feature) is somewhat incomplete |
|
@jacques-n, to make sure that I understand the question properly, do you mean there should be proper interfaces defined for FixedSizeBinary type inside AbstractFieldWriter and AbstractFieldReader? I would prefer to add the necessary fix in a separate PR as this one has grown too big if it is OK. |
|
You're right. I forgot that the code is autogenerated and was looking for it here. I'm +1 on this PR. |
|
thanks all, please open any follow-up JIRAs |
…tion tests (Updated) The original PR is at apache#1012. Due to the major refactoring last year, changes are big. So I created this separate PR for easier to review and will cancel the other one later. Changes include: * Arrow-633: [Java] Add support for FixedWidthBinary type * Arrow-634: Add integration tests for FixedSizeBinary Author: Jingyuan Wang <jingyuan@live.com> Author: jingyuan <jingyuan.nt@gmail.com> Closes apache#1492 from alphalfalfa/updated-arrow-633-634 and squashes the following commits: c994a19 [jingyuan] create data buffer layout for FixedSizeBinary directly instead of using dataBuffer() method and some other changes based on PR comments 873c5b2 [jingyuan] do not pass scale or byteWidth into writeValueToGenerator in JsonFileWriter.java 4fbb67d [jingyuan] put arbitrary values for FixSizeBinary nulls 0c0015d [Jingyuan Wang] add a new line a8a9553 [Jingyuan Wang] Pass byteWidth info for FixedSizeBinary type in JsonFileReader 1a64c5c [Jingyuan Wang] expand get() method inside getObject() method to remove duplicate check of isSet() 071fb25 [Jingyuan Wang] ARROW-633/634: Add FixedSizeBinary support in Java and integration tests
The original PR is at #1012. Due to the major refactoring last year, changes are big. So I created this separate PR for easier to review and will cancel the other one later.
Changes include: