-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-633/ARROW-634: Add Java support and integration test for FixedSizeBinary #1012
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| fields.append(get_field(type_ + "_nonnullable", type_, False)) | ||
|
|
||
| return _generate_file("primitive", fields, batch_sizes) | ||
| name = "primitive" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added name suffix for primitive case such that primitive test case at line 823/824 become both effective. Previously the first one was overwritten by the second one and hence ignored.
| ] | ||
| }, | ||
| { | ||
| major: "Fixed", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FixedSizeBinary has a variable width, I am not sure what integer value to put here. I went ahead and put -1 to mark it as a special case. This changes also leads to quite some changes of if statements in the codegen template FixedValueVectors.java.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about this change. FixValueVectors have static type width that is specified in ValueVectorTypes.tdd. Having type width to be dynamic would complicated the template and result in inconsistency in the resulting classes.
Is there a reason FixedSizeBinary needs to have dynamic type witdh? i.e., can we have byteWitdh = 1 and use the type width in the Schema to retrieve and store values? This would simplify things.
cc @wesm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bottom line is that FixedSizeBinaryVector class should have a way to access the information about the dynamic type width. The only way I found is by changing the static type width in the class to non-static and initialize in the constructor. Did I miss a better way to pass the information? I don't like to put a lot of if/elseif/else statements in a codegen template either :)
Regarding the value being set to a positive integer like 1, it is highly misleading for readers in my opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Yeah byteWidth = 1 wouldn't work because of how the template works. But I think we should still have typeBitWidth = 8 in the type layout in the schema to match Binary type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@icexelloss, the VarBinary type has a width of 4 (instead of 8) which make sense because each element has a 4-byte offset. Do you think it is better to use 4 as well? Btw, VarBinary is using a different template.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Offset is 4 bytes wide. Data width is 1 byte wide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am talking about the type layout in the schema. Not the constant in the vector class. Sorry for the confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should probably refactor the templates some before introducing this change. In reality, I'd like to propose that we rework the vector hierarchy entirely and work much more like ArrowBuf and it's related parents work. In that case, we just grab a memory address and bypass all the object hierarchy for high performance methods but still use superclass delegation for non-performance critical methods.
For the ValueVector hierarchy/codegen situation, let's implement two classes:
BaseFixedVector (would take a dynamic size variable)
BaseVariableVector
These would both be non-code generated classes. Then we could generate code generated subclasses of these that only implement methods that need code generation for performance. Most of these methods would probably appear as custom methods in mutator/accessors.
Then, the changes for a new Custom fixed width vector would just be specializations of the BaseFixedWidthVector called something like FixedBinaryVector and FixedVarcharVector and in those specialization, the width in the mutators and accessors could come from a instance variable instead of a constant.
I believe this would allow us to simplify the codebase and make maintenance of the base classes much easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened up https://issues.apache.org/jira/browse/ARROW-1463 to cover this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good to me. I'll wait for the refactoring to complete and come back to modify this PR
| if (valueCount == 0) { | ||
| return 0; | ||
| } | ||
| return valueCount * ${type.width}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since FixedSizeBinary does not have a static type.width, I replaced all ${type.width} to TYPE_WIDTH.
| } | ||
|
|
||
| public void set(int index, ArrowBuf buffer){ | ||
| assert TYPE_WIDTH <= buffer.capacity(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a better way to validate an input of ArrowBuf type?
| return VALUES_64; | ||
| default: | ||
| throw new IllegalArgumentException("only 8, 16, 32, or 64 bits supported"); | ||
| return new VectorLayout(DATA, typeBitWidth); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing C++ implemetation allows any typeBitWidth, I followed and remove this exception. Is this correct?
| } | ||
|
|
||
| @Override | ||
| public TypeLayout visit(ArrowType.FixedSizeBinary type) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should just be newVariableWidthTypeLayout() to be consistent with Binary type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should still be FixedWidthTypeLayout because each element is fixed-size and no offset vector is necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I meant the data type width should be 8-bit instead of type.getByteWidth() * 8.
VarBinary has 8-bit data type width and FixedBinary should be the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, now I read more of the doc, I am not too sure.
https://github.com/apache/arrow/blob/master/format/Schema.fbs#L225
@wesm, can you clarify the what the correct bit_width in the vector layout should be?
| ByteArrayOutputStream stream = new ByteArrayOutputStream(); | ||
| final int numValues = 10; | ||
| final int typeWidth = 11; | ||
| byte[][] byteValues = new byte[numValues][typeWidth]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you fix the style of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I assume you mean the spacing if the for loop, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
|
lgtm |
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, @jacques-n can you have a look?
|
@alphalfalfa can you rebase? |
| * ${minor.class} implements a vector of fixed width values. Elements in the vector are accessed | ||
| * by position, starting from the logical start of the vector. Values should be pushed onto the | ||
| * vector sequentially, but may be randomly accessed. | ||
| <#if (type.width > 0) > |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should try to minimize the if else statement in the template to simply. I prefer we remove this if else here and state something like:
The width of each element is defined in getWidth()
| <#if (type.width < 0)> | ||
| public final int TYPE_WIDTH; | ||
| <#else> | ||
| public static final int TYPE_WIDTH = ${type.width}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about having TYPE_WIDTH defined as static in some classes and non static in others.
I think it's cleaner if we define TYPE_WIDTH only as static and define another class method getTypeWidth()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another reason is all caps are usually used to define static constant. Type width for fixed binary is not constant. so I think it's cleaner to not reuse the same variable for fixed binary.
| <#if (type.width < 0 || type.width > 8 || minor.class == "IntervalDay")> | ||
| public ${minor.javaType!type.javaType} get(int index) { | ||
| return data.slice(index * ${type.width}, ${type.width}); | ||
| <#if (minor.class == "FixedSizeBinary")> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's cleaner if we get FixedSizeBinary into it's top level:
<#if (type.width) < 0 && minor.class == "FixedSizeBinary">
public ${minor.javaType!type.javaType} get(int index) {
...
}
</#if>
Since it doesn't share logic with the original get method
| * ${minor.class}.Mutator implements a mutable vector of fixed width values. Elements in the | ||
| * vector are accessed by position from the logical start of the vector. Values should be pushed | ||
| * onto the vector sequentially, but may be randomly accessed. | ||
| <#if (type.width > 0)> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above. I prefer we don't have the if else here.
|
|
||
| <#assign fields = minor.fields!type.fields /> | ||
| public void set(int index, int isSet<#list fields as field>, ${field.type} ${field.name}Field</#list> ){ | ||
| public void set(int index, int isSet<#list fields as field><#if field.include!true >, ${field.type} ${field.name}Field</#if></#list> ){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit confusing, can you explain this logic? Seems you added field.include = false just for FixedBinaryVector, why?
| * @param value array of bytes (or int if smaller than 4 bytes) to write | ||
| */ | ||
| public void set(int index, <#if type.major == "VarLen">byte[]<#elseif (type.width < 4)>int<#else>${minor.javaType!type.javaType}</#if> value) { | ||
| public void set(int index, <#if type.major == "VarLen">byte[]<#elseif (type.width >= 0 && type.width < 4)>int<#else>${minor.javaType!type.javaType}</#if> value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this is not:
public void set(int index, <#if type.major == "VarLen" && minor.class == "FixedSizeBinary">byte[]<#elseif (type.width < 4)>int<#else>${minor.javaType!type.javaType}</#if> value)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nvm, the java type for fixed size binary is byte[] so this would be equivalent.
| } | ||
|
|
||
| <#elseif minor.class == "FixedSizeBinary"> | ||
| public void set(int index, ${minor.class}Holder holder){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These set methods looks similar to the existing ones, why do we need to have special code gen for these?
…tion tests (Updated) The original PR is at #1012. Due to the major refactoring last year, changes are big. So I created this separate PR for easier to review and will cancel the other one later. Changes include: * Arrow-633: [Java] Add support for FixedWidthBinary type * Arrow-634: Add integration tests for FixedSizeBinary Author: Jingyuan Wang <jingyuan@live.com> Author: jingyuan <jingyuan.nt@gmail.com> Closes #1492 from alphalfalfa/updated-arrow-633-634 and squashes the following commits: c994a19 [jingyuan] create data buffer layout for FixedSizeBinary directly instead of using dataBuffer() method and some other changes based on PR comments 873c5b2 [jingyuan] do not pass scale or byteWidth into writeValueToGenerator in JsonFileWriter.java 4fbb67d [jingyuan] put arbitrary values for FixSizeBinary nulls 0c0015d [Jingyuan Wang] add a new line a8a9553 [Jingyuan Wang] Pass byteWidth info for FixedSizeBinary type in JsonFileReader 1a64c5c [Jingyuan Wang] expand get() method inside getObject() method to remove duplicate check of isSet() 071fb25 [Jingyuan Wang] ARROW-633/634: Add FixedSizeBinary support in Java and integration tests
…tion tests (Updated) The original PR is at apache#1012. Due to the major refactoring last year, changes are big. So I created this separate PR for easier to review and will cancel the other one later. Changes include: * Arrow-633: [Java] Add support for FixedWidthBinary type * Arrow-634: Add integration tests for FixedSizeBinary Author: Jingyuan Wang <jingyuan@live.com> Author: jingyuan <jingyuan.nt@gmail.com> Closes apache#1492 from alphalfalfa/updated-arrow-633-634 and squashes the following commits: c994a19 [jingyuan] create data buffer layout for FixedSizeBinary directly instead of using dataBuffer() method and some other changes based on PR comments 873c5b2 [jingyuan] do not pass scale or byteWidth into writeValueToGenerator in JsonFileWriter.java 4fbb67d [jingyuan] put arbitrary values for FixSizeBinary nulls 0c0015d [Jingyuan Wang] add a new line a8a9553 [Jingyuan Wang] Pass byteWidth info for FixedSizeBinary type in JsonFileReader 1a64c5c [Jingyuan Wang] expand get() method inside getObject() method to remove duplicate check of isSet() 071fb25 [Jingyuan Wang] ARROW-633/634: Add FixedSizeBinary support in Java and integration tests
The changes include:
I also commented on changes that needs more review.