Skip to content

Conversation

@alphalfalfa
Copy link
Contributor

The changes include:

  • Arrow-633: [Java] Add support for FixedWidthBinary type
  • Arrow-634: Add integration tests for FixedSizeBinary
  • Fixed a bug of integration test suite that primitive case with non-zero batch sizes was ignored

I also commented on changes that needs more review.

fields.append(get_field(type_ + "_nonnullable", type_, False))

return _generate_file("primitive", fields, batch_sizes)
name = "primitive"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added name suffix for primitive case such that primitive test case at line 823/824 become both effective. Previously the first one was overwritten by the second one and hence ignored.

]
},
{
major: "Fixed",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FixedSizeBinary has a variable width, I am not sure what integer value to put here. I went ahead and put -1 to mark it as a special case. This changes also leads to quite some changes of if statements in the codegen template FixedValueVectors.java.

Copy link
Contributor

@icexelloss icexelloss Aug 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about this change. FixValueVectors have static type width that is specified in ValueVectorTypes.tdd. Having type width to be dynamic would complicated the template and result in inconsistency in the resulting classes.

Is there a reason FixedSizeBinary needs to have dynamic type witdh? i.e., can we have byteWitdh = 1 and use the type width in the Schema to retrieve and store values? This would simplify things.

cc @wesm

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bottom line is that FixedSizeBinaryVector class should have a way to access the information about the dynamic type width. The only way I found is by changing the static type width in the class to non-static and initialize in the constructor. Did I miss a better way to pass the information? I don't like to put a lot of if/elseif/else statements in a codegen template either :)

Regarding the value being set to a positive integer like 1, it is highly misleading for readers in my opinion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Yeah byteWidth = 1 wouldn't work because of how the template works. But I think we should still have typeBitWidth = 8 in the type layout in the schema to match Binary type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@icexelloss, the VarBinary type has a width of 4 (instead of 8) which make sense because each element has a 4-byte offset. Do you think it is better to use 4 as well? Btw, VarBinary is using a different template.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Offset is 4 bytes wide. Data width is 1 byte wide.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am talking about the type layout in the schema. Not the constant in the vector class. Sorry for the confusion.

Copy link
Contributor

@jacques-n jacques-n Sep 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should probably refactor the templates some before introducing this change. In reality, I'd like to propose that we rework the vector hierarchy entirely and work much more like ArrowBuf and it's related parents work. In that case, we just grab a memory address and bypass all the object hierarchy for high performance methods but still use superclass delegation for non-performance critical methods.

For the ValueVector hierarchy/codegen situation, let's implement two classes:

BaseFixedVector (would take a dynamic size variable)
BaseVariableVector

These would both be non-code generated classes. Then we could generate code generated subclasses of these that only implement methods that need code generation for performance. Most of these methods would probably appear as custom methods in mutator/accessors.

Then, the changes for a new Custom fixed width vector would just be specializations of the BaseFixedWidthVector called something like FixedBinaryVector and FixedVarcharVector and in those specialization, the width in the mutators and accessors could come from a instance variable instead of a constant.

I believe this would allow us to simplify the codebase and make maintenance of the base classes much easier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good to me. I'll wait for the refactoring to complete and come back to modify this PR

if (valueCount == 0) {
return 0;
}
return valueCount * ${type.width};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since FixedSizeBinary does not have a static type.width, I replaced all ${type.width} to TYPE_WIDTH.

}

public void set(int index, ArrowBuf buffer){
assert TYPE_WIDTH <= buffer.capacity();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better way to validate an input of ArrowBuf type?

return VALUES_64;
default:
throw new IllegalArgumentException("only 8, 16, 32, or 64 bits supported");
return new VectorLayout(DATA, typeBitWidth);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing C++ implemetation allows any typeBitWidth, I followed and remove this exception. Is this correct?

@alphalfalfa alphalfalfa changed the title WIP: ARROW-633/ARROW-634: Add Java support and integration test for FixedSizeBinary ARROW-633/ARROW-634: Add Java support and integration test for FixedSizeBinary Aug 30, 2017
}

@Override
public TypeLayout visit(ArrowType.FixedSizeBinary type) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should just be newVariableWidthTypeLayout() to be consistent with Binary type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should still be FixedWidthTypeLayout because each element is fixed-size and no offset vector is necessary

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant the data type width should be 8-bit instead of type.getByteWidth() * 8.

VarBinary has 8-bit data type width and FixedBinary should be the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, now I read more of the doc, I am not too sure.

https://github.com/apache/arrow/blob/master/format/Schema.fbs#L225

@wesm, can you clarify the what the correct bit_width in the vector layout should be?

ByteArrayOutputStream stream = new ByteArrayOutputStream();
final int numValues = 10;
final int typeWidth = 11;
byte[][] byteValues = new byte[numValues][typeWidth];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you fix the style of this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I assume you mean the spacing if the for loop, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

@elahrvivaz
Copy link
Contributor

lgtm

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, @jacques-n can you have a look?

@wesm
Copy link
Member

wesm commented Sep 4, 2017

@alphalfalfa can you rebase?

* ${minor.class} implements a vector of fixed width values. Elements in the vector are accessed
* by position, starting from the logical start of the vector. Values should be pushed onto the
* vector sequentially, but may be randomly accessed.
<#if (type.width > 0) >
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should try to minimize the if else statement in the template to simply. I prefer we remove this if else here and state something like:

The width of each element is defined in getWidth()

<#if (type.width < 0)>
public final int TYPE_WIDTH;
<#else>
public static final int TYPE_WIDTH = ${type.width};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about having TYPE_WIDTH defined as static in some classes and non static in others.

I think it's cleaner if we define TYPE_WIDTH only as static and define another class method getTypeWidth()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another reason is all caps are usually used to define static constant. Type width for fixed binary is not constant. so I think it's cleaner to not reuse the same variable for fixed binary.

<#if (type.width < 0 || type.width > 8 || minor.class == "IntervalDay")>
public ${minor.javaType!type.javaType} get(int index) {
return data.slice(index * ${type.width}, ${type.width});
<#if (minor.class == "FixedSizeBinary")>
Copy link
Contributor

@icexelloss icexelloss Sep 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's cleaner if we get FixedSizeBinary into it's top level:

<#if (type.width) < 0 && minor.class == "FixedSizeBinary">
public ${minor.javaType!type.javaType} get(int index) {
   ...
}
</#if>

Since it doesn't share logic with the original get method

* ${minor.class}.Mutator implements a mutable vector of fixed width values. Elements in the
* vector are accessed by position from the logical start of the vector. Values should be pushed
* onto the vector sequentially, but may be randomly accessed.
<#if (type.width > 0)>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above. I prefer we don't have the if else here.


<#assign fields = minor.fields!type.fields />
public void set(int index, int isSet<#list fields as field>, ${field.type} ${field.name}Field</#list> ){
public void set(int index, int isSet<#list fields as field><#if field.include!true >, ${field.type} ${field.name}Field</#if></#list> ){
Copy link
Contributor

@icexelloss icexelloss Sep 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit confusing, can you explain this logic? Seems you added field.include = false just for FixedBinaryVector, why?

* @param value array of bytes (or int if smaller than 4 bytes) to write
*/
public void set(int index, <#if type.major == "VarLen">byte[]<#elseif (type.width < 4)>int<#else>${minor.javaType!type.javaType}</#if> value) {
public void set(int index, <#if type.major == "VarLen">byte[]<#elseif (type.width >= 0 && type.width < 4)>int<#else>${minor.javaType!type.javaType}</#if> value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this is not:

public void set(int index, <#if type.major == "VarLen" && minor.class == "FixedSizeBinary">byte[]<#elseif (type.width < 4)>int<#else>${minor.javaType!type.javaType}</#if> value)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nvm, the java type for fixed size binary is byte[] so this would be equivalent.

}

<#elseif minor.class == "FixedSizeBinary">
public void set(int index, ${minor.class}Holder holder){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These set methods looks similar to the existing ones, why do we need to have special code gen for these?

wesm pushed a commit that referenced this pull request Feb 7, 2018
…tion tests (Updated)

The original PR is at #1012. Due to the major refactoring last year, changes are big. So I created this separate PR for easier to review and will cancel the other one later.

Changes include:
* Arrow-633: [Java] Add support for FixedWidthBinary type
* Arrow-634: Add integration tests for FixedSizeBinary

Author: Jingyuan Wang <jingyuan@live.com>
Author: jingyuan <jingyuan.nt@gmail.com>

Closes #1492 from alphalfalfa/updated-arrow-633-634 and squashes the following commits:

c994a19 [jingyuan] create data buffer layout for FixedSizeBinary directly instead of using dataBuffer() method and some other changes based on PR comments
873c5b2 [jingyuan] do not pass scale or byteWidth into writeValueToGenerator in JsonFileWriter.java
4fbb67d [jingyuan] put arbitrary values for FixSizeBinary nulls
0c0015d [Jingyuan Wang] add a new line
a8a9553 [Jingyuan Wang] Pass byteWidth info for FixedSizeBinary type in JsonFileReader
1a64c5c [Jingyuan Wang] expand get() method inside getObject() method to remove duplicate check of isSet()
071fb25 [Jingyuan Wang] ARROW-633/634: Add FixedSizeBinary support in Java and integration tests
@alphalfalfa alphalfalfa closed this Feb 7, 2018
@alphalfalfa alphalfalfa deleted the arrow-633-634 branch February 8, 2018 22:27
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
…tion tests (Updated)

The original PR is at apache#1012. Due to the major refactoring last year, changes are big. So I created this separate PR for easier to review and will cancel the other one later.

Changes include:
* Arrow-633: [Java] Add support for FixedWidthBinary type
* Arrow-634: Add integration tests for FixedSizeBinary

Author: Jingyuan Wang <jingyuan@live.com>
Author: jingyuan <jingyuan.nt@gmail.com>

Closes apache#1492 from alphalfalfa/updated-arrow-633-634 and squashes the following commits:

c994a19 [jingyuan] create data buffer layout for FixedSizeBinary directly instead of using dataBuffer() method and some other changes based on PR comments
873c5b2 [jingyuan] do not pass scale or byteWidth into writeValueToGenerator in JsonFileWriter.java
4fbb67d [jingyuan] put arbitrary values for FixSizeBinary nulls
0c0015d [Jingyuan Wang] add a new line
a8a9553 [Jingyuan Wang] Pass byteWidth info for FixedSizeBinary type in JsonFileReader
1a64c5c [Jingyuan Wang] expand get() method inside getObject() method to remove duplicate check of isSet()
071fb25 [Jingyuan Wang] ARROW-633/634: Add FixedSizeBinary support in Java and integration tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants