Skip to content

Conversation

@alphalfalfa
Copy link
Contributor

The original PR is at #1012. Due to the major refactoring last year, changes are big. So I created this separate PR for easier to review and will cancel the other one later.

Changes include:

  • Arrow-633: [Java] Add support for FixedWidthBinary type
  • Arrow-634: Add integration tests for FixedSizeBinary

public abstract class BaseFixedWidthVector extends BaseValueVector
implements FixedWidthVector, FieldVector, VectorDefinitionSetter {
private final byte typeWidth;
private final int typeWidth;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not changed the derived vector classes to use integer type. Should I?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it matters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TYPE_WIDTH const in the subclass is a static variable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for other types, they can be static, for fixed size binary, it has to be non-static

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe if you can update BitVector since it currently casts down to a byte

@alphalfalfa
Copy link
Contributor Author

@wesm @icexelloss

if (isSet(index) == 0) {
return null;
} else {
return get(index);
Copy link
Contributor

@icexelloss icexelloss Jan 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expand this function to avoid duplicate check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expanded

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't see the change. Did you push the change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm. This looks good now.

if (byteWidth == 0) {
byteWidth = value.length;
} else if (byteWidth != value.length) {
throw new IOException("mismatch byte width (" + value.length + ") at index " + i + ", expecting " + byteWidth);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean the input is malformatted? If this is, then it probably shouldn't be IOException..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to RuntimeException

}
}
if (count > 0 && byteWidth == 0) {
throw new IOException("could not determine the byte width of the vector because all elements are null");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to RuntimeException

values.add(value);
if (value.length > 0) {
if (byteWidth == 0) {
byteWidth = value.length;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you read the byteWidth from the schema rather than interpret it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

byteWidth info is now collected from the vector and passed in

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I am not sure I understand.. Why don't we use the byteWidth in schema to inform the expected length here?


ArrowBuf buf = allocator.buffer(byteWidth * count);
for (byte[] value : values) {
buf.writeBytes(value.length == 0? new byte[byteWidth] : value);
Copy link
Contributor

@icexelloss icexelloss Jan 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can there be empty values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For nullable vectors, when the value is null, the corresponding JSON field is empty.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see

* @param index position of the element to set
* @param holder holder that carries data buffer.
*/
public void set(int index, NullableFixedSizeBinaryHolder holder) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call this FixedSizeBinaryHolder instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are both Holders and NullableHolders.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see

to.copyFromSafe(fromIndex, toIndex, FixedSizeBinaryVector.this);
}
}
} No newline at end of file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need a newline

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

@icexelloss
Copy link
Contributor

High level looks good to me. Need to come back and look in more detail.

@alphalfalfa
Copy link
Contributor Author

Anybody knows what should I do with the check failure:

The command "sudo -E apt-get -yq --no-install-suggests --no-install-recommends --force-yes install gcc-4.9 g++-4.9 gdb ccache valgrind libboost-dev libboost-filesystem-dev libboost-system-dev libjemalloc-dev gtk-doc-tools autoconf-archive libgirepository1.0-dev" failed and exited with 100 during .

Also, more comments on this PR?

@icexelloss
Copy link
Contributor

@siddharthteotia or @BryanCutler Could one of you please also take a look? Thanks!

@wesm
Copy link
Member

wesm commented Jan 23, 2018

@alphalfalfa we've been experiencing a lot of apt flakiness in Travis CI. Someone should escalate it with the support system to see what's going on

@wesm
Copy link
Member

wesm commented Jan 23, 2018

I just restarted the failing job

BufferReader INT8 = new BufferReader() {
@Override
protected ArrowBuf read(BufferAllocator allocator, int count) throws IOException {
protected ArrowBuf read(BufferAllocator allocator, int count, int byteWidth) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... I am not sure about this. Having byteWidth in every reader seems weird because most don't need it. Can you put make a member variable of fixed size binary reader?

def _get_buffers(self):
data = []
for i, v in enumerate(self.values):
data.append(self._encode_value(v if self.is_valid[i] else ""))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems strange. Why are we using empty string to represent null value?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can probably just use any value for null instead of empty string. It should make the logic cleaner. (less if statements)

Copy link
Contributor Author

@alphalfalfa alphalfalfa Jan 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh... I put empty strings as null. sure, i can put some arbitrary strings

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made the changes and simplified the logic inside JsonFileReader

if (bufferType.equals(DATA) && (vector.getMinorType() == Types.MinorType.VARCHAR ||
vector.getMinorType() == Types.MinorType.VARBINARY)) {
writeValueToGenerator(bufferType, vectorBuffer, vectorBuffers.get(v-1), vector, i, scale);
writeValueToGenerator(bufferType, vectorBuffer, vectorBuffers.get(v-1), vector, i, scale, byteWidth);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method doesn't need to take scale and byteWidth. It can figure this out from vector.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, modified.

@wesm
Copy link
Member

wesm commented Jan 29, 2018

@siddharthteotia @jacques-n could we get some more eyes on this? Thanks!

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alphalfalfa , looks pretty good! I just had some minor suggestions

class FixedSizeBinaryType(PrimitiveType):

def __init__(self, name, byte_width, nullable=True):
PrimitiveType.__init__(self, name, nullable)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: I think super(..).__init__(..) is a little better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

public abstract class BaseFixedWidthVector extends BaseValueVector
implements FixedWidthVector, FieldVector, VectorDefinitionSetter {
private final byte typeWidth;
private final int typeWidth;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe if you can update BitVector since it currently casts down to a byte


@Override
public TypeLayout visit(FixedSizeBinary type) {
return newFixedWidthTypeLayout(BufferLayout.dataBuffer(type.getByteWidth() * 8));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you just do?

return newFixedWidthTypeLayout(new BufferLayout(BufferType.DATA, type.getByteWidth() * 8)

Then you don't need to BufferLayout dataBuffer(int typeBitWidth)
and then keep the

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

return VALUES_128;
default:
throw new IllegalArgumentException("only 8, 16, 32, or 64 bits supported");
return new BufferLayout(BufferType.DATA, typeBitWidth);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think function is really "get data buffer layout for types that support certain bit widths" - so it might be beneficial to keep the exception, although it looks like it should be updated to include 128. I'm not sure how exactly the exception would be thrown, but if somehow a buffer layout for an Int with 30 bit width was created, this would catch it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted and added 128 bits to the exception message

values.add(value);
}

int byteWidth = count > 0? values.get(0).length : 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add space before ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

…ng dataBuffer() method and some other changes based on PR comments
@icexelloss
Copy link
Contributor

@siddharthteotia Do you think we can merge this? LGTM but want to double check with you.

@wesm wesm changed the title ARROW-633/634: Add FixedSizeBinary support in Java and integration tests (Updated) ARROW-633/634: [Java] Add FixedSizeBinary support in Java and integration tests (Updated) Feb 5, 2018
@jacques-n
Copy link
Contributor

It seems like we are missing support in the ComplexWriter and FieldReader interfaces. I think that should be included in introduction of these things. The patch here looks fine but I think we should include that before releasing this feature (either in this patch or a follow-up).

@jacques-n
Copy link
Contributor

Per my comment, I'm +1 on the spirit on these changes but think the patch (or feature) is somewhat incomplete

@alphalfalfa
Copy link
Contributor Author

@jacques-n, to make sure that I understand the question properly, do you mean there should be proper interfaces defined for FixedSizeBinary type inside AbstractFieldWriter and AbstractFieldReader?
Currently, AbstractFieldWriter has:
public void writeFixedSizeBinary(ArrowBuf buffer)
AbstractFieldReaderWriter has:
public byte[] readByteArray()
What would be the desired interfaces to add?

I would prefer to add the necessary fix in a separate PR as this one has grown too big if it is OK.

@jacques-n
Copy link
Contributor

You're right. I forgot that the code is autogenerated and was looking for it here. I'm +1 on this PR.

@wesm
Copy link
Member

wesm commented Feb 7, 2018

thanks all, please open any follow-up JIRAs

@wesm wesm closed this in f69e9db Feb 7, 2018
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
…tion tests (Updated)

The original PR is at apache#1012. Due to the major refactoring last year, changes are big. So I created this separate PR for easier to review and will cancel the other one later.

Changes include:
* Arrow-633: [Java] Add support for FixedWidthBinary type
* Arrow-634: Add integration tests for FixedSizeBinary

Author: Jingyuan Wang <jingyuan@live.com>
Author: jingyuan <jingyuan.nt@gmail.com>

Closes apache#1492 from alphalfalfa/updated-arrow-633-634 and squashes the following commits:

c994a19 [jingyuan] create data buffer layout for FixedSizeBinary directly instead of using dataBuffer() method and some other changes based on PR comments
873c5b2 [jingyuan] do not pass scale or byteWidth into writeValueToGenerator in JsonFileWriter.java
4fbb67d [jingyuan] put arbitrary values for FixSizeBinary nulls
0c0015d [Jingyuan Wang] add a new line
a8a9553 [Jingyuan Wang] Pass byteWidth info for FixedSizeBinary type in JsonFileReader
1a64c5c [Jingyuan Wang] expand get() method inside getObject() method to remove duplicate check of isSet()
071fb25 [Jingyuan Wang] ARROW-633/634: Add FixedSizeBinary support in Java and integration tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants