ARROW-1837: [Java][Integration] Fix unsigned round trip integration tests #4432
Changes from all commits
Stream-to-file converter (`convert(InputStream, OutputStream)`):

@@ -42,13 +42,12 @@ public static void convert(InputStream in, OutputStream out) throws IOException
     BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
     try (ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) {
       VectorSchemaRoot root = reader.getVectorSchemaRoot();
-      // load the first batch before instantiating the writer so that we have any dictionaries
-      if (!reader.loadNextBatch()) {
-        throw new IOException("Unable to read first record batch");
-      }
+      // load the first batch before instantiating the writer so that we have any dictionaries.
+      // Only writeBatches if we load the first one.
+      boolean writeBatches = reader.loadNextBatch();
       try (ArrowFileWriter writer = new ArrowFileWriter(root, reader, Channels.newChannel(out))) {
         writer.start();
-        while (true) {
+        while (writeBatches) {
           writer.writeBatch();
           if (!reader.loadNextBatch()) {
             break;
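For context on the behavior change in this hunk: a stream that carries only a schema and no record batches used to fail with "Unable to read first record batch"; it now converts to a file that contains just the schema and zero batches. The sketch below is not part of the patch; it only exercises the new control flow end to end. The class name, the single unsigned field, and the in-memory streams are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.channels.Channels;
import java.util.Collections;

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.ipc.ArrowStreamReader;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

public class EmptyStreamRoundTrip {
  public static void main(String[] args) throws Exception {
    try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE)) {
      // Build a stream that contains only a schema, no record batches.
      Schema schema = new Schema(Collections.singletonList(
          Field.nullable("u8", new ArrowType.Int(8, /*signed=*/false))));
      ByteArrayOutputStream streamBytes = new ByteArrayOutputStream();
      try (VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
           ArrowStreamWriter writer =
               new ArrowStreamWriter(root, null, Channels.newChannel(streamBytes))) {
        writer.start();
        writer.end(); // no writeBatch() calls
      }

      // Convert stream -> file with the same pattern as the patched convert():
      // loadNextBatch() returns false, so the write loop is skipped, but the
      // file still gets a valid schema and footer instead of an IOException.
      ByteArrayOutputStream fileBytes = new ByteArrayOutputStream();
      try (ArrowStreamReader reader = new ArrowStreamReader(
               new ByteArrayInputStream(streamBytes.toByteArray()), allocator)) {
        VectorSchemaRoot root = reader.getVectorSchemaRoot();
        boolean writeBatches = reader.loadNextBatch();
        try (ArrowFileWriter writer =
                 new ArrowFileWriter(root, reader, Channels.newChannel(fileBytes))) {
          writer.start();
          while (writeBatches) {
            writer.writeBatch();
            if (!reader.loadNextBatch()) {
              break;
            }
          }
          writer.end();
        }
      }
      System.out.println("Wrote " + fileBytes.size() + " bytes containing zero batches");
    }
  }
}
```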
UInt1Vector:

@@ -28,6 +28,8 @@
 import org.apache.arrow.vector.types.pojo.FieldType;
 import org.apache.arrow.vector.util.TransferPair;
 
+import io.netty.buffer.ArrowBuf;
+
 /**
  * UInt1Vector implements a fixed width (1 bytes) vector of
  * integer values which could be null. A validity buffer (bit vector) is
@@ -62,6 +64,23 @@ public MinorType getMinorType() {
    |          vector value retrieval methods                        |
    |                                                                |
    *----------------------------------------------------------------*/
+  /**
+   * Given a data buffer, get the value stored at a particular position
+   * in the vector.
+   *
+   * <p>To avoid overflow, the returned type is one step up from the signed
+   * type.
+   *
+   * <p>This method is mainly meant for integration tests.
+   *
+   * @param buffer data buffer
+   * @param index position of the element.
+   * @return value stored at the index.
+   */
+  public static short getNoOverflow(final ArrowBuf buffer, final int index) {
+    byte b = buffer.getByte(index * TYPE_WIDTH);
+    return (short) (0xFF & b);
+  }
 
 
   /**
@@ -107,6 +126,20 @@ public Byte getObject(int index) {
     }
   }
 
+  /**
+   * Returns the value stored at index without the potential for overflow.
+   *
+   * @param index position of element
+   * @return element at given index
+   */
+  public Short getObjectNoOverflow(int index) {
+    if (isSet(index) == 0) {
+      return null;
+    } else {
+      return getNoOverflow(valueBuffer, index);
+    }
+  }
+
   /**
    * Copies the value at fromIndex to thisIndex (including validity).
    */
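To see why the widened return types matter for the unsigned round trip, here is a small snippet that is not part of the patch (the vector name and values are made up): the byte 0xFF stored in a UInt1Vector comes back as -1 through the signed Byte accessor, while the new no-overflow accessors report the unsigned value 255.

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.UInt1Vector;

public class UInt1NoOverflowDemo {
  public static void main(String[] args) {
    try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         UInt1Vector vector = new UInt1Vector("u8", allocator)) {
      vector.allocateNew(2);
      vector.setSafe(0, 0xFF); // largest unsigned 8-bit value
      vector.setNull(1);
      vector.setValueCount(2);

      // Signed view: the raw byte 0xFF is reinterpreted as -1.
      System.out.println(vector.getObject(0));            // -1
      // Widened view added by this patch: 255, with nulls still reported as null.
      System.out.println(vector.getObjectNoOverflow(0));  // 255
      System.out.println(vector.getObjectNoOverflow(1));  // null
    }
  }
}
```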
Is this not different from the earlier behavior? We would have thrown an exception for zero batches, whereas it looks like we now proceed silently.
Yes, this is intentional. This was discussed on the ML; we think it is reasonable to have a schema without batches.
Would it not be better to do that in a different method? I am not sure whether existing clients depend on this for validation.
Maybe. Given that there are no unit tests for this code path, I suspect it is only used in integration tests. Are you familiar with any additional use cases? This change also seems like the correct behavior (I don't see why zero batches should be an exception), though it would be better to validate at the file level that there wasn't any corruption; I don't think the spec has provisions for that.
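If some client does turn out to rely on the old exception for validation, that contract is easy to keep outside the shared converter rather than inside it. A hypothetical sketch (class and method names are illustrative, not proposed API):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.channels.Channels;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.ipc.ArrowStreamReader;

public final class StrictStreamToFile {
  private StrictStreamToFile() {}

  /** Converts a stream to a file but keeps the old contract: at least one batch is required. */
  public static void convertRequiringBatches(InputStream in, OutputStream out) throws IOException {
    BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
    try (ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) {
      VectorSchemaRoot root = reader.getVectorSchemaRoot();
      if (!reader.loadNextBatch()) {
        // Re-create the pre-patch validation for callers that treat a
        // batch-less stream as corruption.
        throw new IOException("Unable to read first record batch");
      }
      try (ArrowFileWriter writer = new ArrowFileWriter(root, reader, Channels.newChannel(out))) {
        writer.start();
        do {
          writer.writeBatch();
        } while (reader.loadNextBatch());
        writer.end();
      }
    }
  }
}
```

That keeps the shared tool permissive for schema-only streams while letting strict callers opt back into the old failure mode.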
Not sure of the exact use cases, but I have been dealing with a lot of breaking changes while upgrading Dremio to the latest Arrow, so I am wary of anything that looks like a change in behavior :)
Agreed, the new behavior seems like the right one to implement. Sounds good for now.
Understood. At some point we should discuss API/behavior stability on the ML.