-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-1238: [Java] Adding Decimal type JSON read and write support #994
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-1238: [Java] Adding Decimal type JSON read and write support #994
Conversation
| ArrowType.Decimal type = (ArrowType.Decimal) vector.getField().getType(); | ||
| BigDecimal decimalValue = new BigDecimal(BigInteger.valueOf(value), type.getScale()); | ||
| DecimalUtility.writeBigDecimalToArrowBuf(decimalValue, vector.getBuffer(), index); | ||
| vector.getValidityVector().getMutator().setToOne(index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This procedure to write BigDecimal values to a vector is pretty awkward
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I prefer that we add setSafe(index, decimalValue)
| // Verify decimal 1 vector | ||
| BigDecimal readValue = decimalVector1.getAccessor().getObject(i); | ||
| ArrowType.Decimal type = (ArrowType.Decimal) decimalVector1.getField().getType(); | ||
| BigDecimal genValue = new BigDecimal(BigInteger.valueOf(i), type.getScale()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be much cleaner to get the scale with #972 merged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BryanCutler is #972 just for testing purposes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that PR came from me adding Timestamp to Spark. I needed to check the time zone string given a NullableTimeStampMicroTZVector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see.
|
It ended up to be really awkward to write decimal values to the vector. Maybe there is an easier way I just missed? The
This is not very user friendly and there is lots of room for error - it also unsafe and assumes the vector already has the needed capacity. I'm assuming we don't want the vector API to directly accept a What are your thoughts on the above @jacques-n @julienledem @icexelloss ? |
|
Definitely agree that the interface can be improved. I see no reason not to support BigDecimal as another interface. We'd just need to assert that precision/scale match expected. I'm not quite following the note on the validity vector and the capacity management as NullableDecimalVector.Mutator.setSafe* methods have numerous different options: |
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears that decimals are always 16 bytes in memory in Java. In C++ at the moment we determine the byte width based on the precision. I think we can make all decimals 16-bytes (versus 4, 8, and 16 depending on the precision) but it would require refactoring, so we should discuss. It's too bad we didn't dig into this sooner. cc @cpcloud
|
Hi @jacques-n , thanks for your response. Regarding the Instead, it is more efficient to write the decimal value right to the inner vector data buffer, e.g Are you saying it would be ok to add the following methods to If not, I could probably add these type of functions to the |
@wesm , I could look into adding support for 8 and 4 bytes values either before or after this PR is done so we could get integration fully working for decimals? |
|
It feels awkward to me to have to use |
| Mutator mutator = valueVector.getMutator(); | ||
|
|
||
| int innerVectorCount = vectorType.equals(OFFSET) ? count + 1 : count; | ||
| valueVector.setInitialCapacity(innerVectorCount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the vectors do not use setSafe so reading beyond the default capacity would cause problems. Maybe for cases where the majority of values are null this is overdoing it, but I think that is less common.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| case VARBINARY: | ||
| String hexString = Hex.encodeHexString(((VarBinaryVector) valueVector).getAccessor().get(i)); | ||
| generator.writeObject(hexString); | ||
| case VARBINARY: { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove the brackets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's there to limit the scope of the variables declared in the case block to that case only. Since now there are 2 blocks decoding hex values, just to prevent using the wrong variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Thanks for the review @icexelloss . I'll try out adding set methods for BigDecimal to the vector classes, but I do wonder if maybe there was a design reason to not do this. Sort of like how the date/time vectors do not have and set methods with friendly classes, but rely on the |
|
I updated with adding these methods the Decimal vector APIs and |
| } | ||
|
|
||
| public void set(int index, ${friendlyType} value){ | ||
| DecimalUtility.writeBigDecimalToArrowBuf(value, data, index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should verify precision and scale
| } | ||
|
|
||
| public static BigDecimal getBigDecimalFromArrowBuf(ArrowBuf bytebuf, int startIndex, int scale) { | ||
| public static BigDecimal getBigDecimalFromArrowBuf(ArrowBuf bytebuf, int index, int scale) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing the consistency. Can you add the doc explaining index is not the byte index, but the value index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, adding some docs would be nice here
| values[i] = decimal; | ||
| decimalVector.getMutator().setIndexDefined(i); | ||
| DecimalUtility.writeBigDecimalToArrowBuf(decimal, decimalVector.getBuffer(), i); | ||
| decimalVector.getMutator().setSafe(i, decimal); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test case to cover precision/scale mismatch with the setSafe(int, BigDecimal) method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What mismatch to you mean exactly - like if the BigDecimal precision is more than Arrow supports?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think,
For scale, we should check BigDecimal.scale() equals the vector.scale
For precision, we should check BigDecimal.precision() <= vector.precision
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right definitely. I'll add that thanks!
| } | ||
|
|
||
| public static void writeBigDecimalToArrowBuf(BigDecimal value, ArrowBuf bytebuf, int index) { | ||
| final byte[] bytes = value.unscaledValue().toByteArray(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there issues with endianness here? Java is big endian: https://docs.oracle.com/javase/7/docs/api/java/math/BigInteger.html#toByteArray() but Arrow is little endian? https://github.com/apache/arrow/blob/master/format/Layout.md#byte-order-endianness
cc @cpcloud
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checked with @cpcloud offline. This is consistent with the C++ implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for checking on that
|
Updated with check for BigDecimal mismatch and tests, and added docs to DecimalUtility |
|
@BryanCutler Thank you! LGTM. @jacques-n Do you want to take another look at the setSafe(BigDecimal) method? |
|
Is there anything preventing this from being merged? |
|
I don't think so, let me take a quick look and then merge if all looks good |
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, merging so that work on integration tests can proceed. If @jacques-n or @julienledem could take a last look to make sure nothing is amiss with the new methods here in case we need to fix before 0.7.0 goes out
|
thanks @BryanCutler and @icexelloss and others! |
This change adds support to reading and writing Decimal type vectors from JSON files. Data values are written as encoded hex and padded with 0's up to 16 bytes. Added roundtrip unit tests. Author: Bryan Cutler <cutlerb@gmail.com> Closes apache#994 from BryanCutler/java-json-decimal-support-ARROW-1238 and squashes the following commits: 28c1e3e [Bryan Cutler] added test for BigDecimal precision and scale mismatch 31b7ec1 [Bryan Cutler] Added check that BigDecimal precision and scale matches that of the vector 10cac9c [Bryan Cutler] added vector API for set and setSafe with BigDecimal c5e8fba [Bryan Cutler] minor tweaks to JsonFileWriter da11b4f [Bryan Cutler] removed debug line f4560d9 [Bryan Cutler] added Decimal JSON support, Java roundtrip unit tests
This change adds support to reading and writing Decimal type vectors from JSON files. Data values are written as encoded hex and padded with 0's up to 16 bytes.
Added roundtrip unit tests.