ARROW-1238: [Java] Adding Decimal type JSON read and write support #994

BryanCutler · 2017-08-25T23:05:51Z

This change adds support for reading and writing Decimal type vectors to and from JSON files. Data values are written as hex-encoded strings, padded with 0's up to 16 bytes.

Added roundtrip unit tests.
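To make the encoding concrete, here is a minimal sketch of a hex round trip for a 16-byte decimal using only the JDK. The class and method names are hypothetical and this is not the code in the PR. Two things are assumptions: the bytes are kept in the big-endian order that BigInteger.toByteArray() produces, and negative values are sign-extended with 0xFF rather than zero-padded.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.util.Arrays;

// Hypothetical helper, not part of Arrow.
final class DecimalHexSketch {
    static final int DECIMAL_BYTE_LENGTH = 16;

    // Encode the unscaled value as 16 two's-complement bytes, then as a hex string.
    static String toPaddedHex(BigDecimal value) {
        byte[] raw = value.unscaledValue().toByteArray(); // big-endian, minimal length
        if (raw.length > DECIMAL_BYTE_LENGTH) {
            throw new IllegalArgumentException("value does not fit in 16 bytes");
        }
        byte[] padded = new byte[DECIMAL_BYTE_LENGTH];
        // Assumption: sign-extend negatives so the padded bytes keep the same value.
        if (value.signum() < 0) {
            Arrays.fill(padded, (byte) 0xFF);
        }
        System.arraycopy(raw, 0, padded, DECIMAL_BYTE_LENGTH - raw.length, raw.length);
        StringBuilder hex = new StringBuilder(2 * DECIMAL_BYTE_LENGTH);
        for (byte b : padded) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Decode the hex string back into a BigDecimal with the given scale.
    static BigDecimal fromPaddedHex(String hex, int scale) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new BigDecimal(new BigInteger(bytes), scale);
    }

    public static void main(String[] args) {
        BigDecimal value = new BigDecimal("123.45");
        String hex = toPaddedHex(value);
        System.out.println(hex);
        System.out.println(fromPaddedHex(hex, value.scale()));
    }
}
```

The scale is not part of the encoded bytes in this sketch; it has to come from the field's type when reading.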

BryanCutler · 2017-08-25T23:08:50Z

java/vector/src/test/java/org/apache/arrow/vector/file/BaseFileTest.java

+    ArrowType.Decimal type = (ArrowType.Decimal) vector.getField().getType();
+    BigDecimal decimalValue = new BigDecimal(BigInteger.valueOf(value), type.getScale());
+    DecimalUtility.writeBigDecimalToArrowBuf(decimalValue, vector.getBuffer(), index);
+    vector.getValidityVector().getMutator().setToOne(index);


This procedure to write BigDecimal values to a vector is pretty awkward

I agree. I prefer that we add setSafe(index, decimalValue)

BryanCutler · 2017-08-25T23:11:49Z

java/vector/src/test/java/org/apache/arrow/vector/file/BaseFileTest.java

+      // Verify decimal 1 vector
+      BigDecimal readValue = decimalVector1.getAccessor().getObject(i);
+      ArrowType.Decimal type = (ArrowType.Decimal) decimalVector1.getField().getType();
+      BigDecimal genValue = new BigDecimal(BigInteger.valueOf(i), type.getScale());


it would be much cleaner to get the scale with #972 merged

@BryanCutler is #972 just for testing purposes?

No, that PR came from me adding Timestamp to Spark. I needed to check the time zone string given a NullableTimeStampMicroTZVector

BryanCutler · 2017-08-25T23:33:14Z

It ended up being really awkward to write decimal values to the vector. Maybe there is an easier way I just missed? The DecimalVector API just has set(int index, ArrowBuf value), so this is the basic procedure I took, assuming I have a Java BigDecimal value:

1. Get a pointer to the vector buffer.
2. Use DecimalUtility.writeBigDecimalToArrowBuf, which converts the value to a byte array and writes it to the buffer at the given index.
3. Set the validity vector bit to 1 for the given index.

This is not very user friendly and there is lots of room for error. It is also unsafe, since it assumes the vector already has the needed capacity. I'm assuming we don't want the vector API to directly accept a BigDecimal, but what about a byte array? That would make things a little better, as the DecimalUtility could convert the BigDecimal to a byte array, and then any work on the vector buffer could be done internally, which would allow for a setSafe method.
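The difference between the manual procedure and a setSafe-style method can be illustrated with a toy model. The class below is a hypothetical stand-in built on plain byte arrays, not the Arrow vector API. It only shows which steps (slot write, validity bit, capacity growth) a safe setter would hide from the caller.

```java
import java.math.BigDecimal;
import java.util.Arrays;

// Toy stand-in for a nullable decimal vector; NOT the Arrow API.
final class ToyDecimalVector {
    static final int WIDTH = 16;  // bytes per decimal value

    private byte[] data;          // value slots, WIDTH bytes each
    private byte[] validity;      // one bit per value, 1 = non-null

    ToyDecimalVector(int capacity) {
        data = new byte[capacity * WIDTH];
        validity = new byte[(capacity + 7) / 8];
    }

    int capacity() {
        return data.length / WIDTH;
    }

    // Unchecked write: fails with an exception if index is beyond the current capacity.
    void set(int index, BigDecimal value) {
        byte[] unscaled = value.unscaledValue().toByteArray();
        if (unscaled.length > WIDTH) {
            throw new IllegalArgumentException("value does not fit in " + WIDTH + " bytes");
        }
        int start = index * WIDTH;
        byte pad = (byte) (value.signum() < 0 ? 0xFF : 0x00);
        Arrays.fill(data, start, start + WIDTH, pad);                 // clear or sign-extend the slot
        System.arraycopy(unscaled, 0, data, start + WIDTH - unscaled.length, unscaled.length);
        validity[index / 8] |= (byte) (1 << (index % 8));             // mark the value as non-null
    }

    // Safe write: grows both buffers until index fits, then delegates to set.
    void setSafe(int index, BigDecimal value) {
        while (index >= capacity()) {
            data = Arrays.copyOf(data, Math.max(WIDTH, data.length * 2));
            validity = Arrays.copyOf(validity, (capacity() + 7) / 8);
        }
        set(index, value);
    }

    boolean isSet(int index) {
        return index < capacity() && (validity[index / 8] & (1 << (index % 8))) != 0;
    }

    public static void main(String[] args) {
        ToyDecimalVector vector = new ToyDecimalVector(2);
        vector.setSafe(5, new BigDecimal("1.25"));
        System.out.println("capacity=" + vector.capacity() + " isSet(5)=" + vector.isSet(5));
    }
}
```

With this shape the caller never touches the buffer or the validity bits directly, which is the usability gap described above.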

What are your thoughts on the above @jacques-n @julienledem @icexelloss ?

jacques-n · 2017-08-26T16:19:45Z

Definitely agree that the interface can be improved. I see no reason not to support BigDecimal as another interface. We'd just need to assert that precision/scale match expected. I'm not quite following the note on the validity vector and the capacity management as NullableDecimalVector.Mutator.setSafe* methods have numerous different options:

setSafe(int index, int isSet, int startField, ArrowBuf bufferField)
setSafe(int index, NullableDecimalHolder value)
setSafe(int index, DecimalHolder value)
setSafe(int index, ArrowBuf value)

wesm

It appears that decimals are always 16 bytes in memory in Java. In C++ at the moment we determine the byte width based on the precision. I think we can make all decimals 16-bytes (versus 4, 8, and 16 depending on the precision) but it would require refactoring, so we should discuss. It's too bad we didn't dig into this sooner. cc @cpcloud
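As a rough illustration of precision-dependent widths, the sketch below picks 4, 8, or 16 bytes from the precision. The thresholds are an assumption, not something stated in this thread: they are chosen only because every 9-digit number fits in a signed 32-bit integer, every 18-digit number in a signed 64-bit integer, and every 38-digit number in a signed 128-bit integer. The class name is hypothetical.

```java
// Hypothetical helper; the thresholds are an assumption, not taken from the C++ code.
final class DecimalWidthSketch {

    // Smallest of 4, 8, or 16 bytes whose signed integer range holds `precision` digits.
    static int byteWidthForPrecision(int precision) {
        if (precision < 1 || precision > 38) {
            throw new IllegalArgumentException("precision must be in [1, 38]: " + precision);
        }
        if (precision <= 9) {
            return 4;   // 10^9 - 1 < 2^31
        }
        if (precision <= 18) {
            return 8;   // 10^18 - 1 < 2^63
        }
        return 16;      // 10^38 - 1 < 2^127
    }

    public static void main(String[] args) {
        for (int precision : new int[] {5, 9, 10, 18, 19, 38}) {
            System.out.println(precision + " digits -> " + byteWidthForPrecision(precision) + " bytes");
        }
    }
}
```

A fixed 16-byte layout, as in the Java implementation discussed here, simply skips this lookup.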

BryanCutler · 2017-08-29T03:51:51Z

Hi @jacques-n , thanks for your response.

Regarding the NullableDecimalVector setSafe methods, my point was that DecimalUtility does not use these, but writes directly to an ArrowBuf. While it would be possible to create a new ArrowBuf, write the BigDecimal value to it and then call setSafe(int index, ArrowBuf value), it is not very efficient because the value is actually copied 3 times, using an intermediate byte array.

Instead, it is more efficient to write the decimal value directly to the inner vector data buffer, e.g. NullableDecimalVector.getDataBuffer(), but that requires that the buffer has enough capacity, and the validity bit also needs to be set afterwards.

Are you saying it would be ok to add the following methods to NullableDecimalVector and DecimalVector?

set(int index, BigDecimal value)
setSafe(int index, BigDecimal value)

If not, I could probably add these types of functions to the DecimalUtility to make it more user friendly at least.

BryanCutler · 2017-08-29T03:54:31Z

It appears that decimals are always 16 bytes in memory in Java

@wesm, I could look into adding support for 8-byte and 4-byte values, either before or after this PR is done, so that we could get integration fully working for decimals?

icexelloss · 2017-08-29T15:22:15Z

It feels awkward to me to have to use DecimalUtility to write a BigDecimal object to the vector. I don't see a reason why we shouldn't add:

set(int index, BigDecimal value)
setSafe(int index, BigDecimal value)

icexelloss · 2017-08-29T16:24:59Z

java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java

-        Mutator mutator = valueVector.getMutator();

        int innerVectorCount = vectorType.equals(OFFSET) ? count + 1 : count;
+        valueVector.setInitialCapacity(innerVectorCount);


Most of the vectors do not use setSafe so reading beyond the default capacity would cause problems. Maybe for cases where the majority of values are null this is overdoing it, but I think that is less common.
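The count + 1 for the OFFSET buffer in the diff above reflects that a variable-width vector with N values stores N + 1 offsets. A small sketch with plain arrays (hypothetical class name, no Arrow types) shows why one extra entry is needed:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of an offsets buffer; not Arrow code.
final class OffsetCountSketch {

    // offsets[i] is where value i starts; offsets[i + 1] is where it ends.
    static int[] offsetsFor(String[] values) {
        int[] offsets = new int[values.length + 1];
        for (int i = 0; i < values.length; i++) {
            int byteLength = values[i].getBytes(StandardCharsets.UTF_8).length;
            offsets[i + 1] = offsets[i] + byteLength;
        }
        return offsets;
    }

    public static void main(String[] args) {
        String[] values = {"ab", "", "cde"};
        System.out.println(Arrays.toString(offsetsFor(values)));
    }
}
```

Sizing the inner vector to that count up front avoids writing past the default capacity when the reader uses plain set calls.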

icexelloss · 2017-08-29T16:27:15Z

java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java

-      case VARBINARY:
-        String hexString = Hex.encodeHexString(((VarBinaryVector) valueVector).getAccessor().get(i));
-        generator.writeObject(hexString);
+      case VARBINARY: {


Should we remove the brackets?

It's there to limit the scope of the variables declared in the case block to that case only. Since now there are 2 blocks decoding hex values, just to prevent using the wrong variables.
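For readers unfamiliar with this idiom: all case labels in a Java switch share one scope, so two cases cannot both declare a local with the same name unless each is wrapped in its own block. The sketch below is a hypothetical, self-contained illustration; the enum, helper names, and zero-padding are assumptions and not the JsonFileWriter code.

```java
// Hypothetical illustration of per-case scoping; not the JsonFileWriter code.
final class CaseScopeSketch {
    enum Kind { VARBINARY, DECIMAL }

    static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder(2 * bytes.length);
        for (byte b : bytes) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Left-pads with zero bytes; assumes raw.length <= length.
    static byte[] leftPad(byte[] raw, int length) {
        byte[] out = new byte[length];
        System.arraycopy(raw, 0, out, length - raw.length, raw.length);
        return out;
    }

    static String encode(Kind kind, byte[] raw) {
        switch (kind) {
            case VARBINARY: {
                String hexString = toHex(raw);
                return hexString;
            }
            case DECIMAL: {
                // Same variable name as above; legal only because each case has its own block.
                String hexString = toHex(leftPad(raw, 16));
                return hexString;
            }
            default:
                throw new UnsupportedOperationException("unsupported kind: " + kind);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode(Kind.VARBINARY, new byte[] {0x0a, (byte) 0xff}));
        System.out.println(encode(Kind.DECIMAL, new byte[] {0x01}));
    }
}
```

Removing the braces would make the second hexString declaration a compile error, which is the mix-up the author wants the compiler to catch.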

BryanCutler · 2017-08-29T18:18:46Z

Thanks for the review @icexelloss. I'll try out adding set methods for BigDecimal to the vector classes, but I do wonder if maybe there was a design reason to not do this. Sort of like how the date/time vectors do not have any set methods with friendly classes, but rely on the DateTimeUtility to convert the time values to the required form (like a long value).

BryanCutler · 2017-08-31T00:17:54Z

I updated the PR, adding these methods to the Decimal vector APIs:

set(int index, BigDecimal value)
setSafe(int index, BigDecimal value)

and writeDecimal(BigDecimal value) to the writer API

icexelloss · 2017-08-31T13:52:27Z

java/vector/src/main/codegen/templates/FixedValueVectors.java

   }

+   public void set(int index, ${friendlyType} value){
+     DecimalUtility.writeBigDecimalToArrowBuf(value, data, index);


We should verify precision and scale

icexelloss · 2017-08-31T14:00:08Z

java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java

  }

-  public static BigDecimal getBigDecimalFromArrowBuf(ArrowBuf bytebuf, int startIndex, int scale) {
+  public static BigDecimal getBigDecimalFromArrowBuf(ArrowBuf bytebuf, int index, int scale) {


Thanks for fixing the consistency. Can you add the doc explaining index is not the byte index, but the value index?

Yeah, adding some docs would be nice here
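The kind of documentation being asked for might look like the sketch below, which reads from a plain byte array instead of an ArrowBuf. The class name is hypothetical, and the big-endian slot layout is an assumption for illustration. The point is only that index counts values, so the byte offset is index times 16.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.util.Arrays;

// Hypothetical stand-in that reads from a byte[] rather than an ArrowBuf.
final class DecimalIndexSketch {
    static final int DECIMAL_BYTE_LENGTH = 16;

    /**
     * Reads one decimal from the buffer.
     *
     * @param buffer raw data, DECIMAL_BYTE_LENGTH bytes per value
     * @param index  position of the value in the vector (0 for the first value),
     *               NOT a byte offset into the buffer
     * @param scale  scale to apply to the unscaled integer
     */
    static BigDecimal getBigDecimal(byte[] buffer, int index, int scale) {
        int startByte = index * DECIMAL_BYTE_LENGTH;
        if (index < 0 || startByte + DECIMAL_BYTE_LENGTH > buffer.length) {
            throw new IndexOutOfBoundsException("no value at index " + index);
        }
        byte[] slot = Arrays.copyOfRange(buffer, startByte, startByte + DECIMAL_BYTE_LENGTH);
        return new BigDecimal(new BigInteger(slot), scale);
    }

    public static void main(String[] args) {
        byte[] buffer = new byte[2 * DECIMAL_BYTE_LENGTH];
        buffer[15] = 3;   // last byte of value 0
        buffer[31] = 7;   // last byte of value 1
        System.out.println(getBigDecimal(buffer, 0, 0));
        System.out.println(getBigDecimal(buffer, 1, 2));
    }
}
```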

icexelloss · 2017-08-31T14:11:43Z

java/vector/src/test/java/org/apache/arrow/vector/TestDecimalVector.java

      values[i] = decimal;
-      decimalVector.getMutator().setIndexDefined(i);
-      DecimalUtility.writeBigDecimalToArrowBuf(decimal, decimalVector.getBuffer(), i);
+      decimalVector.getMutator().setSafe(i, decimal);


Can you add a test case to cover precision/scale mismatch with the setSafe(int, BigDecimal) method?

What mismatch do you mean exactly - like if the BigDecimal precision is more than Arrow supports?

I think,

For scale, we should check BigDecimal.scale() equals the vector.scale

For precision, we should check BigDecimal.precision() <= vector.precision

Ah, right definitely. I'll add that thanks!
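The two checks proposed in this exchange can be sketched with the JDK alone. The class and method names below are hypothetical and the exception type is an assumption; the PR's actual check may be shaped differently.

```java
import java.math.BigDecimal;

// Hypothetical helper illustrating the proposed checks; not the PR's code.
final class DecimalCheckSketch {

    // True if the value's scale equals the vector's and its precision does not exceed it.
    static boolean matches(BigDecimal value, int vectorPrecision, int vectorScale) {
        return value.scale() == vectorScale && value.precision() <= vectorPrecision;
    }

    // Throwing variant, as a setter might use before writing the value.
    static void checkPrecisionAndScale(BigDecimal value, int vectorPrecision, int vectorScale) {
        if (value.scale() != vectorScale) {
            throw new IllegalArgumentException(
                "BigDecimal scale " + value.scale() + " does not match vector scale " + vectorScale);
        }
        if (value.precision() > vectorPrecision) {
            throw new IllegalArgumentException(
                "BigDecimal precision " + value.precision() + " exceeds vector precision " + vectorPrecision);
        }
    }

    public static void main(String[] args) {
        System.out.println(matches(new BigDecimal("12.34"), 10, 2));
        System.out.println(matches(new BigDecimal("12.345"), 10, 2));
        System.out.println(matches(new BigDecimal("123456.78"), 5, 2));
    }
}
```

A caller whose value has a smaller scale can rescale it first, for example with value.setScale(2), which is exact when the scale only increases.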

icexelloss · 2017-08-31T14:24:51Z

java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java

+  }
+
  public static void writeBigDecimalToArrowBuf(BigDecimal value, ArrowBuf bytebuf, int index) {
    final byte[] bytes = value.unscaledValue().toByteArray();


Are there issues with endianness here? Java is big endian: https://docs.oracle.com/javase/7/docs/api/java/math/BigInteger.html#toByteArray() but Arrow is little endian? https://github.com/apache/arrow/blob/master/format/Layout.md#byte-order-endianness

cc @cpcloud

Checked with @cpcloud offline. This is consistent with the C++ implementation.

Thanks for checking on that
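For context on the question, BigInteger.toByteArray() returns big-endian two's-complement bytes, most significant byte first. The thread does not say which order ends up in the buffer, only that it matches C++, so the reversal helper below is purely hypothetical and shows what a little-endian layout of the same bytes would look like.

```java
import java.math.BigInteger;
import java.util.Arrays;

// Hypothetical illustration of byte order; not the DecimalUtility code.
final class EndiannessSketch {

    // Returns a copy of the input with the byte order reversed.
    static byte[] reversed(byte[] bytes) {
        byte[] out = new byte[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = bytes[bytes.length - 1 - i];
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] bigEndian = BigInteger.valueOf(0x0102).toByteArray();
        System.out.println("big-endian:    " + Arrays.toString(bigEndian));
        System.out.println("little-endian: " + Arrays.toString(reversed(bigEndian)));
    }
}
```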


BryanCutler · 2017-09-01T22:20:30Z

Updated with check for BigDecimal mismatch and tests, and added docs to DecimalUtility

icexelloss · 2017-09-01T22:38:48Z

@BryanCutler Thank you! LGTM.

@jacques-n Do you want to take another look at the setSafe(BigDecimal) method?

cpcloud · 2017-09-03T19:25:57Z

Is there anything preventing this from being merged?

wesm · 2017-09-03T19:51:48Z

I don't think so, let me take a quick look and then merge if all looks good

wesm

+1, merging so that work on integration tests can proceed. It would be good if @jacques-n or @julienledem could take a last look to make sure nothing is amiss with the new methods here, in case we need to fix anything before 0.7.0 goes out.

wesm · 2017-09-03T19:55:15Z

thanks @BryanCutler and @icexelloss and others!

This change adds support to reading and writing Decimal type vectors from JSON files. Data values are written as encoded hex and padded with 0's up to 16 bytes.

Added roundtrip unit tests.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes apache#994 from BryanCutler/java-json-decimal-support-ARROW-1238 and squashes the following commits:

28c1e3e [Bryan Cutler] added test for BigDecimal precision and scale mismatch
31b7ec1 [Bryan Cutler] Added check that BigDecimal precision and scale matches that of the vector
10cac9c [Bryan Cutler] added vector API for set and setSafe with BigDecimal
c5e8fba [Bryan Cutler] minor tweaks to JsonFileWriter
da11b4f [Bryan Cutler] removed debug line
f4560d9 [Bryan Cutler] added Decimal JSON support, Java roundtrip unit tests

f4560d9 · added Decimal JSON support, Java roundtrip unit tests

da11b4f · removed debug line

wesm reviewed Aug 27, 2017

icexelloss reviewed Aug 29, 2017

BryanCutler added 2 commits August 29, 2017 11:41

c5e8fba · minor tweaks to JsonFileWriter

10cac9c · added vector API for set and setSafe with BigDecimal

icexelloss reviewed Aug 31, 2017

BryanCutler added 2 commits September 1, 2017 14:46

31b7ec1 · Added check that BigDecimal precision and scale matches that of the vector

28c1e3e · added test for BigDecimal precision and scale mismatch

wesm approved these changes Sep 3, 2017

asfgit closed this in 08b41f9 Sep 3, 2017

BryanCutler deleted the java-json-decimal-support-ARROW-1238 branch November 19, 2018 05:48
