@BryanCutler
Member

This change adds support for reading and writing Decimal type vectors from and to JSON files. Data values are written as hex-encoded strings, padded with 0's up to 16 bytes.

Added roundtrip unit tests.
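As a rough illustration of the encoding described above, here is a minimal sketch using only the JDK. The class and method names (`DecimalHexSketch`, `toPaddedHex`, `fromPaddedHex`) are hypothetical and are not part of this PR. The description only mentions padding with 0's; how negative values are padded is not stated, so the sketch sign-extends them (an assumption) to keep the round trip lossless.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.util.Arrays;

/** Hypothetical helper, not Arrow code: hex round trip for a 16-byte decimal. */
final class DecimalHexSketch {
  static final int DECIMAL_BYTE_LENGTH = 16;

  /** Encodes the unscaled value as 32 hex characters (16 big-endian bytes). */
  static String toPaddedHex(BigDecimal value) {
    byte[] raw = value.unscaledValue().toByteArray(); // big-endian two's complement
    if (raw.length > DECIMAL_BYTE_LENGTH) {
      throw new IllegalArgumentException("value does not fit in 16 bytes");
    }
    byte[] padded = new byte[DECIMAL_BYTE_LENGTH];    // zero-filled by default
    if (value.signum() < 0) {
      Arrays.fill(padded, (byte) 0xFF);               // assumption: sign-extend negatives
    }
    System.arraycopy(raw, 0, padded, DECIMAL_BYTE_LENGTH - raw.length, raw.length);
    StringBuilder hex = new StringBuilder(2 * DECIMAL_BYTE_LENGTH);
    for (byte b : padded) {
      hex.append(String.format("%02x", b & 0xFF));
    }
    return hex.toString();
  }

  /** Decodes the hex string back into a BigDecimal with the given scale. */
  static BigDecimal fromPaddedHex(String hex, int scale) {
    byte[] bytes = new byte[hex.length() / 2];
    for (int i = 0; i < bytes.length; i++) {
      bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
    }
    return new BigDecimal(new BigInteger(bytes), scale);
  }
}
```

The scale is not part of the encoded bytes in this sketch, so the reader has to supply it from the field's type.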

ArrowType.Decimal type = (ArrowType.Decimal) vector.getField().getType();
BigDecimal decimalValue = new BigDecimal(BigInteger.valueOf(value), type.getScale());
DecimalUtility.writeBigDecimalToArrowBuf(decimalValue, vector.getBuffer(), index);
vector.getValidityVector().getMutator().setToOne(index);
Member Author

This procedure to write BigDecimal values to a vector is pretty awkward

Contributor

I agree. I prefer that we add setSafe(index, decimalValue)

// Verify decimal 1 vector
BigDecimal readValue = decimalVector1.getAccessor().getObject(i);
ArrowType.Decimal type = (ArrowType.Decimal) decimalVector1.getField().getType();
BigDecimal genValue = new BigDecimal(BigInteger.valueOf(i), type.getScale());
Member Author

it would be much cleaner to get the scale with #972 merged

Contributor

@BryanCutler is #972 just for testing purposes?

Member Author

No, that PR came from me adding Timestamp to Spark. I needed to check the time zone string given a NullableTimeStampMicroTZVector

Contributor

I see.

@BryanCutler
Member Author

It ended up being really awkward to write decimal values to the vector. Maybe there is an easier way that I just missed? The DecimalVector API only has set(int index, ArrowBuf value), so this is the basic procedure I took, assuming I have a Java BigDecimal value:

  1. Get a pointer to the vector buffer
  2. Use DecimalUtility.writeBigDecimalToArrowBuf, which converts the value to a byte array and writes it to the buffer at the given index
  3. Need to set the validity vector bit to 1 for the given index

This is not very user friendly and leaves lots of room for error. It is also unsafe, since it assumes the vector already has the needed capacity. I'm assuming we don't want the vector API to directly accept a BigDecimal, but what about a byte array? That would make things a little better, as the DecimalUtility could convert the BigDecimal to a byte array, and then any work on the vector buffer could be done internally and allow for a setSafe method.

What are your thoughts on the above @jacques-n @julienledem @icexelloss ?
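To make the three steps above concrete, here is a sketch on a toy model built from JDK types only. `ToyDecimalVector` and `writeToBuffer` are hypothetical stand-ins, not Arrow classes, and the sign-extension padding is an assumption. The point is only to show why the procedure is error-prone: the write does not grow the buffer, and the value stays invisible until the validity bit is set separately.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.util.BitSet;

/** Hypothetical stand-in for a fixed-width decimal vector; not an Arrow class. */
final class ToyDecimalVector {
  static final int WIDTH = 16;            // bytes per decimal slot
  final ByteBuffer data;                  // value buffer
  final BitSet validity = new BitSet();   // bit set = slot holds a non-null value
  final int scale;

  ToyDecimalVector(int capacity, int scale) {
    this.data = ByteBuffer.allocate(capacity * WIDTH);
    this.scale = scale;
  }

  /** Step 2: copy the unscaled bytes into a slot; no growth, no validity update. */
  static void writeToBuffer(BigDecimal value, ByteBuffer buf, int index) {
    byte[] raw = value.unscaledValue().toByteArray(); // big-endian two's complement
    if (raw.length > WIDTH) {
      throw new IllegalArgumentException("value does not fit in " + WIDTH + " bytes");
    }
    int start = index * WIDTH;
    int padLength = WIDTH - raw.length;
    byte pad = (byte) (value.signum() < 0 ? 0xFF : 0x00); // assumption: sign-extend
    for (int i = 0; i < padLength; i++) {
      buf.put(start + i, pad);            // throws if the slot is past the buffer's end
    }
    for (int i = 0; i < raw.length; i++) {
      buf.put(start + padLength + i, raw[i]);
    }
  }

  /** Returns null for slots whose validity bit was never set. */
  BigDecimal get(int index) {
    if (!validity.get(index)) {
      return null;
    }
    byte[] bytes = new byte[WIDTH];
    for (int i = 0; i < WIDTH; i++) {
      bytes[i] = data.get(index * WIDTH + i);
    }
    return new BigDecimal(new BigInteger(bytes), scale);
  }

  public static void main(String[] args) {
    ToyDecimalVector vector = new ToyDecimalVector(4, 2);
    ByteBuffer buf = vector.data;                     // step 1: get the buffer
    writeToBuffer(new BigDecimal("1.50"), buf, 0);    // step 2: write the bytes
    System.out.println(vector.get(0));                // value is not visible yet
    vector.validity.set(0);                           // step 3: mark the slot valid
    System.out.println(vector.get(0));
  }
}
```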

@jacques-n
Contributor

Definitely agree that the interface can be improved. I see no reason not to support BigDecimal as another interface. We'd just need to assert that precision/scale match expected. I'm not quite following the note on the validity vector and the capacity management as NullableDecimalVector.Mutator.setSafe* methods have numerous different options:

setSafe(int index, int isSet, int startField, ArrowBuf bufferField)
setSafe(int index, NullableDecimalHolder value)
setSafe(int index, DecimalHolder value)
setSafe(int index, ArrowBuf value)

Member

@wesm left a comment

It appears that decimals are always 16 bytes in memory in Java. In C++ at the moment we determine the byte width based on the precision. I think we can make all decimals 16-bytes (versus 4, 8, and 16 depending on the precision) but it would require refactoring, so we should discuss. It's too bad we didn't dig into this sooner. cc @cpcloud
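For readers unfamiliar with the idea of a precision-dependent byte width, a minimal sketch follows. The thresholds are an assumption derived from the signed integer ranges (9 digits fit in 32 bits, 18 in 64, 38 in 128); this thread does not state the exact rule used in C++, and `byteWidthForPrecision` is a hypothetical name.

```java
/** Hypothetical helper: picks the smallest of 4, 8 or 16 bytes for a decimal precision. */
final class DecimalWidthSketch {
  static int byteWidthForPrecision(int precision) {
    if (precision < 1 || precision > 38) {
      throw new IllegalArgumentException("precision must be in [1, 38]: " + precision);
    }
    if (precision <= 9) {
      return 4;   // 10^9 - 1 < 2^31
    }
    if (precision <= 18) {
      return 8;   // 10^18 - 1 < 2^63
    }
    return 16;    // 10^38 - 1 < 2^127
  }
}
```

Making every decimal 16 bytes wide, as suggested above, would replace this function with a constant at the cost of extra space for low-precision columns.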

@BryanCutler
Member Author

Hi @jacques-n , thanks for your response.

Regarding the NullableDecimalVector setSafe methods, my point was that DecimalUtility does not use these, but writes directly to an ArrowBuf. While it would be possible to create a new ArrowBuf, write the BigDecimal value to it and then call setSafe(int index, ArrowBuf value), it is not very efficient because the value is actually copied 3 times, using an intermediate byte array.

Instead, it is more efficient to write the decimal value directly to the inner vector data buffer, e.g. NullableDecimalVector.getDataBuffer(), but that requires that the buffer has enough capacity, and the validity bit also needs to be set afterwards.

Are you saying it would be ok to add the following methods to NullableDecimalVector and DecimalVector?

set(int index, BigDecimal value)
setSafe(int index, BigDecimal value)

If not, I could probably add these kinds of functions to the DecimalUtility to make it more user friendly at least.
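The difference between the two proposed methods can be sketched on a toy model. `GrowableDecimalVector` is hypothetical and is not the Arrow implementation; the doubling growth policy and the sign-extension padding are assumptions. Both methods write directly into the data buffer and set the validity bit, and only `setSafe` grows the buffer first.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical model of set/setSafe with BigDecimal; not the Arrow implementation. */
final class GrowableDecimalVector {
  private static final int WIDTH = 16;    // bytes per decimal slot
  private byte[] data;
  private final BitSet validity = new BitSet();
  private final int scale;

  GrowableDecimalVector(int initialCapacity, int scale) {
    this.data = new byte[initialCapacity * WIDTH];
    this.scale = scale;
  }

  int capacity() {
    return data.length / WIDTH;
  }

  /** Writes in place and marks the slot valid; fails if index is beyond capacity. */
  void set(int index, BigDecimal value) {
    if (index < 0 || index >= capacity()) {
      throw new IndexOutOfBoundsException("index " + index + ", capacity " + capacity());
    }
    byte[] raw = value.unscaledValue().toByteArray();
    if (raw.length > WIDTH) {
      throw new IllegalArgumentException("value does not fit in " + WIDTH + " bytes");
    }
    int start = index * WIDTH;
    int padEnd = start + WIDTH - raw.length;
    Arrays.fill(data, start, padEnd, (byte) (value.signum() < 0 ? 0xFF : 0x00));
    System.arraycopy(raw, 0, data, padEnd, raw.length);
    validity.set(index);
  }

  /** Grows the buffer (doubling, an assumed policy) until index fits, then sets. */
  void setSafe(int index, BigDecimal value) {
    while (index >= capacity()) {
      data = Arrays.copyOf(data, Math.max(WIDTH, data.length * 2));
    }
    set(index, value);
  }

  /** Returns null for slots that were never set. */
  BigDecimal get(int index) {
    if (!validity.get(index)) {
      return null;
    }
    byte[] bytes = Arrays.copyOfRange(data, index * WIDTH, (index + 1) * WIDTH);
    return new BigDecimal(new BigInteger(bytes), scale);
  }
}
```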

@BryanCutler
Member Author

It appears that decimals are always 16 bytes in memory in Java

@wesm, I could look into adding support for 8- and 4-byte values either before or after this PR is done, so that we can get integration fully working for decimals.

@icexelloss
Contributor

It feels awkward to me to have to use DecimalUtility to write a BigDecimal object to the vector. I don't see a reason why we shouldn't add:

set(int index, BigDecimal value)
setSafe(int index, BigDecimal value)

Mutator mutator = valueVector.getMutator();

int innerVectorCount = vectorType.equals(OFFSET) ? count + 1 : count;
valueVector.setInitialCapacity(innerVectorCount);
Contributor

Why this?

Member Author

Most of the vectors do not use setSafe so reading beyond the default capacity would cause problems. Maybe for cases where the majority of values are null this is overdoing it, but I think that is less common.
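The `count + 1` for OFFSET buffers in the quoted line follows from how offsets delimit variable-width values: value i spans offsets[i] to offsets[i + 1], so N values need N + 1 entries. A small JDK-only sketch, with the hypothetical helper name `buildOffsets`:

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: builds an offsets array for variable-width values. */
final class OffsetCountSketch {
  /** Returns one entry per value plus a trailing end offset. */
  static int[] buildOffsets(String[] values) {
    int[] offsets = new int[values.length + 1];   // offsets[0] is always 0
    for (int i = 0; i < values.length; i++) {
      int byteLength = values[i].getBytes(StandardCharsets.UTF_8).length;
      offsets[i + 1] = offsets[i] + byteLength;   // end of value i, start of value i + 1
    }
    return offsets;
  }
}
```

Sizing the inner vector up front this way avoids relying on setSafe while reading, which is the concern raised in the reply above.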

Contributor

LGTM

case VARBINARY:
String hexString = Hex.encodeHexString(((VarBinaryVector) valueVector).getAccessor().get(i));
generator.writeObject(hexString);
case VARBINARY: {
Contributor

Should we remove the brackets?

Member Author

It's there to limit the scope of the variables declared in the case block to that case only. There are now two blocks decoding hex values, so the braces prevent accidentally using the wrong variables.
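A small standalone illustration of the scoping point, unrelated to the actual JsonFileWriter code (the enum, class and method names here are hypothetical). Without the braces, both declarations of `hexString` would share the single scope of the switch body and the second one would not compile.

```java
/** Hypothetical example of brace-scoped case blocks. */
final class CaseScopeSketch {
  enum Kind { VARBINARY, DECIMAL }

  static String describe(Kind kind, byte[] raw) {
    switch (kind) {
      case VARBINARY: {
        String hexString = toHex(raw);    // visible only inside this block
        return "binary:" + hexString;
      }
      case DECIMAL: {
        String hexString = toHex(raw);    // same name, separate scope, no clash
        return "decimal:" + hexString;
      }
      default:
        throw new IllegalArgumentException("unsupported kind: " + kind);
    }
  }

  static String toHex(byte[] bytes) {
    StringBuilder hex = new StringBuilder(2 * bytes.length);
    for (byte b : bytes) {
      hex.append(String.format("%02x", b & 0xFF));
    }
    return hex.toString();
  }
}
```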

Contributor

LGTM

@BryanCutler
Member Author

Thanks for the review @icexelloss. I'll try out adding set methods for BigDecimal to the vector classes, but I do wonder if maybe there was a design reason not to do this. It's sort of like how the date/time vectors do not have any set methods with friendly classes, but rely on the DateTimeUtility to convert the time values to the required form (like a long value).

@BryanCutler
Member Author

I updated the PR to add these methods to the Decimal vector APIs:

set(int index, BigDecimal value)
setSafe(int index, BigDecimal value)

and writeDecimal(BigDecimal value) to the writer API

}

public void set(int index, ${friendlyType} value){
DecimalUtility.writeBigDecimalToArrowBuf(value, data, index);
Contributor

We should verify precision and scale

}

public static BigDecimal getBigDecimalFromArrowBuf(ArrowBuf bytebuf, int startIndex, int scale) {
public static BigDecimal getBigDecimalFromArrowBuf(ArrowBuf bytebuf, int index, int scale) {
Contributor

Thanks for fixing the consistency. Can you add a doc comment explaining that index is the value index, not the byte index?

Member Author

Yeah, adding some docs would be nice here
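The distinction between value index and byte index can be shown with a short JDK-only sketch. `ValueIndexSketch.read` is a hypothetical helper operating on a plain `ByteBuffer`, not the actual `getBigDecimalFromArrowBuf`; it assumes 16 bytes per value, as discussed earlier in this thread.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.nio.ByteBuffer;

/** Hypothetical reader: index counts values, not bytes. */
final class ValueIndexSketch {
  static final int DECIMAL_BYTE_LENGTH = 16;

  /**
   * Reads the decimal stored in slot {@code index}.
   * The byte offset into the buffer is index * DECIMAL_BYTE_LENGTH.
   */
  static BigDecimal read(ByteBuffer buf, int index, int scale) {
    int startByte = index * DECIMAL_BYTE_LENGTH;
    byte[] bytes = new byte[DECIMAL_BYTE_LENGTH];
    for (int i = 0; i < DECIMAL_BYTE_LENGTH; i++) {
      bytes[i] = buf.get(startByte + i);  // absolute read, does not move the position
    }
    return new BigDecimal(new BigInteger(bytes), scale);
  }
}
```

With this convention, slot 2 starts at byte 32, so a caller passing a byte offset by mistake would read a slot far past the intended one.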

values[i] = decimal;
decimalVector.getMutator().setIndexDefined(i);
DecimalUtility.writeBigDecimalToArrowBuf(decimal, decimalVector.getBuffer(), i);
decimalVector.getMutator().setSafe(i, decimal);
Contributor

Can you add a test case to cover precision/scale mismatch with the setSafe(int, BigDecimal) method?

Member Author

What mismatch do you mean exactly? Like if the BigDecimal precision is more than Arrow supports?

Contributor

I think,

For scale, we should check BigDecimal.scale() equals the vector.scale

For precision, we should check BigDecimal.precision() <= vector.precision

Member Author

Ah, right, definitely. I'll add that, thanks!
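The two conditions proposed in this thread can be sketched as a standalone check. The class name, method signature and exception type below are hypothetical and may differ from what was actually added to the PR; only the two comparisons come from the discussion.

```java
import java.math.BigDecimal;

/** Hypothetical validation helper based on the two conditions discussed above. */
final class DecimalCheckSketch {
  static void checkPrecisionAndScale(BigDecimal value, int vectorPrecision, int vectorScale) {
    // The scale must match exactly, since the scale is not stored per value.
    if (value.scale() != vectorScale) {
      throw new IllegalArgumentException(
          "BigDecimal scale " + value.scale() + " must equal vector scale " + vectorScale);
    }
    // The value may use fewer digits than the vector allows, but not more.
    if (value.precision() > vectorPrecision) {
      throw new IllegalArgumentException(
          "BigDecimal precision " + value.precision()
              + " exceeds vector precision " + vectorPrecision);
    }
  }
}
```

For example, `new BigDecimal("123.45")` has precision 5 and scale 2, so it would pass for a vector declared with precision 10 and scale 2, and be rejected for scale 3 or for precision 4.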

}

public static void writeBigDecimalToArrowBuf(BigDecimal value, ArrowBuf bytebuf, int index) {
final byte[] bytes = value.unscaledValue().toByteArray();
Contributor

Checked with @cpcloud offline. This is consistent with the C++ implementation.

Member Author

Thanks for checking on that

@BryanCutler
Member Author

Updated with check for BigDecimal mismatch and tests, and added docs to DecimalUtility

@icexelloss
Contributor

@BryanCutler Thank you! LGTM.

@jacques-n Do you want to take another look at the setSafe(BigDecimal) method?

@cpcloud
Contributor

cpcloud commented Sep 3, 2017

Is there anything preventing this from being merged?

@wesm
Member

wesm commented Sep 3, 2017

I don't think so, let me take a quick look and then merge if all looks good

Member

@wesm left a comment

+1, merging so that work on integration tests can proceed. It would help if @jacques-n or @julienledem could take a last look to make sure nothing is amiss with the new methods here, in case we need to fix anything before 0.7.0 goes out.

@asfgit closed this in 08b41f9 on Sep 3, 2017
@wesm
Member

wesm commented Sep 3, 2017

thanks @BryanCutler and @icexelloss and others!

@BryanCutler deleted the java-json-decimal-support-ARROW-1238 branch on November 19, 2018 05:48
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
Author: Bryan Cutler <cutlerb@gmail.com>

Closes apache#994 from BryanCutler/java-json-decimal-support-ARROW-1238 and squashes the following commits:

28c1e3e [Bryan Cutler] added test for BigDecimal precision and scale mismatch
31b7ec1 [Bryan Cutler] Added check that BigDecimal precision and scale matches that of the vector
10cac9c [Bryan Cutler] added vector API for set and setSafe with BigDecimal
c5e8fba [Bryan Cutler] minor tweaks to JsonFileWriter
da11b4f [Bryan Cutler] removed debug line
f4560d9 [Bryan Cutler] added Decimal JSON support, Java roundtrip unit tests