-
Notifications
You must be signed in to change notification settings - Fork 3k
Optimized spark vectorized read parquet decimal #3249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi @rdblue @jackye1995 @nastra, could you help to review this? thanks a lot. |
|
@ConeyLiu I'll try to review this next week. In the meantime could you please run some benchmarks for this PR and attach the results here? So in your case it would be: |
|
before after I added decimal benchmark: after |
|
Running CI |
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactory.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactory.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java
Outdated
Show resolved
Hide resolved
spark/src/main/java/org/apache/iceberg/spark/source/BatchDataReader.java
Outdated
Show resolved
Hide resolved
|
@ConeyLiu, this looks like an improvement, but I think that it relies too heavily on config properties that must be set the same way everywhere rather than detecting what should be done from inputs. |
|
thanks @rdblue for the review. I will do some refactoring. |
|
Hi @rdblue, please take a look again, thanks a lot. Also update benchmark results: |
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
Outdated
Show resolved
Hide resolved
| break; | ||
| case UUID: | ||
| case FIXED_WIDTH_BINARY: | ||
| case FIXED_LENGTH_DECIMAL: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, combined the FIXED_WIDTH_BINARY to use fixedSizeBinaryBatchReader instead of fixedWidthTypeBinaryBatchReader. fixedWidthTypeBinaryBatchReader uses VarBinaryVector, it should use FixedSizeBinaryVector. Please correct me if I am wrong @nastra
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep those changes for FIXED_WIDTH_BINARY are correct. Note that support for FIXED is being added as part of #3029
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, I'm not sure we can change stuff for FIXED_LENGTH_DECIMAL. See also
iceberg/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedPageIterator.java
Lines 310 to 316 in 8058ec1
| /** | |
| * Method for reading a batch of decimals backed by fixed length byte array parquet data type. Arrow stores all | |
| * decimals in 16 bytes. This method provides the necessary padding to the decimals read. Moreover, Arrow interprets | |
| * the decimals in Arrow buffer as little endian. Parquet stores fixed length decimals as big endian. So, this method | |
| * uses {@link DecimalVector#setBigEndian(int, byte[])} method so that the data in Arrow vector is indeed little | |
| * endian. | |
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line 362 in dbfa71e
| valuesReader.getBuffer(typeWidth).get(byteArray, 0, typeWidth); |
I think the
valuesReader.getBuffer(typeWidth) shoud already return a little endian?
public ByteBuffer getBuffer(int length) {
try {
return valuesInputStream.slice(length).order(ByteOrder.LITTLE_ENDIAN);
} catch (IOException e) {
throw new ParquetDecodingException("Failed to read " + length + " bytes", e);
}
}It seems like we don't need to DecimalVector#setBigEndian(int, byte[]).
And also, the UUID https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#uuid stored as big endian as well.
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactory.java
Outdated
Show resolved
Hide resolved
| break; | ||
| case UUID: | ||
| case FIXED_WIDTH_BINARY: | ||
| case FIXED_LENGTH_DECIMAL: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep those changes for FIXED_WIDTH_BINARY are correct. Note that support for FIXED is being added as part of #3029
| public VectorizedArrowReader( | ||
| ColumnDescriptor desc, | ||
| Types.NestedField icebergField, | ||
| Types.NestedField originalIcebergField, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rdblue do we need to keep the original constructor without the originalIcebergField or could it be removed?
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
Show resolved
Hide resolved
|
@ConeyLiu, could you rebase this? I'll take another look so we can get it in. Thanks! |
singhpk234
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ConeyLiu, for this change, I also stumbled on something similar recently.
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactory.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactory.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactory.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
Show resolved
Hide resolved
|
Sorry for the delayed review, was focusing on getting 1.2 release out. I think it's mostly there, added a few comments. |
|
@nastra could you also take another look? |
|
Thanks, @jackye1995 @zhongyujiang for the review. Please take another look when you are free. |
nastra
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall this LGTM, can we remove IntBackedDecimalBatchReader / LongBackedDecimalBatchReader / FixedLengthDecimalBatchReader / IntBackedDecimalPageReader / LongBackedDecimalPageReader?
@ConeyLiu could you please run the benchmark on your GH fork and post the link to it? I usually run benchmarks on my fork here and you should have the same action in your fork.
|
@nastra benchmark results at here
Those methods/classes are public, is it safe to delete them? Or mark them as deprecated. |
|
Those classes should be safe to delete. Also there are no API guarantees on |
|
Yes I agree, we can just mark it as deprecated and remove it after the next release just to follow the process |
|
Thanks, @nastra @jackye1995 for your time. Those public methods/classes are marked as deprecated. |
|
@ConeyLiu I think we should also deprecate |
|
@nastra updated, and also deprecate |
jackye1995
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me, @nastra do you have any further comments?
|
Thanks for the contribution! Thanks for the review @nastra ! |
|
Thanks all. |
Arrow use 16 bytes for all decimal vector, however, the data could be stored as int/long in parquet file for different precision decimal data. We only need to use the int/long arrow vector for int/long backed decimal data. This could improve performance a lot when we do a full table scan on the store_sales table(1TB data scale).
The existed UT should cover the vectorized read case. I could add the extra UTs if needed.