GH-44345: [C++][Parquet] Fully support arrow decimal32/64 in Parquet #45351
Closed
23 commits
cba31c7 Support decimal32/64 in schema conversion (curioustien)
a9398a2 Support decimal32/64 in column writer (curioustien)
e1dc023 Restrict column writer with correct decimal types (curioustien)
6032b02 Support decimal32/64 in reader & vector kernels & tests (curioustien)
290de24 Pyarrow parquet to pandas (curioustien)
e5b996e Address comments (curioustien)
44f1adc Add more tests in arrow_schema_test (curioustien)
c017323 Add more tests in arrow_reader_writer_test (curioustien)
63d307b Add more typed tests for small decimals (curioustien)
77dd7d3 Document new flag (curioustien)
d81cf13 Add decimal32/64 list type support arrow to pandas (curioustien)
424472f Support smallest_decimal_enabled flag in pyarrow (curioustien)
d1687a7 Revert writer schema manifest arg passing change (curioustien)
1f0fb7b Merge remote-tracking branch 'upstream/main' into parquet-decimal-test (curioustien)
52711d5 Fix lint (curioustien)
f64d6d9 Remove extra doc (curioustien)
3fb307e Revert FileReader changes (curioustien)
f279349 Delay scratch buffer pointer cast (curioustien)
8a78c72 Use ArrowReaderProperties (curioustien)
29e98ff Merge remote-tracking branch 'upstream/main' into parquet-decimal-test (curioustien)
d2e1ffa Revert "Delay scratch buffer pointer cast" (curioustien)
a8304f3 Remove mistake include (curioustien)
de295e3 Merge remote-tracking branch 'upstream/main' into parquet-decimal-test (curioustien)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -28,6 +28,7 @@ | |
| #include <vector> | ||
|
|
||
| #include "arrow/array.h" | ||
| #include "arrow/array/array_decimal.h" | ||
| #include "arrow/compute/api.h" | ||
| #include "arrow/datum.h" | ||
| #include "arrow/io/memory.h" | ||
|
|
@@ -37,6 +38,7 @@ | |
| #include "arrow/status.h" | ||
| #include "arrow/table.h" | ||
| #include "arrow/type.h" | ||
| #include "arrow/type_fwd.h" | ||
| #include "arrow/type_traits.h" | ||
| #include "arrow/util/base64.h" | ||
| #include "arrow/util/bit_util.h" | ||
|
|
@@ -69,6 +71,12 @@ using arrow::Decimal128Type; | |
| using arrow::Decimal256; | ||
| using arrow::Decimal256Array; | ||
| using arrow::Decimal256Type; | ||
| using arrow::Decimal32; | ||
| using arrow::Decimal32Array; | ||
| using arrow::Decimal32Type; | ||
| using arrow::Decimal64; | ||
| using arrow::Decimal64Array; | ||
| using arrow::Decimal64Type; | ||
| using arrow::Field; | ||
| using arrow::Int32Array; | ||
| using arrow::ListArray; | ||
|
|
@@ -590,7 +598,8 @@ Status TransferBinary(RecordReader* reader, MemoryPool* pool, | |
| } | ||
|
|
||
| // ---------------------------------------------------------------------- | ||
| // INT32 / INT64 / BYTE_ARRAY / FIXED_LEN_BYTE_ARRAY -> Decimal128 || Decimal256 | ||
| // INT32 / INT64 / BYTE_ARRAY / FIXED_LEN_BYTE_ARRAY | ||
| // -> Decimal32 || Decimal64 || Decimal128 || Decimal256 | ||
|
|
||
| template <typename DecimalType> | ||
| Status RawBytesToDecimalBytes(const uint8_t* value, int32_t byte_width, | ||
|
|
@@ -603,6 +612,16 @@ Status RawBytesToDecimalBytes(const uint8_t* value, int32_t byte_width, | |
| template <typename DecimalArrayType> | ||
| struct DecimalTypeTrait; | ||
|
|
||
| template <> | ||
| struct DecimalTypeTrait<::arrow::Decimal32Array> { | ||
| using value = ::arrow::Decimal32; | ||
| }; | ||
|
|
||
| template <> | ||
| struct DecimalTypeTrait<::arrow::Decimal64Array> { | ||
| using value = ::arrow::Decimal64; | ||
| }; | ||
|
|
||
| template <> | ||
| struct DecimalTypeTrait<::arrow::Decimal128Array> { | ||
| using value = ::arrow::Decimal128; | ||
|
|
@@ -721,7 +740,7 @@ struct DecimalConverter<DecimalArrayType, ByteArrayType> { | |
| } | ||
| }; | ||
|
|
||
| /// \brief Convert an Int32 or Int64 array into a Decimal128Array | ||
| /// \brief Convert an Int32 or Int64 array into a Decimal32/64/128/256Array | ||
| /// The parquet spec allows systems to write decimals in int32, int64 if the values are | ||
| /// small enough to fit in less 4 bytes or less than 8 bytes, respectively. | ||
| /// This function implements the conversion from int32 and int64 arrays to decimal arrays. | ||
|
|
@@ -731,9 +750,11 @@ template < | |
| std::is_same<ParquetIntegerType, Int64Type>::value>> | ||
| static Status DecimalIntegerTransfer(RecordReader* reader, MemoryPool* pool, | ||
| const std::shared_ptr<Field>& field, Datum* out) { | ||
| // Decimal128 and Decimal256 are only Arrow constructs. Parquet does not | ||
| // Decimal32 and Decimal64 are only Arrow constructs. Parquet does not | ||
|
Member
The comment seems not correct?
||
| // specifically distinguish between decimal byte widths. | ||
| DCHECK(field->type()->id() == ::arrow::Type::DECIMAL128 || | ||
| DCHECK(field->type()->id() == ::arrow::Type::DECIMAL32 || | ||
| field->type()->id() == ::arrow::Type::DECIMAL64 || | ||
| field->type()->id() == ::arrow::Type::DECIMAL128 || | ||
| field->type()->id() == ::arrow::Type::DECIMAL256); | ||
|
|
||
| const int64_t length = reader->values_written(); | ||
|
|
@@ -757,7 +778,13 @@ static Status DecimalIntegerTransfer(RecordReader* reader, MemoryPool* pool, | |
| // sign/zero extend int32_t values, otherwise a no-op | ||
| const auto value = static_cast<int64_t>(values[i]); | ||
|
|
||
| if constexpr (std::is_same_v<DecimalArrayType, Decimal128Array>) { | ||
| if constexpr (std::is_same_v<DecimalArrayType, Decimal32Array>) { | ||
| ::arrow::Decimal32 decimal(value); | ||
| decimal.ToBytes(out_ptr); | ||
| } else if constexpr (std::is_same_v<DecimalArrayType, Decimal64Array>) { | ||
| ::arrow::Decimal64 decimal(value); | ||
| decimal.ToBytes(out_ptr); | ||
| } else if constexpr (std::is_same_v<DecimalArrayType, Decimal128Array>) { | ||
| ::arrow::Decimal128 decimal(value); | ||
| decimal.ToBytes(out_ptr); | ||
| } else { | ||
|
|
@@ -900,14 +927,58 @@ Status TransferColumnData(RecordReader* reader, | |
| } | ||
| RETURN_NOT_OK(TransferHalfFloat(reader, pool, value_field, &result)); | ||
| } break; | ||
| case ::arrow::Type::DECIMAL32: { | ||
| switch (descr->physical_type()) { | ||
| case ::parquet::Type::INT32: { | ||
| auto fn = DecimalIntegerTransfer<Decimal32Array, Int32Type>; | ||
| RETURN_NOT_OK(fn(reader, pool, value_field, &result)); | ||
| } break; | ||
| case ::parquet::Type::BYTE_ARRAY: { | ||
| auto fn = &TransferDecimal<Decimal32Array, ByteArrayType>; | ||
| RETURN_NOT_OK(fn(reader, pool, value_field, &result)); | ||
| } break; | ||
| case ::parquet::Type::FIXED_LEN_BYTE_ARRAY: { | ||
| auto fn = &TransferDecimal<Decimal32Array, FLBAType>; | ||
| RETURN_NOT_OK(fn(reader, pool, value_field, &result)); | ||
| } break; | ||
| default: | ||
| return Status::Invalid( | ||
| "Physical type for decimal32 must be int32, byte array, or fixed length " | ||
| "binary"); | ||
| } | ||
| } break; | ||
| case ::arrow::Type::DECIMAL64: { | ||
| switch (descr->physical_type()) { | ||
| case ::parquet::Type::INT32: { | ||
| auto fn = DecimalIntegerTransfer<Decimal64Array, Int32Type>; | ||
| RETURN_NOT_OK(fn(reader, pool, value_field, &result)); | ||
| } break; | ||
| case ::parquet::Type::INT64: { | ||
| auto fn = DecimalIntegerTransfer<Decimal64Array, Int64Type>; | ||
| RETURN_NOT_OK(fn(reader, pool, value_field, &result)); | ||
| } break; | ||
| case ::parquet::Type::BYTE_ARRAY: { | ||
| auto fn = &TransferDecimal<Decimal64Array, ByteArrayType>; | ||
| RETURN_NOT_OK(fn(reader, pool, value_field, &result)); | ||
| } break; | ||
| case ::parquet::Type::FIXED_LEN_BYTE_ARRAY: { | ||
| auto fn = &TransferDecimal<Decimal64Array, FLBAType>; | ||
| RETURN_NOT_OK(fn(reader, pool, value_field, &result)); | ||
| } break; | ||
| default: | ||
| return Status::Invalid( | ||
| "Physical type for decimal64 must be int32, int64, byte array, or fixed " | ||
| "length binary"); | ||
| } | ||
| } break; | ||
| case ::arrow::Type::DECIMAL128: { | ||
| switch (descr->physical_type()) { | ||
| case ::parquet::Type::INT32: { | ||
| auto fn = DecimalIntegerTransfer<Decimal128Array, Int32Type>; | ||
| RETURN_NOT_OK(fn(reader, pool, value_field, &result)); | ||
| } break; | ||
| case ::parquet::Type::INT64: { | ||
| auto fn = &DecimalIntegerTransfer<Decimal128Array, Int64Type>; | ||
| auto fn = DecimalIntegerTransfer<Decimal128Array, Int64Type>; | ||
| RETURN_NOT_OK(fn(reader, pool, value_field, &result)); | ||
| } break; | ||
| case ::parquet::Type::BYTE_ARRAY: { | ||
|
|
@@ -924,14 +995,14 @@ Status TransferColumnData(RecordReader* reader, | |
| "length binary"); | ||
| } | ||
| } break; | ||
| case ::arrow::Type::DECIMAL256: | ||
| case ::arrow::Type::DECIMAL256: { | ||
| switch (descr->physical_type()) { | ||
| case ::parquet::Type::INT32: { | ||
| auto fn = DecimalIntegerTransfer<Decimal256Array, Int32Type>; | ||
| RETURN_NOT_OK(fn(reader, pool, value_field, &result)); | ||
| } break; | ||
| case ::parquet::Type::INT64: { | ||
| auto fn = &DecimalIntegerTransfer<Decimal256Array, Int64Type>; | ||
| auto fn = DecimalIntegerTransfer<Decimal256Array, Int64Type>; | ||
| RETURN_NOT_OK(fn(reader, pool, value_field, &result)); | ||
| } break; | ||
| case ::parquet::Type::BYTE_ARRAY: { | ||
|
|
@@ -947,8 +1018,7 @@ Status TransferColumnData(RecordReader* reader, | |
| "Physical type for decimal256 must be int32, int64, byte array, or fixed " | ||
| "length binary"); | ||
| } | ||
| break; | ||
|
|
||
| } break; | ||
| case ::arrow::Type::TIMESTAMP: { | ||
| const ::arrow::TimestampType& timestamp_type = | ||
| checked_cast<::arrow::TimestampType&>(*value_field->type()); | ||
|
|
||
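The reader changes in the diff above come down to two rules: which Parquet physical types are accepted for each Arrow decimal width, and how an INT32/INT64 value is widened into the decimal's byte buffer. The following is a minimal standalone sketch of both, not the Arrow implementation. `Physical`, `DecimalKind`, `IsSupported`, and `WidenToLittleEndian` are hypothetical names introduced only for illustration, and the little-endian two's-complement output layout is an assumption made for the example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for ::parquet::Type and ::arrow::Type (not the real enums).
enum class Physical { INT32, INT64, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY, DOUBLE };
enum class DecimalKind { Decimal32, Decimal64, Decimal128, Decimal256 };

// Mirrors the nested switch in TransferColumnData as shown in the diff:
// decimal32 rejects INT64, while the wider decimals accept INT32, INT64,
// BYTE_ARRAY, and FIXED_LEN_BYTE_ARRAY. Anything else is invalid.
bool IsSupported(DecimalKind kind, Physical physical) {
  switch (physical) {
    case Physical::INT32:
    case Physical::BYTE_ARRAY:
    case Physical::FIXED_LEN_BYTE_ARRAY:
      return true;
    case Physical::INT64:
      return kind != DecimalKind::Decimal32;
    default:
      return false;
  }
}

// Widens (or truncates) a sign-extended 64-bit value to `Width` bytes of
// two's complement, least significant byte first (assumed layout).
// Bytes beyond the eighth are filled with the sign byte.
template <std::size_t Width>
std::array<uint8_t, Width> WidenToLittleEndian(int64_t value) {
  std::array<uint8_t, Width> out{};
  const uint8_t fill = value < 0 ? 0xFF : 0x00;
  const uint64_t bits = static_cast<uint64_t>(value);
  for (std::size_t i = 0; i < Width; ++i) {
    out[i] = i < 8 ? static_cast<uint8_t>(bits >> (8 * i)) : fill;
  }
  return out;
}
```

In this sketch, `WidenToLittleEndian<4>` would correspond to a decimal32 slot and `WidenToLittleEndian<16>` to a decimal128 slot. The real code delegates this step to the `ToBytes` method of the Arrow decimal classes, as the diff shows.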
Are the changes to the compute kernels required to support Parquet? I can't see why, but I might be missing something. Otherwise, we should move support for decimal32 and decimal64 in those compute kernels to a different PR and leave this one with only the required Parquet changes.
OK, I see now. The description says this is required for some tests:
"Allow decimal32/64 in Arrow compute vector hash which is needed for some of the existing Parquet tests"
I'm down to split this change into another PR that can cover this support with more tests on the Arrow compute side. But yes, there are a few tests in Parquet that hit the Arrow vector kernel code path.