ARROW-13573: [C++] Support dictionaries natively in case_when #11022
Closed
21 commits
06b428d ARROW-13573: [C++] Add DictScalarFromJSON (lidavidm)
2ec1132 ARROW-13573: [C++] Check that dictionary array has dictionary (lidavidm)
e3b7f93 ARROW-13573: [C++] Handle simple dictionary cases (lidavidm)
7a57c91 ARROW-13573: [C++] Transpose dictionaries in case_when (lidavidm)
17230ee ARROW-13573: [C++] Handle nested dictionaries (lidavidm)
16fe210 ARROW-13691: [C++] Rebase (lidavidm)
8e0e333 ARROW-13573: [C++] Always unify dictionaries (lidavidm)
60ffb02 ARROW-13573: [C++] Handle nulls before unifying, refactor (lidavidm)
a10888c ARROW-13573: [C++] Test dictionaries with nulls (lidavidm)
5cbe6d5 ARROW-13573: [C++] Address feedback (lidavidm)
345388f ARROW-13573: [C++] Add a direct test of dispatch (lidavidm)
8abb93f ARROW-13573: [C++] Fix mistakes (lidavidm)
45563d1 ARROW-13573: [C++] Fix undefined behavior (lidavidm)
5fc1a1f ARROW-13573: [C++] See if turning off unity builds fixes R CI (lidavidm)
f2a0a9e ARROW-13573: [C++] Try bumping timeout (lidavidm)
a5b6078 ARROW-13573: [C++] Should fix MinGW32 (lidavidm)
29a2f87 ARROW-13573: [C++] Make CMAKE_UNITY_BUILD depend on the rtools version (lidavidm)
d81773d ARROW-13573: [C++] RTools40 build is very slow without unity build (lidavidm)
15a64a2 ARROW-13573: [C++] Add clarifying comments (lidavidm)
26230a4 ARROW-13573: [C++] Address feedback (lidavidm)
ba39d83 ARROW-13573: [C++] Explicitly indicate when we expect dictionary-enco… (lidavidm)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -37,6 +37,7 @@ | |
| #include "arrow/util/decimal.h" | ||
| #include "arrow/util/macros.h" | ||
| #include "arrow/util/visibility.h" | ||
| #include "arrow/visitor_inline.h" | ||
|
|
||
| namespace arrow { | ||
|
|
||
|
|
@@ -97,6 +98,17 @@ class ARROW_EXPORT DictionaryMemoTable { | |
| Status GetOrInsert(const UInt16Type*, uint16_t value, int32_t* out); | ||
| Status GetOrInsert(const UInt32Type*, uint32_t value, int32_t* out); | ||
| Status GetOrInsert(const UInt64Type*, uint64_t value, int32_t* out); | ||
| Status GetOrInsert(const DurationType*, int64_t value, int32_t* out); | ||
| Status GetOrInsert(const TimestampType*, int64_t value, int32_t* out); | ||
| Status GetOrInsert(const Date32Type*, int32_t value, int32_t* out); | ||
| Status GetOrInsert(const Date64Type*, int64_t value, int32_t* out); | ||
| Status GetOrInsert(const Time32Type*, int32_t value, int32_t* out); | ||
| Status GetOrInsert(const Time64Type*, int64_t value, int32_t* out); | ||
| Status GetOrInsert(const MonthDayNanoIntervalType*, | ||
| MonthDayNanoIntervalType::MonthDayNanos value, int32_t* out); | ||
| Status GetOrInsert(const DayTimeIntervalType*, | ||
| DayTimeIntervalType::DayMilliseconds value, int32_t* out); | ||
| Status GetOrInsert(const MonthIntervalType*, int32_t value, int32_t* out); | ||
| Status GetOrInsert(const FloatType*, float value, int32_t* out); | ||
| Status GetOrInsert(const DoubleType*, double value, int32_t* out); | ||
|
|
||
|
|
@@ -282,6 +294,73 @@ class DictionaryBuilderBase : public ArrayBuilder { | |
| return indices_builder_.AppendEmptyValues(length); | ||
| } | ||
|
|
||
| Status AppendScalar(const Scalar& scalar, int64_t n_repeats) override { | ||
| if (!scalar.is_valid) return AppendNulls(n_repeats); | ||
|
|
||
| const auto& dict_ty = internal::checked_cast<const DictionaryType&>(*scalar.type); | ||
| const DictionaryScalar& dict_scalar = | ||
| internal::checked_cast<const DictionaryScalar&>(scalar); | ||
| const auto& dict = internal::checked_cast<const typename TypeTraits<T>::ArrayType&>( | ||
| *dict_scalar.value.dictionary); | ||
| ARROW_RETURN_NOT_OK(Reserve(n_repeats)); | ||
| switch (dict_ty.index_type()->id()) { | ||
| case Type::UINT8: | ||
| return AppendScalarImpl<UInt8Type>(dict, *dict_scalar.value.index, n_repeats); | ||
| case Type::INT8: | ||
| return AppendScalarImpl<Int8Type>(dict, *dict_scalar.value.index, n_repeats); | ||
| case Type::UINT16: | ||
| return AppendScalarImpl<UInt16Type>(dict, *dict_scalar.value.index, n_repeats); | ||
| case Type::INT16: | ||
| return AppendScalarImpl<Int16Type>(dict, *dict_scalar.value.index, n_repeats); | ||
| case Type::UINT32: | ||
| return AppendScalarImpl<UInt32Type>(dict, *dict_scalar.value.index, n_repeats); | ||
| case Type::INT32: | ||
| return AppendScalarImpl<Int32Type>(dict, *dict_scalar.value.index, n_repeats); | ||
| case Type::UINT64: | ||
| return AppendScalarImpl<UInt64Type>(dict, *dict_scalar.value.index, n_repeats); | ||
| case Type::INT64: | ||
| return AppendScalarImpl<Int64Type>(dict, *dict_scalar.value.index, n_repeats); | ||
| default: | ||
| return Status::TypeError("Invalid index type: ", dict_ty); | ||
| } | ||
| return Status::OK(); | ||
| } | ||
|
|
||
| Status AppendScalars(const ScalarVector& scalars) override { | ||
| for (const auto& scalar : scalars) { | ||
| ARROW_RETURN_NOT_OK(AppendScalar(*scalar, /*n_repeats=*/1)); | ||
| } | ||
| return Status::OK(); | ||
| } | ||
|
|
||
| Status AppendArraySlice(const ArrayData& array, int64_t offset, int64_t length) final { | ||
| // Visit the indices and insert the unpacked values. | ||
| const auto& dict_ty = internal::checked_cast<const DictionaryType&>(*array.type); | ||
| const typename TypeTraits<T>::ArrayType dict(array.dictionary); | ||
| ARROW_RETURN_NOT_OK(Reserve(length)); | ||
| switch (dict_ty.index_type()->id()) { | ||
|
||
| case Type::UINT8: | ||
| return AppendArraySliceImpl<uint8_t>(dict, array, offset, length); | ||
| case Type::INT8: | ||
| return AppendArraySliceImpl<int8_t>(dict, array, offset, length); | ||
| case Type::UINT16: | ||
| return AppendArraySliceImpl<uint16_t>(dict, array, offset, length); | ||
| case Type::INT16: | ||
| return AppendArraySliceImpl<int16_t>(dict, array, offset, length); | ||
| case Type::UINT32: | ||
| return AppendArraySliceImpl<uint32_t>(dict, array, offset, length); | ||
| case Type::INT32: | ||
| return AppendArraySliceImpl<int32_t>(dict, array, offset, length); | ||
| case Type::UINT64: | ||
| return AppendArraySliceImpl<uint64_t>(dict, array, offset, length); | ||
| case Type::INT64: | ||
| return AppendArraySliceImpl<int64_t>(dict, array, offset, length); | ||
| default: | ||
| return Status::TypeError("Invalid index type: ", dict_ty); | ||
| } | ||
| return Status::OK(); | ||
| } | ||
|
|
||
| /// \brief Insert values into the dictionary's memo, but do not append any | ||
| /// indices. Can be used to initialize a new builder with known dictionary | ||
| /// values | ||
|
|
@@ -376,6 +455,37 @@ class DictionaryBuilderBase : public ArrayBuilder { | |
| } | ||
|
|
||
| protected: | ||
| template <typename c_type> | ||
| Status AppendArraySliceImpl(const typename TypeTraits<T>::ArrayType& dict, | ||
| const ArrayData& array, int64_t offset, int64_t length) { | ||
| const c_type* values = array.GetValues<c_type>(1) + offset; | ||
| return VisitBitBlocks( | ||
| array.buffers[0], array.offset + offset, length, | ||
| [&](const int64_t position) { | ||
| const int64_t index = static_cast<int64_t>(values[position]); | ||
| if (dict.IsValid(index)) { | ||
| return Append(dict.GetView(index)); | ||
| } | ||
| return AppendNull(); | ||
| }, | ||
| [&]() { return AppendNull(); }); | ||
| } | ||
|
|
||
| template <typename IndexType> | ||
| Status AppendScalarImpl(const typename TypeTraits<T>::ArrayType& dict, | ||
| const Scalar& index_scalar, int64_t n_repeats) { | ||
| using ScalarType = typename TypeTraits<IndexType>::ScalarType; | ||
| const auto index = internal::checked_cast<const ScalarType&>(index_scalar).value; | ||
| if (index_scalar.is_valid && dict.IsValid(index)) { | ||
| const auto& value = dict.GetView(index); | ||
| for (int64_t i = 0; i < n_repeats; i++) { | ||
| ARROW_RETURN_NOT_OK(Append(value)); | ||
|
Member (review comment):
Not for this PR, but it sounds like offering a two-step API on DictionaryBuilder would allow for performance improvements:
/// Ensure `value` is in the dict, and return its index, but doesn't append it
Result<int64_t> Encode(c_type value);
/// Append the given dictionary index
Status AppendIndex(int64_t index);
Status AppendIndices(int64_t index, int64_t nrepeats);
Member, Author (reply):
I filed ARROW-14042.
||
| } | ||
| return Status::OK(); | ||
| } | ||
| return AppendNulls(n_repeats); | ||
| } | ||
|
|
||
| Status FinishInternal(std::shared_ptr<ArrayData>* out) override { | ||
| std::shared_ptr<ArrayData> dictionary; | ||
| ARROW_RETURN_NOT_OK(FinishWithDictOffset(/*offset=*/0, out, &dictionary)); | ||
|
|
||
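The diff above decodes each incoming index through the source dictionary and re-appends the value, and the review comment suggests splitting "look up the value" from "append the index" so that a repeated scalar is hashed only once. The following is a minimal standalone sketch of both ideas. It does not use Arrow: `ToyDictBuilder`, `AppendSlice`, and `BuildExample` are hypothetical names invented for illustration, the `Encode`/`AppendIndices` signatures are simplified from the ones proposed in the comment (no `Status`/`Result`), and strings stand in for the dictionary value type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a dictionary builder; NOT Arrow's DictionaryBuilder.
class ToyDictBuilder {
 public:
  // Step 1 (assumed shape of the proposed API): ensure `value` is in the
  // dictionary and return its index, without appending anything.
  int32_t Encode(const std::string& value) {
    auto it = memo_.find(value);
    if (it != memo_.end()) return it->second;
    const int32_t index = static_cast<int32_t>(dictionary_.size());
    memo_.emplace(value, index);
    dictionary_.push_back(value);
    return index;
  }

  // Step 2: append an already-encoded index n_repeats times, with no hashing.
  void AppendIndices(int32_t index, int64_t n_repeats) {
    indices_.insert(indices_.end(), static_cast<std::size_t>(n_repeats),
                    std::optional<int32_t>(index));
  }

  // One-step append: a memo lookup on every call.
  void Append(const std::string& value) { AppendIndices(Encode(value), 1); }
  void AppendNull() { indices_.push_back(std::nullopt); }

  // Same idea as AppendArraySlice in the diff: map each source index through
  // the source dictionary and re-insert the value, so the output refers to
  // this builder's own dictionary. A null index or a null dictionary entry
  // becomes a null in the output.
  void AppendSlice(const std::vector<std::optional<std::string>>& src_dict,
                   const std::vector<std::optional<int32_t>>& src_indices,
                   std::size_t offset, std::size_t length) {
    for (std::size_t i = offset; i < offset + length; ++i) {
      const auto& idx = src_indices[i];
      if (idx.has_value() && src_dict[static_cast<std::size_t>(*idx)].has_value()) {
        Append(*src_dict[static_cast<std::size_t>(*idx)]);
      } else {
        AppendNull();
      }
    }
  }

  const std::vector<std::string>& dictionary() const { return dictionary_; }
  const std::vector<std::optional<int32_t>>& indices() const { return indices_; }

 private:
  std::unordered_map<std::string, int32_t> memo_;
  std::vector<std::string> dictionary_;
  std::vector<std::optional<int32_t>> indices_;
};

// Builds a small example: one plain append, one re-encoded slice, and one
// repeated value appended through the two-step path.
ToyDictBuilder BuildExample() {
  ToyDictBuilder builder;
  builder.Append("b");  // the builder's dictionary starts as ["b"]

  // Source array with a differently ordered dictionary and some nulls.
  const std::vector<std::optional<std::string>> src_dict = {"a", std::nullopt, "b"};
  const std::vector<std::optional<int32_t>> src_indices = {2, 0, std::nullopt, 1};
  builder.AppendSlice(src_dict, src_indices, /*offset=*/0, /*length=*/4);

  // Two-step path: one lookup, then three index appends.
  const int32_t index = builder.Encode("a");
  builder.AppendIndices(index, /*n_repeats=*/3);
  return builder;
}
```

In this toy model the one-step `Append` performs a hash lookup per element, whereas `Encode` followed by `AppendIndices` performs a single lookup regardless of the repeat count, which is the saving the review comment points at for the repeated `Append(value)` loop in `AppendScalarImpl`.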
You mean it segfaults during compilation?
It segfaults in the tests. I wasn't really able to debug this on Windows; it disappears once you build with debuginfo.