-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-13882: [C++] Improve min_max/hash_min_max type support #11111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
In light of the proposal to clarify how DayTime intervals are handled some changes here should probably be reverted. |
234913b to
f49f7cf
Compare
|
(Removed support for DayTimeInterval since it's only partially orderable.) |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice, thank you. Just a couple questions and nits.
cpp/src/arrow/array/validate.cc
Outdated
| @@ -498,6 +498,11 @@ struct ValidateArrayFullImpl { | |||
|
|
|||
| Status Visit(const StructType& type) { | |||
| // Validate children | |||
| if (data.child_data.size() != static_cast<size_t>(type.num_fields())) { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit surprised. This should already be checked by the ValidateArray top-level function, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. Odd, I do remember getting a crash for this reason here. But there is indeed a check:
arrow/cpp/src/arrow/array/validate.cc
Lines 367 to 374 in 87b2fcd
| if (type.id() != Type::EXTENSION) { | |
| if (data.child_data.size() != static_cast<size_t>(type.num_fields())) { | |
| return Status::Invalid("Expected ", type.num_fields(), | |
| " child arrays in array " | |
| "of type ", | |
| type.ToString(), ", got ", data.child_data.size()); | |
| } | |
| } |
| T min = true; | ||
| T max = false; | ||
| bool has_nulls = false; | ||
| bool has_values = false; | ||
| }; | ||
|
|
||
| template <typename ArrowType, SimdLevel::type SimdLevel> | ||
| struct MinMaxState<ArrowType, SimdLevel, enable_if_integer<ArrowType>> { | ||
| struct MinMaxState<ArrowType, SimdLevel, enable_if_physical_integer<ArrowType>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this will generate different template specializations for types with the same underlying c_type. Is it desirable?
(note that GetValues is called with the output type independently from the ArrowType parameter)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this and the one below to use PhysicalType. I also replaced GetValues with MakeScalar (since now the static type doesn't match the runtime type).
| if (value < util::string_view(this->min)) { | ||
| this->min = std::string(value); | ||
| } | ||
| if (value > util::string_view(this->max)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be else if? Presumably value cannot be bother smaller than the min and greater than the max, and it would spare a string comparison.
| Status Visit(const DataType&) { | ||
| return Status::NotImplemented("No min/max implemented"); | ||
| Status Visit(const DataType& ty) { | ||
| return Status::NotImplemented("No min/max implemented for ", ty); | ||
| } | ||
|
|
||
| Status Visit(const HalfFloatType&) { | ||
| return Status::NotImplemented("No min/max implemented"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update message here as well?
| kernel = MakeKernel(std::move(argument_type), MinMaxInit<T>); | ||
| return Status::OK(); | ||
| } | ||
|
|
||
| // MSVC2015 apparently doesn't compile this properly if we use | ||
| // enable_if_floating_point |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha :-)
| @@ -1866,17 +1940,40 @@ Result<std::unique_ptr<KernelState>> MinMaxInit(KernelContext* ctx, | |||
|
|
|||
| struct GroupedMinMaxFactory { | |||
| template <typename T> | |||
| enable_if_number<T, Status> Visit(const T&) { | |||
| enable_if_physical_integer<T, Status> Visit(const T&) { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same remark as for the scalar aggregations: this will generate a separate kernel per logical type?
| [0.75, null], | ||
| [null, 3] | ||
| ])"}); | ||
| // bool, day time interval |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I don't see any interval in the type definition above. Is this comment outdated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed the comments/docs.
docs/source/cpp/compute.rst
Outdated
| @@ -226,6 +229,9 @@ Notes: | |||
|
|
|||
| * \(3) Output is a ``{"min": input type, "max": input type}`` Struct. | |||
|
|
|||
| The month-day-nano interval type is not supported as it cannot be | |||
| sorted. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neither is day-time, right?
docs/source/cpp/compute.rst
Outdated
| * \(3) Output is a ``{"min": input type, "max": input type}`` Struct scalar. | ||
| * \(3) Output is a ``{"min": input type, "max": input type}`` Struct array. | ||
|
|
||
| The month-day-nano interval type is not supported as it cannot be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question.
e9087e0 to
c05cb96
Compare
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thank you @lidavidm
This adds support for non-nested types to both, except for MonthDayNanoInterval (which is not really sortable). hash_min_max additionally lacks binary/string-like types as they require a different approach (I will file a followup). Closes apache#11111 from lidavidm/arrow-13882 Lead-authored-by: David Li <li.davidm96@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
This adds support for non-nested types to both, except for MonthDayNanoInterval (which is not really sortable). hash_min_max additionally lacks binary/string-like types as they require a different approach (I will file a followup).