Skip to content

Conversation

@lidavidm
Copy link
Member

@lidavidm lidavidm commented Sep 8, 2021

This adds support for non-nested types to both, except for MonthDayNanoInterval (which is not really sortable). hash_min_max additionally lacks binary/string-like types as they require a different approach (I will file a followup).

@lidavidm lidavidm changed the title ARROW-13882: [C++] Improve min_max/hash_min_max type support. ARROW-13882: [C++] Improve min_max/hash_min_max type support Sep 8, 2021
@github-actions
Copy link

github-actions bot commented Sep 8, 2021

@lidavidm
Copy link
Member Author

In light of the proposal to clarify how DayTime intervals are handled some changes here should probably be reverted.

@lidavidm
Copy link
Member Author

(Removed support for DayTimeInterval since it's only partially orderable.)

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, thank you. Just a couple questions and nits.

@@ -498,6 +498,11 @@ struct ValidateArrayFullImpl {

Status Visit(const StructType& type) {
// Validate children
if (data.child_data.size() != static_cast<size_t>(type.num_fields())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit surprised. This should already be checked by the ValidateArray top-level function, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Odd, I do remember getting a crash for this reason here. But there is indeed a check:

if (type.id() != Type::EXTENSION) {
if (data.child_data.size() != static_cast<size_t>(type.num_fields())) {
return Status::Invalid("Expected ", type.num_fields(),
" child arrays in array "
"of type ",
type.ToString(), ", got ", data.child_data.size());
}
}

T min = true;
T max = false;
bool has_nulls = false;
bool has_values = false;
};

template <typename ArrowType, SimdLevel::type SimdLevel>
struct MinMaxState<ArrowType, SimdLevel, enable_if_integer<ArrowType>> {
struct MinMaxState<ArrowType, SimdLevel, enable_if_physical_integer<ArrowType>> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this will generate different template specializations for types with the same underlying c_type. Is it desirable?
(note that GetValues is called with the output type independently from the ArrowType parameter)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this and the one below to use PhysicalType. I also replaced GetValues with MakeScalar (since now the static type doesn't match the runtime type).

if (value < util::string_view(this->min)) {
this->min = std::string(value);
}
if (value > util::string_view(this->max)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be else if? Presumably value cannot be bother smaller than the min and greater than the max, and it would spare a string comparison.

Status Visit(const DataType&) {
return Status::NotImplemented("No min/max implemented");
Status Visit(const DataType& ty) {
return Status::NotImplemented("No min/max implemented for ", ty);
}

Status Visit(const HalfFloatType&) {
return Status::NotImplemented("No min/max implemented");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update message here as well?

kernel = MakeKernel(std::move(argument_type), MinMaxInit<T>);
return Status::OK();
}

// MSVC2015 apparently doesn't compile this properly if we use
// enable_if_floating_point
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha :-)

@@ -1866,17 +1940,40 @@ Result<std::unique_ptr<KernelState>> MinMaxInit(KernelContext* ctx,

struct GroupedMinMaxFactory {
template <typename T>
enable_if_number<T, Status> Visit(const T&) {
enable_if_physical_integer<T, Status> Visit(const T&) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same remark as for the scalar aggregations: this will generate a separate kernel per logical type?

[0.75, null],
[null, 3]
])"});
// bool, day time interval
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I don't see any interval in the type definition above. Is this comment outdated?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed the comments/docs.

@@ -226,6 +229,9 @@ Notes:

* \(3) Output is a ``{"min": input type, "max": input type}`` Struct.

The month-day-nano interval type is not supported as it cannot be
sorted.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neither is day-time, right?

* \(3) Output is a ``{"min": input type, "max": input type}`` Struct scalar.
* \(3) Output is a ``{"min": input type, "max": input type}`` Struct array.

The month-day-nano interval type is not supported as it cannot be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question.

@lidavidm lidavidm force-pushed the arrow-13882 branch 2 times, most recently from e9087e0 to c05cb96 Compare September 15, 2021 15:22
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thank you @lidavidm

@pitrou pitrou closed this in 2c405da Sep 16, 2021
ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
This adds support for non-nested types to both, except for MonthDayNanoInterval (which is not really sortable). hash_min_max additionally lacks binary/string-like types as they require a different approach (I will file a followup).

Closes apache#11111 from lidavidm/arrow-13882

Lead-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants