-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-41909: [C++] Add arrow::ArrayStatistics #43273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
bed8e3d
f7857b1
1ee4271
97a013c
0fbe9ef
b6b4754
2af686c
4639d58
e935d66
2c60782
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| // Licensed to the Apache Software Foundation (ASF) under one | ||
| // or more contributor license agreements. See the NOTICE file | ||
| // distributed with this work for additional information | ||
| // regarding copyright ownership. The ASF licenses this file | ||
| // to you under the Apache License, Version 2.0 (the | ||
| // "License"); you may not use this file except in compliance | ||
| // with the License. You may obtain a copy of the License at | ||
| // | ||
| // http://www.apache.org/licenses/LICENSE-2.0 | ||
| // | ||
| // Unless required by applicable law or agreed to in writing, | ||
| // software distributed under the License is distributed on an | ||
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| // KIND, either express or implied. See the License for the | ||
| // specific language governing permissions and limitations | ||
| // under the License. | ||
|
|
||
| // This empty .cc file is for embedding not inlined symbols in | ||
| // arrow::ArrayStatistics into libarrow. | ||
|
|
||
| #include "arrow/array/statistics.h" |
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,76 @@ | ||||||||||
| // Licensed to the Apache Software Foundation (ASF) under one | ||||||||||
| // or more contributor license agreements. See the NOTICE file | ||||||||||
| // distributed with this work for additional information | ||||||||||
| // regarding copyright ownership. The ASF licenses this file | ||||||||||
| // to you under the Apache License, Version 2.0 (the | ||||||||||
| // "License"); you may not use this file except in compliance | ||||||||||
| // with the License. You may obtain a copy of the License at | ||||||||||
| // | ||||||||||
| // http://www.apache.org/licenses/LICENSE-2.0 | ||||||||||
| // | ||||||||||
| // Unless required by applicable law or agreed to in writing, | ||||||||||
| // software distributed under the License is distributed on an | ||||||||||
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||||||||||
| // KIND, either express or implied. See the License for the | ||||||||||
| // specific language governing permissions and limitations | ||||||||||
| // under the License. | ||||||||||
|
|
||||||||||
| #pragma once | ||||||||||
|
|
||||||||||
| #include <cstdint> | ||||||||||
| #include <optional> | ||||||||||
| #include <string> | ||||||||||
| #include <string_view> | ||||||||||
| #include <variant> | ||||||||||
|
|
||||||||||
| #include "arrow/util/float16.h" | ||||||||||
| #include "arrow/util/visibility.h" | ||||||||||
|
|
||||||||||
| namespace arrow { | ||||||||||
|
|
||||||||||
| /// \brief Statistics for an Array | ||||||||||
| /// | ||||||||||
| /// Apache Arrow format doesn't have statistics but data source such | ||||||||||
| /// as Apache Parquet may have statistics. Statistics associated with | ||||||||||
| /// data source can be read unified API via this class. | ||||||||||
| struct ARROW_EXPORT ArrayStatistics { | ||||||||||
| using ValueType = | ||||||||||
| std::variant<bool, int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, | ||||||||||
| uint64_t, util::Float16, float, double, std::string, std::string_view>; | ||||||||||
|
Comment on lines
+38
to
+39
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is too much IMO.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed with the above.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OK. Let's simplify this: #43578 |
||||||||||
|
|
||||||||||
| ArrayStatistics() = default; | ||||||||||
| ~ArrayStatistics() = default; | ||||||||||
|
Comment on lines
+41
to
+42
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are these required at all? Perhaps for
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, they may not be needed. I didn't care about them.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||||||||||
|
|
||||||||||
| /// \brief The number of null values, may not be set | ||||||||||
| std::optional<int64_t> null_count = std::nullopt; | ||||||||||
|
|
||||||||||
| /// \brief The number of distinct values, may not be set | ||||||||||
| std::optional<int64_t> distinct_count = std::nullopt; | ||||||||||
|
|
||||||||||
| /// \brief The minimum value, may not be set | ||||||||||
| std::optional<ValueType> min = std::nullopt; | ||||||||||
|
||||||||||
|
|
||||||||||
| /// \brief Whether the minimum value is exact or not, may not be set | ||||||||||
| std::optional<bool> is_min_exact = std::nullopt; | ||||||||||
|
||||||||||
| /** If true, max_value is the actual maximum value for a column */ | |
| 7: optional bool is_max_value_exact; | |
| /** If true, min_value is the actual minimum value for a column */ | |
| 8: optional bool is_min_value_exact; |
We may need std::optional to support Parquet statistics. Can we keep this std::optional for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the reason it's optional in Parquet is simply for compatibility with older files. @wgtmac Is it right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, if it's unknown, then it's not exact. The third state is not useful IMHO.
I agree with @pitrou. We can simply mark it as not exact if this field is missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
Let's remove std::optional: #43594
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,103 @@ | ||
| // Licensed to the Apache Software Foundation (ASF) under one | ||
| // or more contributor license agreements. See the NOTICE file | ||
| // distributed with this work for additional information | ||
| // regarding copyright ownership. The ASF licenses this file | ||
| // to you under the Apache License, Version 2.0 (the | ||
| // "License"); you may not use this file except in compliance | ||
| // with the License. You may obtain a copy of the License at | ||
| // | ||
| // http://www.apache.org/licenses/LICENSE-2.0 | ||
| // | ||
| // Unless required by applicable law or agreed to in writing, | ||
| // software distributed under the License is distributed on an | ||
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| // KIND, either express or implied. See the License for the | ||
| // specific language governing permissions and limitations | ||
| // under the License. | ||
|
|
||
| #include <gtest/gtest.h> | ||
|
|
||
| #include "arrow/array/statistics.h" | ||
|
|
||
| namespace arrow { | ||
|
|
||
| TEST(ArrayStatisticsTest, TestNullCount) { | ||
| ArrayStatistics statistics; | ||
| ASSERT_FALSE(statistics.null_count.has_value()); | ||
| statistics.null_count = 29; | ||
| ASSERT_TRUE(statistics.null_count.has_value()); | ||
| ASSERT_EQ(29, statistics.null_count.value()); | ||
| } | ||
|
|
||
| TEST(ArrayStatisticsTest, TestDistinctCount) { | ||
| ArrayStatistics statistics; | ||
| ASSERT_FALSE(statistics.distinct_count.has_value()); | ||
| statistics.distinct_count = 29; | ||
| ASSERT_TRUE(statistics.distinct_count.has_value()); | ||
| ASSERT_EQ(29, statistics.distinct_count.value()); | ||
| } | ||
|
|
||
| TEST(ArrayStatisticsTest, TestMin) { | ||
| ArrayStatistics statistics; | ||
| ASSERT_FALSE(statistics.min.has_value()); | ||
| ASSERT_FALSE(statistics.is_min_exact.has_value()); | ||
| statistics.min = static_cast<int32_t>(29); | ||
| statistics.is_min_exact = true; | ||
| ASSERT_TRUE(statistics.min.has_value()); | ||
| ASSERT_TRUE(std::holds_alternative<int32_t>(statistics.min.value())); | ||
| ASSERT_EQ(29, std::get<int32_t>(statistics.min.value())); | ||
| ASSERT_TRUE(statistics.is_min_exact.has_value()); | ||
| ASSERT_TRUE(statistics.is_min_exact.value()); | ||
| } | ||
|
|
||
| TEST(ArrayStatisticsTest, TestMax) { | ||
| ArrayStatistics statistics; | ||
| ASSERT_FALSE(statistics.max.has_value()); | ||
| ASSERT_FALSE(statistics.is_max_exact.has_value()); | ||
| statistics.max = std::string("hello"); | ||
| statistics.is_max_exact = false; | ||
| ASSERT_TRUE(statistics.max.has_value()); | ||
| ASSERT_TRUE(std::holds_alternative<std::string>(statistics.max.value())); | ||
| ASSERT_EQ("hello", std::get<std::string>(statistics.max.value())); | ||
| ASSERT_TRUE(statistics.is_max_exact.has_value()); | ||
| ASSERT_FALSE(statistics.is_max_exact.value()); | ||
| } | ||
|
|
||
| TEST(ArrayStatisticsTest, TestEquality) { | ||
| ArrayStatistics statistics1; | ||
| ArrayStatistics statistics2; | ||
|
|
||
| ASSERT_EQ(statistics1, statistics2); | ||
|
|
||
| statistics1.null_count = 29; | ||
| ASSERT_NE(statistics1, statistics2); | ||
| statistics2.null_count = 29; | ||
| ASSERT_EQ(statistics1, statistics2); | ||
|
|
||
| statistics1.distinct_count = 2929; | ||
| ASSERT_NE(statistics1, statistics2); | ||
| statistics2.distinct_count = 2929; | ||
| ASSERT_EQ(statistics1, statistics2); | ||
|
|
||
| statistics1.min = std::string_view("world"); | ||
| ASSERT_NE(statistics1, statistics2); | ||
| statistics2.min = std::string_view("world"); | ||
| ASSERT_EQ(statistics1, statistics2); | ||
|
|
||
| statistics1.is_min_exact = false; | ||
| ASSERT_NE(statistics1, statistics2); | ||
| statistics2.is_min_exact = false; | ||
| ASSERT_EQ(statistics1, statistics2); | ||
|
|
||
| statistics1.max = arrow::util::Float16(-29); | ||
| ASSERT_NE(statistics1, statistics2); | ||
| statistics2.max = arrow::util::Float16(-29); | ||
| ASSERT_EQ(statistics1, statistics2); | ||
|
|
||
| statistics1.is_max_exact = true; | ||
| ASSERT_NE(statistics1, statistics2); | ||
| statistics2.is_max_exact = true; | ||
| ASSERT_EQ(statistics1, statistics2); | ||
| } | ||
|
|
||
| } // namespace arrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's meant to be usable with Parquet, then we also need to represent string/binary values (Parquet can store max/min stats of BYTE_ARRAY columns).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right.
It's a discussion point. See also: #42133 (comment)
We didn't have a good idea for string/binary values. But I have an idea now. How about adding both of
std::stringandstd::string_viewfor them? If we can store min/max values, we can usestd::string_viewto avoid copy of them. Otherwise, we can usestd::stringfor them.