GH-43797: [C++] Attach arrow::ArrayStatistics to arrow::ArrayData
#43801
@pitrou @bkietz @felipecrv What do you think about this approach?
bkietz
left a comment
I think making statistics a "lazy" member of ArrayData is the right approach. It should probably be wrapped in a pointer, though: this will ensure that new members can be added to ArrayStatistics without impacting the size of ArrayData
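The size argument in this comment can be illustrated with a minimal sketch. All types below (`StatsV1`, `StatsV2`, `EmbeddedData`, `PointerData`) are hypothetical stand-ins, not the real Arrow definitions; the sketch only assumes that the statistics struct may gain members over time.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <optional>

// Hypothetical statistics struct as it might look today.
struct StatsV1 {
  std::optional<int64_t> null_count;
};

// Hypothetical later revision with more members.
struct StatsV2 {
  std::optional<int64_t> null_count;
  std::optional<int64_t> distinct_count;
  std::optional<double> min;
  std::optional<double> max;
};

// Variant A: statistics embedded by value in the data struct.
template <typename Stats>
struct EmbeddedData {
  int64_t length = 0;
  Stats statistics;
};

// Variant B: statistics held through a pointer that stays null until set.
template <typename Stats>
struct PointerData {
  int64_t length = 0;
  std::shared_ptr<Stats> statistics;
};

// Embedding makes the data struct grow whenever the statistics struct grows.
static_assert(sizeof(EmbeddedData<StatsV1>) < sizeof(EmbeddedData<StatsV2>),
              "embedded layout grows with the statistics struct");

// Behind a pointer, the data struct keeps the same size.
static_assert(sizeof(PointerData<StatsV1>) == sizeof(PointerData<StatsV2>),
              "pointer layout is independent of the statistics struct");

// The pointer also gives the "lazy" behaviour: nothing is allocated
// until some producer decides to attach statistics.
inline bool HasStatistics(const PointerData<StatsV2>& data) {
  return data.statistics != nullptr;
}
```

The cost of the pointer variant is one extra indirection and a null check at every read, which is the usual trade for keeping a frequently copied struct small.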
8ee9b69 to 4d5b234
OK. I've changed to a pointer instead of embedding
What raises my curiosity is that
Most statistics would be invalidated by slicing, such as the distinct and null counts. The minimum and maximum could be preserved, but would have to be demoted to inexact until recomputed.
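The slicing rule described in this comment can be sketched as follows. The `Stats` struct and `StatsForSlice` helper are hypothetical and only loosely modelled on the idea of `arrow::ArrayStatistics`; the field names and the `int64_t` value type are assumptions made for illustration, and nothing here is taken from the PR's diff.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical, simplified statistics record.
struct Stats {
  std::optional<int64_t> null_count;
  std::optional<int64_t> distinct_count;
  std::optional<int64_t> min;
  bool is_min_exact = false;
  std::optional<int64_t> max;
  bool is_max_exact = false;
};

// Derive statistics for a slice from the parent's statistics
// without rescanning the sliced values.
inline Stats StatsForSlice(const Stats& parent) {
  Stats sliced;
  // Null and distinct counts depend on which rows survive the slice,
  // so they are left unset rather than copied.
  // The parent's minimum and maximum still bound the slice, but the
  // slice may no longer contain those values, so they become inexact.
  sliced.min = parent.min;
  sliced.is_min_exact = false;
  sliced.max = parent.max;
  sliced.is_max_exact = false;
  return sliced;
}
```

A consumer that needs exact values after slicing would have to recompute them from the sliced data.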
…yData` If we can attach associated statistics to an array via `ArrayData`, we can use it in later processes such as query planning.
4d5b234 to 942f757
Good catch! I forgot the I noticed that
mapleFU
left a comment
Generally LGTM, but I'm not familiar with the details of this.
If nobody objects, I'll merge this next week.
mapleFU
left a comment
LGTM
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 4ed5a14. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 29 possible false positives for unstable benchmarks that are known to sometimes produce them.
…yData` (apache#43801)

### Rationale for this change

If we can attach associated statistics to an array via `ArrayData`, we can use it in later processes such as query planning.

If `ArrayData` not `Array` has statistics, we can use statistics in computing kernels.

There was a concern that associated `arrow::ArrayStatistics` may be outdated if `arrow::ArrayData` is mutated after attaching `arrow::ArrayStatistics`. But `arrow::ArrayData` isn't mutable after the first population. So `arrow::ArrayStatistics` will not be outdated. We can require mutators to take responsibility for statistics.

### What changes are included in this PR?

* Add `arrow::ArrayData::statistics`
* Add `arrow::Array::statistics()` to get statistics attached in `arrow::ArrayData`

This doesn't provide a new `arrow::ArrayData` constructor (`arrow::ArrayData::Make()`) that accepts `arrow::ArrayStatistics`. We can change `arrow::ArrayData::statistics` after we create `arrow::ArrayData`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

`arrow::Array::statistics()` is a new public API.

* GitHub Issue: apache#43797

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
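As a rough illustration of the rationale, the sketch below attaches statistics to a data object after it has been created and then uses them to decide whether a scan can be skipped, the kind of decision a query planner might make. `Stats`, `Data`, and `MayContainGreaterThan` are hypothetical names invented for this example; they are not Arrow APIs, and the pruning rule is only one conservative choice.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <optional>
#include <vector>

// Hypothetical, simplified statistics record.
struct Stats {
  std::optional<int64_t> max;
  bool is_max_exact = false;
};

// Hypothetical stand-in for a data object with optional statistics.
struct Data {
  std::vector<int64_t> values;
  std::shared_ptr<Stats> statistics;  // null until a producer attaches it
};

// Returns false only when the statistics prove that no value can
// satisfy `value > threshold`; otherwise the caller must scan.
inline bool MayContainGreaterThan(const Data& data, int64_t threshold) {
  const auto& stats = data.statistics;
  // No statistics, no maximum, or an inexact maximum: nothing is proven.
  if (!stats || !stats->max.has_value() || !stats->is_max_exact) {
    return true;
  }
  return *stats->max > threshold;
}
```

Because the statistics are attached after construction, a reader (for example, one that already has per-column metadata from a file footer) can fill them in without a dedicated constructor overload.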
/// object which backs this Array.
///
/// \return const ArrayStatistics&
std::shared_ptr<ArrayStatistics> statistics() const { return data_->statistics; }
The return type should be std::shared_ptr<ArrayStatistics>& and we should probably add const ArrayStatistics to the shared_ptr so that callers can't mutate the statistics through the shared pointer.
I agree with you. I don't know why I missed const and & here...
Let's add them: GH-44590
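One possible shape of the accessor suggested in this thread is sketched below. `Stats`, `Data`, and `ArrayLike` are hypothetical stand-ins, and this is not necessarily what GH-44590 ended up changing. The sketch assumes the member itself is stored as `std::shared_ptr<const Stats>`, because a reference to a `shared_ptr<const T>` cannot safely be returned from a `shared_ptr<T>` member (the conversion would create a temporary).

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <optional>
#include <type_traits>
#include <utility>

// Hypothetical, simplified statistics record.
struct Stats {
  std::optional<int64_t> null_count;
};

// Hypothetical data struct; the pointee is const so readers cannot mutate it.
struct Data {
  std::shared_ptr<const Stats> statistics;
};

class ArrayLike {
 public:
  explicit ArrayLike(std::shared_ptr<Data> data) : data_(std::move(data)) {}

  // Returning a const reference avoids copying the shared_ptr
  // (and the atomic reference-count update a copy implies).
  const std::shared_ptr<const Stats>& statistics() const {
    return data_->statistics;
  }

 private:
  std::shared_ptr<Data> data_;
};

// Dereferencing the accessor's result yields a const object.
static_assert(
    std::is_const_v<std::remove_reference_t<
        decltype(*std::declval<const ArrayLike&>().statistics())>>,
    "callers only get read access to the statistics");
```

A producer can still build the statistics through a mutable `std::shared_ptr<Stats>` and assign it to the member once it is complete, since `shared_ptr<Stats>` converts implicitly to `shared_ptr<const Stats>`.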