GH-44010: [C++] Add arrow::RecordBatch::MakeStatisticsArray()
#44252
903e3f4 to 92afc83
92afc83 to b194430
@pitrou @ianmcook What do you think about this? The statistics schema https://github.com/apache/arrow/pull/43553/files#diff-f3758fb6986ea8d24bb2e13c2feb625b68bbd6b93b3fbafd3e2a03dcdc7ba263R86-R95 is compact, but it may be complex to build because it uses many nested types.
5a00c48 to 12b1a97
cpp/src/arrow/array/statistics.h
Outdated
I may have forgotten a bit, but we don't distinguish "bytes" and "utf8" in stats?
Ah, we didn't discuss it...
Let's discuss it in #44579.
We can assume "utf8" here for now.
Can we add a `// TODO(GH-44579)` here?
Yes. I should have added it...
I've added it.
Oh, I forgot to push the commit... I've pushed it now.
cpp/src/arrow/c/abi.h
Outdated
I don't know whether `constexpr std::string_view` or this is better.
We can't use constexpr because this header may be used by C programs.
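To illustrate the constraint: a header shared between C and C++ can only use constructs that are valid in both languages, so string constants are typically exposed as preprocessor macros, and C++ callers can wrap them in `std::string_view` on their side. The following is a minimal sketch of that pattern. The macro name `EXAMPLE_STATISTICS_KEY_NULL_COUNT_EXACT`, its value, and the helper `IsNullCountExactKey` are hypothetical and are not taken from `abi.h`.

```cpp
#include <cassert>
#include <string_view>

// --- Part that could live in a header shared with C programs ---
// A plain macro is valid in both C and C++. The name and value are
// hypothetical, chosen only for illustration.
#define EXAMPLE_STATISTICS_KEY_NULL_COUNT_EXACT "EXAMPLE:null_count:exact"

// --- C++-only consumer code ---
// A C++ caller can still derive a typed, compile-time constant from the macro.
constexpr std::string_view kNullCountExactKey =
    EXAMPLE_STATISTICS_KEY_NULL_COUNT_EXACT;

// Returns true if `key` names the (hypothetical) exact null count statistic.
constexpr bool IsNullCountExactKey(std::string_view key) {
  return key == kNullCountExactKey;
}

// Checked at compile time; requires C++17 or later.
static_assert(IsNullCountExactKey("EXAMPLE:null_count:exact"));
```

With this split, the shared header stays usable from C while C++ code keeps the convenience of `std::string_view`.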
cpp/src/arrow/record_batch.cc
Outdated
So RowCount is also handled as a statistic 🤔?
The statistics array will be passed to the consumer before the consumer receives a record batch.
So this may be useful for the consumer.
But DuckDB doesn't have a row count in its BaseStatistics...: https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/storage/statistics/base_statistics.hpp#L38-L146
This may not be useful...
Let's keep this for now to demonstrate table/record batch level statistics.
8e4d618 to 9c529d1
mapleFU
left a comment
Will take a careful pass tonight
mapleFU
left a comment
Generally LGTM, but I'm not an expert on the C ABI and data layer.
cpp/src/arrow/record_batch.cc
Outdated
So actually this is logically a "set" prepared for items?
Right.
If the same type appears more than once, only the first one is used.
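As a rough illustration of this "set of types" idea, the sketch below collects distinct types in first-seen order and gives each a small integer code, so a repeated type resolves to its first occurrence. It is not the code from `record_batch.cc`: the `TypeSet` class is hypothetical, and plain strings stand in for Arrow data types.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Collects distinct type names in first-seen order and assigns each a
// small integer code. Type names stand in for real data types here.
class TypeSet {
 public:
  // Returns the code for `type_name`, registering it on first sight.
  // A repeated name maps to the code of its first occurrence.
  int GetOrAdd(const std::string& type_name) {
    auto it = codes_.find(type_name);
    if (it != codes_.end()) return it->second;
    const int code = static_cast<int>(types_.size());
    codes_.emplace(type_name, code);
    types_.push_back(type_name);
    return code;
  }

  // Distinct types in the order they were first added.
  const std::vector<std::string>& types() const { return types_; }

 private:
  std::unordered_map<std::string, int> codes_;
  std::vector<std::string> types_;
};
```

Keeping the insertion-ordered vector next to the lookup map means the codes stay stable and the final list of types can be read off directly.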
cpp/src/arrow/record_batch.cc
Outdated
So this is actually a two-phase build: one pass for types and one pass for data?
Right.
I think that's one of the complexities.
So I sent https://lists.apache.org/thread/0c9jftkspvj7yw1lpo73s3vtp6vfjqv8 to the mailing list, but nobody agreed with it. So this complexity is acceptable...
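The two-pass structure discussed here can be sketched without Arrow at all. In the example below, the first pass looks only at value types to decide which union children exist, and the second pass writes type ids, offsets, and child data in a dense-union-like layout. All names (`Value`, `DenseUnionSketch`, `BuildTwoPass`) are hypothetical, and the real implementation uses Arrow builders rather than `std::variant`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <variant>
#include <vector>

// A statistic value can be one of several types, like a union.
using Value = std::variant<int64_t, double, std::string>;

// Dense-union-like layout: one child per type that actually occurs, plus
// a type id and an offset into the matching child for every value.
struct DenseUnionSketch {
  std::vector<std::size_t> child_types;  // variant index of each child
  std::vector<int8_t> type_ids;          // child number for each value
  std::vector<int32_t> offsets;          // position inside that child
  std::vector<std::vector<Value>> children;
};

DenseUnionSketch BuildTwoPass(const std::vector<Value>& values) {
  DenseUnionSketch out;

  // Pass 1: inspect types only. Each type seen gets a child, numbered in
  // first-seen order; later occurrences reuse that number.
  std::vector<int8_t> code_of(std::variant_size_v<Value>, -1);
  for (const auto& v : values) {
    if (code_of[v.index()] < 0) {
      code_of[v.index()] = static_cast<int8_t>(out.child_types.size());
      out.child_types.push_back(v.index());
    }
  }
  out.children.resize(out.child_types.size());

  // Pass 2: the set of children is now fixed, so the data can be appended.
  for (const auto& v : values) {
    const int8_t code = code_of[v.index()];
    auto& child = out.children[static_cast<std::size_t>(code)];
    out.type_ids.push_back(code);
    out.offsets.push_back(static_cast<int32_t>(child.size()));
    child.push_back(v);
  }
  return out;
}
```

The first pass is needed because the shape of the union depends on which types occur, and that is only known after every value has been inspected.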
9c529d1 to ab80bf9
It's a convenience function that converts `arrow::ArrayStatistics` in an `arrow::RecordBatch` to `arrow::Array` for the Arrow C data interface.
ab80bf9 to e93d0f4
kou
left a comment
I'll merge this in a few days if nobody objects.
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit d748ace. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.
…dBatch::MakeStatisticsArray()`'s docstring (#45588)

### Rationale for this change

`arrow::RecordBatch::MakeStatisticsArray()`'s docstring uses https://arrow.apache.org/docs/format/CDataInterfaceStatistics.html, not https://arrow.apache.org/docs/format/StatisticsSchema.html, as the statistics schema URL, because #44252 assumed that we would use #43553 but we finally used #45058.

### What changes are included in this PR?

Fix the URL.

### Are these changes tested?

No tests are needed because this is only a documentation correction.

### Are there any user-facing changes?

No, this is only a documentation correction.

* GitHub Issue: #45587

Authored-by: arash andishgar <arashandishgar1@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Rationale for this change
The statistics schema for the Arrow C data interface (GH-43553) is complex because it uses nested types (struct, map and union). So a reusable implementation for making a statistics array is useful.
What changes are included in this PR?
`arrow::RecordBatch::MakeStatisticsArray()` is a convenience function that converts `arrow::ArrayStatistics` in an `arrow::RecordBatch` to `arrow::Array` for the Arrow C data interface.
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.
`arrow::ArrayStatistics` to `arrow::Array` for the Arrow C data interface #44010