-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-38837: [Format] Add the specification to pass statistics through the Arrow C data interface #43553
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@github-actions crossbow submit preview-docs |
|
|
|
I'm not a native English speaker. Wording suggestions are very welcome. I'll add examples after I implement a convenient API to C++. |
|
Revision: 22336f4d30e12cae6efb5961e4f822992576bcc5 Submitted crossbow builds: ursacomputing/crossbow @ actions-28c2a45b3d
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about the Arrow IPC format? Can you add a sentence here that explains why we do not recommend using this to pass statistics over Arrow IPC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. This may fit the Arrow IPC format. (A producer sends data and statistics as 2 separated the Arrow IPC format data.)
But the Arrow IPC format can use more approaches. For example, the Arrow IPC format can have metadata for each record batch data: https://arrow.apache.org/docs/format/Columnar.html#encapsulated-message-format (The Arrow C data can't have metadata for ArrowArray.)
The Arrow IPC format can be used with other mechanisms such as Arrow Flight and ADBC.
So this may not be the best approach for the Arrow IPC format. We should discuss this use case with the Arrow IPC format separately.
I'll add something to here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I should have done it.
|
@pdet please take a look and add comments if you have any, thanks! |
|
The format and contents LGTM! I was just slightly confused for one second that the second mapping is the value @Tmonster I think the proposed |
kou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ianmcook Thanks for your suggestions! I've merged all of them!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. This may fit the Arrow IPC format. (A producer sends data and statistics as 2 separated the Arrow IPC format data.)
But the Arrow IPC format can use more approaches. For example, the Arrow IPC format can have metadata for each record batch data: https://arrow.apache.org/docs/format/Columnar.html#encapsulated-message-format (The Arrow C data can't have metadata for ArrowArray.)
The Arrow IPC format can be used with other mechanisms such as Arrow Flight and ADBC.
So this may not be the best approach for the Arrow IPC format. We should discuss this use case with the Arrow IPC format separately.
I'll add something to here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I should have done it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are the same as ADBC's one: https://github.com/apache/arrow-adbc/blob/05fa60d643c66b572d426ab28aa78fc52e9520e8/c/include/arrow-adbc/adbc.h#L524-L570
Ah, it makes sense. I'll improve it. Thanks. |
|
@github-actions crossbow submit preview-docs |
I've added the original DuckDB use case. DuckDB may be able to get statistics without I'll start a discussion on the mailing list tomorrow. |
|
The discussion thread: https://lists.apache.org/thread/b6chzlyn95rztoybs39b6olz907g12gj |
|
Based on the mailing list discussion, we don't limit this specification to only the Arrow C data interface. I close this in favor of GH-45058. |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for being (very) late to respond / review this proposal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have had trouble in the arrow-rs implementation or other places where schema's contain names but they aren't standardized
For example, we have an embedded field name for Lists and some implementations use "item" and some "element" which causes spurious schema mismatch errors
Therefore I also recommend removing field names unless there is some good reason for doing so (it isn't clear to me from the text why there are arbitrary field namds)
| * - key | ||
| - ``dictionary<indices: int32, dictionary: utf8>`` | ||
| - ``false`` | ||
| - Statistics key is string. Dictionary is used for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could be more explicitly about what the key means (it seems like it is the "statistics type" )?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
This is also pointed out by @felipecrv :
- GH-38837: [Format] Add the specification for statistics schema #45058 (review)
- GH-38837: [Format] Add the specification for statistics schema #45058 (comment)
#45058 uses "statistics name" instead of "statistics key".
|
|
||
| .. _c-data-interface-statistics-key: | ||
|
|
||
| Statistics key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit is I would find the term "statistics type" rather than "statistics key" easier to understand
| > | ||
| > | ||
|
|
||
| Statistics array:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't understand this example. I thought the statistics were structs, so I would have expected the data to look something like this (perhaps we could give the "logical contents" and then the specific array encoding):
[
// first struct element
{
column: null, # record batch
statistics: {
"ARROW:row_count:exact": 0
}
},
{
column: 0, # vendor_id
statistics: {
"ARROW:null_count:exact": 0,
"ARROW:distinct_count:exact": 2,
"ARROW:max_value:exact": 5,
"ARROW:min_value:exact": 1,
}
},
...
]
I can help work out the example if people think this is a good idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, it makes sense.
I used physical representation (columnar representation) here but logical representation (row based representation) may be easy to understand.
How about showing both of them? (We keep the current representation and add row based representation like you suggested.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's continue this in the current PR: https://github.com/apache/arrow/pull/45058/files#r1896967442
| vendor_id: [5, 1, 5, 1, 5] | ||
| passenger_count: [1, 1, 2, 0, null] | ||
|
|
||
| Statistics schema:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is probably too late, but I figured I would point out another format we use for statistics in DataFusion is a columnar form rather than a nested structure
https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html
For example to represent this data in DataFusion it would be
vendor_id::null_count |
vendor_id::min |
vendor_id::max |
... | passenger_count::max |
|---|---|---|---|---|
| 0 | 1 | 5 | ... | 2 |
The benefit of this encoding is that it can be used to quickly evaluate predicates on ranges (e.g. figure out if vendor_id = 6 could ever be true)
We use this format to read statistics from the parquet files (see ParquetStatisticsConverter to rule out row groups)
It does result in potentially very wide schemas however
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sharing the existing statistics representation.
This is also shared by @tustvold in the initial discussion: https://lists.apache.org/thread/hd79kp59796sqmwo24vmsnbk8t6x7h8g
We chose the current approach than the DataFusion approach because the current approach will be easy to add support for not pre-defined (application-specific) statistics names (statistics types): https://lists.apache.org/thread/g362w785ntcdp0hjjqr06lk5gl4bfrj0
…dBatch::MakeStatisticsArray()`'s docstring (#45588) ### Rationale for this change `arrow::RecordBatch::MakeStatisticsArray()`'s docstring uses https://arrow.apache.org/docs/format/CDataInterfaceStatistics.html not https://arrow.apache.org/docs/format/StatisticsSchema.html for statistics schema URL. Because #44252 assumed that we use #43553 but we use #45058 finally. ### What changes are included in this PR? Fix URL. ### Are these changes tested? It does not need since just a correction in document ### Are there any user-facing changes? No, Just a correction in document * GitHub Issue: #45587 Authored-by: arash andishgar <arashandishgar1@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…rrow::ArrayStatistics` (#44816) ### Rationale for this change I want to use this in the C data interface statistics documents: apache/arrow#43553 ### What changes are included in this PR? Add an executable that reads an Apache Parquet file and dumps statistics read as `arrow::ArrayStatistics`. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: #44815 Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…dBatch::MakeStatisticsArray()`'s docstring (#45588) ### Rationale for this change `arrow::RecordBatch::MakeStatisticsArray()`'s docstring uses https://arrow.apache.org/docs/format/CDataInterfaceStatistics.html not https://arrow.apache.org/docs/format/StatisticsSchema.html for statistics schema URL. Because apache/arrow#44252 assumed that we use apache/arrow#43553 but we use apache/arrow#45058 finally. ### What changes are included in this PR? Fix URL. ### Are these changes tested? It does not need since just a correction in document ### Are there any user-facing changes? No, Just a correction in document * GitHub Issue: #45587 Authored-by: arash andishgar <arashandishgar1@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Rationale for this change
Statistics are useful for fast query processing. Many query engines
use statistics to optimize their query plan.
Apache Arrow format doesn't have statistics but other formats that can
be read as Apache Arrow data may have statistics. For example, Apache
Parquet C++ can read Apache Parquet file as Apache Arrow data and
Apache Parquet file may have statistics.
One of the Arrow C data interface use cases is the following:
Arrow C data interface
If module A can pass the statistics associated with the Apache Parquet
file to module B through the Arrow C data interface, module B can use
the statistics to optimize its query plan.
What changes are included in this PR?
Add the specification to pass statistics through the Arrow C data interface based on the discussion on the
dev@mailing list: https://lists.apache.org/thread/z0jz2bnv61j7c6lbk7lympdrs49f69cxAre these changes tested?
Yes.
Are there any user-facing changes?
Yes.