From 4a5446296fcc9f00ea1a4ee11fa66f10f5542e11 Mon Sep 17 00:00:00 2001
From: Sutou Kouhei
Date: Mon, 5 Aug 2024 18:19:50 +0900
Subject: [PATCH 01/19] GH-38837: [Format] Add the specification to pass statistics through the Arrow C data interface

---
 .../format/CDataInterfaceStatistics.rst | 192 ++++++++++++++++++
 docs/source/format/Columnar.rst | 1 +
 docs/source/format/index.rst | 1 +
 3 files changed, 194 insertions(+)
 create mode 100644 docs/source/format/CDataInterfaceStatistics.rst

diff --git a/docs/source/format/CDataInterfaceStatistics.rst b/docs/source/format/CDataInterfaceStatistics.rst
new file mode 100644
index 00000000000..6ebf66962f1
--- /dev/null
+++ b/docs/source/format/CDataInterfaceStatistics.rst
@@ -0,0 +1,192 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. _c-data-interface-statistics:
+
+=====================================================
+Passing statistics through the Arrow C data interface
+=====================================================
+
+Rationale
+=========
+
+Statistics are useful for fast query processing. Many query engines
+use statistics to optimize their query plan.
+
+Apache Arrow format doesn't have statistics but other formats that can
+be read as Apache Arrow data may have statistics. For example, Apache
+Parquet C++ can read Apache Parquet file as Apache Arrow data and
+Apache Parquet file may have statistics.
+
+One of the Arrow C data interface use cases is the following:
+
+1. Module A reads Apache Parquet file as Apache Arrow data
+2. Module A passes the read Apache Arrow data to module B through the
+   Arrow C data interface
+3. Module B processes the passed Apache Arrow data
+
+If module A can pass the statistics associated with the Apache Parquet
+file to module B through the Arrow C data interface, module B can use
+the statistics to optimize its query plan.
+
+Goals
+-----
+
+* Provide the standard way to pass statistics through the Arrow C data
+  interface to avoid reinventing the wheel.
+* The standard way must be easy to use with the Arrow C data interface.
+
+Non-goals
+---------
+
+* Provide a common way to pass statistics that can be used for
+  other interfaces such Arrow Flight too.
+
+For example, ADBC has the statistics related APIs. This specification
+doesn't replace them.
+
+.. _c-data-interface-statistics-schema:
+
+Schema
+======
+
+This specification provides only the schema for statistics. Producers
+passes statistics as a map Arrow array that uses the schema through
+the Arrow C data interface.
+
+Here is the schema for a statistics map Arrow Array:
+
+.. list-table::
+   :header-rows: 1
+
+   * - Key or items
+     - Data type
+     - Nullable
+     - Notes
+   * - key
+     - ``int32``
+     - ``true``
+     - The column index or null if the statistics refer to whole table
+       or record batch.
+   * - items
+     - ``map``
+     - ``false``
+     - Statistics for the target column, table or record batch. See
+       the separated table for details.
+
+Here is the schema for the statistics map:
+
+.. list-table::
+   :header-rows: 1
+
+   * - Key or items
+     - Data type
+     - Nullable
+     - Notes
+   * - key
+     - ``dictionary``
+     - ``false``
+     - Statistics key is string. Dictionary is used for
+       efficiency. Different keys are assigned for exact value and
+       approximate value. See also the separated description for
+       statistics key.
+   * - items
+     - ``dense_union``
+     - ``false``
+     - Statistics value is dense union. It has at least all needed
+       types based on statistics kinds in the keys. For example, you
+       need at least ``int64`` and ``float64`` types when you have a
+       ``int64`` distinct count statistic and a ``float64`` average
+       byte width statistic. See also the separated description for
+       statistics key.
+
+       We don't standardize field names for the dense union because
+       consumers can access to proper field by index not name. So
+       producers can use any valid name for fields.
+
+.. _c-data-interface-statistics-key:
+
+Statistics key
+--------------
+
+Statistics key is string. ``dictionary`` is used for
+efficiency.
+
+We assign different statistics keys for variants instead of using
+flags. For example, we assign different statistics keys for exact
+value and approximate value.
+
+The colon symbol ``:`` is to be used as a namespace separator like
+:ref:`format_metadata`. It can be used multiple times in a key.
+
+The ``ARROW`` pattern is a reserved namespace for pre-defined
+statistics keys. User-defined statistics must not use it.
+
+Here are pre-defined statistics keys:
+
+.. list-table::
+   :header-rows: 1
+
+   * - Key
+     - Data type
+     - Notes
+   * - ``ARROW:average_byte_width:exact``
+     - ``float``
+     - The average size in bytes of a row in the target. (exact)
+   * - ``ARROW:average_byte_width:approximate``
+     - ``float64``
+     - The average size in bytes of a row in the target. (approximate)
+   * - ``ARROW:distinct_count:exact``
+     - ``int64``
+     - The number of distinct values in the target. (exact)
+   * - ``ARROW:distinct_count:approximate``
+     - ``float64``
+     - The number of distinct values in the target. (approximate)
+   * - ``ARROW:max_byte_width:exact``
+     - ``int64``
+     - The maximum size in bytes of a row in the target. (exact)
+   * - ``ARROW:max_byte_width:approximate``
+     - ``float64``
+     - The maximum size in bytes of a row in the target. (approximate)
+   * - ``ARROW:max_value:exact``
+     - Target dependent
+     - The maximum value in the target. (exact)
+   * - ``ARROW:max_value:approximate``
+     - Target dependent
+     - The maximum value in the target. (approximate)
+   * - ``ARROW:min_value:exact``
+     - Target dependent
+     - The minimum value in the target. (exact)
+   * - ``ARROW:min_value:approximate``
+     - Target dependent
+     - The minimum value in the target. (approximate)
+   * - ``ARROW:row_count:exact``
+     - ``int64``
+     - The number of rows in the target table or record batch. (exact)
+   * - ``ARROW:row_count:approximate``
+     - ``float64``
+     - The number of rows in the target table or record
+       batch. (approximate)
+
+If you find a missing statistics key that is usable for multiple
+systems, please propose it on the `Arrow development mailing-list
+`__.
+
+Examples
+--------
+
+TODO: Add at least C++ example.
diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst
index 33c937ea348..9ef6a933528 100644
--- a/docs/source/format/Columnar.rst
+++ b/docs/source/format/Columnar.rst
@@ -1619,6 +1619,7 @@ example as above, an alternate encoding could be: ::
 0 EOS

+.. _format_metadata:

 Custom Application Metadata
 ---------------------------
diff --git a/docs/source/format/index.rst b/docs/source/format/index.rst
index ce31a15a1f3..f3ebfafe0bd 100644
--- a/docs/source/format/index.rst
+++ b/docs/source/format/index.rst
@@ -30,6 +30,7 @@ Specifications
    CanonicalExtensions
    Other
    CDataInterface
+   CDataInterfaceStatistics
    CStreamInterface
    CDeviceDataInterface
    DissociatedIPC

From ae37a22f97b7334a5ea94f3398495aa2f48c0253 Mon Sep 17 00:00:00 2001
From: Sutou Kouhei
Date: Wed, 7 Aug 2024 11:39:56 +0900
Subject: [PATCH 02/19] Improve wording

Co-authored-by: Ian Cook
---
 .../format/CDataInterfaceStatistics.rst | 39 ++++++++++---------
 1 file changed, 20 insertions(+), 19 deletions(-)

diff --git a/docs/source/format/CDataInterfaceStatistics.rst b/docs/source/format/CDataInterfaceStatistics.rst
index 6ebf66962f1..f884826164c 100644
--- a/docs/source/format/CDataInterfaceStatistics.rst
+++ b/docs/source/format/CDataInterfaceStatistics.rst
@@ -28,16 +28,16 @@ Statistics are useful for fast query processing. Many query engines
 use statistics to optimize their query plan.

 Apache Arrow format doesn't have statistics but other formats that can
-be read as Apache Arrow data may have statistics. For example, Apache
-Parquet C++ can read Apache Parquet file as Apache Arrow data and
-Apache Parquet file may have statistics.
+be read as Apache Arrow data may have statistics. For example, the
+Apache Parquet C++ implementation can read an Apache Parquet file as
+Apache Arrow data and the Apache Parquet file may have statistics.

 One of the Arrow C data interface use cases is the following:

-1. Module A reads Apache Parquet file as Apache Arrow data
+1. Module A reads Apache Parquet file as Apache Arrow data.
 2. Module A passes the read Apache Arrow data to module B through the
-   Arrow C data interface
-3. Module B processes the passed Apache Arrow data
+   Arrow C data interface.
+3. Module B processes the passed Apache Arrow data.

 If module A can pass the statistics associated with the Apache Parquet
 file to module B through the Arrow C data interface, module B can use
@@ -46,9 +46,10 @@ the statistics to optimize its query plan.
 Goals
 -----

-* Provide the standard way to pass statistics through the Arrow C data
-  interface to avoid reinventing the wheel.
-* The standard way must be easy to use with the Arrow C data interface.
+* Establish a standard way to pass statistics through the Arrow C data
+  interface.
+* Provide this in a manner that enables compatibility and ease of
+  implementation for existing users of the Arrow C data interface.

 Non-goals
 ---------
@@ -64,11 +65,11 @@ doesn't replace them.
 Schema
 ======

-This specification provides only the schema for statistics. Producers
-passes statistics as a map Arrow array that uses the schema through
-the Arrow C data interface.
+This specification provides only the schema for statistics. The
+producer passes statistics through the Arrow C data interface as an
+Arrow map array that uses this schema.

-Here is the schema for a statistics map Arrow Array:
+Here is the schema for a statistics Arrow map array:

 .. list-table::
    :header-rows: 1
@@ -80,13 +81,13 @@ Here is the schema for a statistics map Arrow Array:
    * - key
      - ``int32``
      - ``true``
-     - The column index or null if the statistics refer to whole table
-       or record batch.
+     - The zero-based column index, or null if the statistics
+       describe the whole table or record batch.
    * - items
      - ``map``
      - ``false``
      - Statistics for the target column, table or record batch. See
-       the separated table for details.
+       the separate table below for details.

 Here is the schema for the statistics map:

@@ -102,7 +103,7 @@ Here is the schema for the statistics map:
      - ``false``
      - Statistics key is string. Dictionary is used for
        efficiency. Different keys are assigned for exact value and
-       approximate value. See also the separated description for
+       approximate value. Also see the separate description below for
        statistics key.
    * - items
      - ``dense_union``
@@ -111,8 +112,8 @@ Here is the schema for the statistics map:
        types based on statistics kinds in the keys. For example, you
        need at least ``int64`` and ``float64`` types when you have a
        ``int64`` distinct count statistic and a ``float64`` average
-       byte width statistic. See also the separated description for
-       statistics key.
+       byte width statistic. Also see the separate description below
+       for statistics key.

        We don't standardize field names for the dense union because
        consumers can access to proper field by index not name. So

From 190ad787710e3eaff20dff64b60c074f5fa130b3 Mon Sep 17 00:00:00 2001
From: Sutou Kouhei
Date: Wed, 7 Aug 2024 11:58:10 +0900
Subject: [PATCH 03/19] Improve schema description

---
 .../format/CDataInterfaceStatistics.rst | 20 ++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/docs/source/format/CDataInterfaceStatistics.rst b/docs/source/format/CDataInterfaceStatistics.rst
index f884826164c..6a6544471bb 100644
--- a/docs/source/format/CDataInterfaceStatistics.rst
+++ b/docs/source/format/CDataInterfaceStatistics.rst
@@ -69,7 +69,20 @@ This specification provides only the schema for statistics. The
 producer passes statistics through the Arrow C data interface as an
 Arrow map array that uses this schema.

-Here is the schema for a statistics Arrow map array:
+Here is the outline of the schema for statistics::
+
+    map<
+      key: int32,
+      items: map<
+        key: dictionary<
+          indices: int32,
+          dictionary: utf8
+        >,
+        items: dense_union<...all needed types...>,
+      >
+    >
+
+Here is the details of the top-level ``map``:

 .. list-table::
    :header-rows: 1
@@ -89,7 +102,8 @@ Here is the schema for a statistics Arrow map array:
      - Statistics for the target column, table or record batch. See
        the separate table below for details.

-Here is the schema for the statistics map:
+Here is the details of the nested ``map`` as the items part of the
+above ``map``:

 .. list-table::
    :header-rows: 1
@@ -99,7 +113,7 @@ Here is the schema for the statistics map:
      - Nullable
      - Notes
    * - key
-     - ``dictionary``
      - ``false``
      - Statistics key is string. Dictionary is used for
        efficiency. Different keys are assigned for exact value and

From 99ea21ea736f1044e8b8366a8e6ee9b86492b32c Mon Sep 17 00:00:00 2001
From: Sutou Kouhei
Date: Wed, 7 Aug 2024 12:02:04 +0900
Subject: [PATCH 04/19] Link to ADBC's documentation

---
 docs/source/format/CDataInterfaceStatistics.rst | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/docs/source/format/CDataInterfaceStatistics.rst b/docs/source/format/CDataInterfaceStatistics.rst
index 6a6544471bb..c22b9242f1b 100644
--- a/docs/source/format/CDataInterfaceStatistics.rst
+++ b/docs/source/format/CDataInterfaceStatistics.rst
@@ -57,8 +57,9 @@ Non-goals
 * Provide a common way to pass statistics that can be used for
   other interfaces such Arrow Flight too.

-For example, ADBC has the statistics related APIs. This specification
-doesn't replace them.
+For example, ADBC has `the statistics related APIs
+`__.
+This specification doesn't replace them.

 .. _c-data-interface-statistics-schema:


From 5df2c672d4a6c122ae06b558e5cf76bc98568413 Mon Sep 17 00:00:00 2001
From: Sutou Kouhei
Date: Wed, 7 Aug 2024 12:56:05 +0900
Subject: [PATCH 05/19] Add can we use this for the Arrow IPC format

---
 docs/source/format/CDataInterfaceStatistics.rst | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/docs/source/format/CDataInterfaceStatistics.rst b/docs/source/format/CDataInterfaceStatistics.rst
index c22b9242f1b..35e597c1d58 100644
--- a/docs/source/format/CDataInterfaceStatistics.rst
+++ b/docs/source/format/CDataInterfaceStatistics.rst
@@ -32,7 +32,7 @@ be read as Apache Arrow data may have statistics. For example, the
 Apache Parquet C++ implementation can read an Apache Parquet file as
 Apache Arrow data and the Apache Parquet file may have statistics.

-One of the Arrow C data interface use cases is the following:
+One of :ref:`c-data-interface` use cases is the following:

 1. Module A reads Apache Parquet file as Apache Arrow data.
 2. Module A passes the read Apache Arrow data to module B through the
@@ -61,6 +61,17 @@ For example, ADBC has `the statistics related APIs
 `__.
 This specification doesn't replace them.

+This specification may fit some use cases of :ref:`format-ipc` not the
+Arrow data interface. But we don't recommend this specification for
+the Arrow IPC format for now. Because we may be able to define better
+specification for the Arrow IPC format. The Arrow IPC format has some
+different features compared with the Arrow C data interface. For
+example, the Arrow IPC format can have :ref:`ipc-message-format
+metadata for each message`. If you're interested in the specification
+for passing statistics through the Arrow IPC format, please start a
+discussion on the `Arrow development mailing-list
+`__.
+
 .. _c-data-interface-statistics-schema:

 Schema

From 56d1b166f05bc417ac93959ca6bb1288e118eab1 Mon Sep 17 00:00:00 2001
From: Sutou Kouhei
Date: Tue, 27 Aug 2024 09:09:07 +0900
Subject: [PATCH 06/19] Remove outer map

---
 .../format/CDataInterfaceStatistics.rst | 27 +++++++++----------
 1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/docs/source/format/CDataInterfaceStatistics.rst b/docs/source/format/CDataInterfaceStatistics.rst
index 35e597c1d58..637c2c8b3f4 100644
--- a/docs/source/format/CDataInterfaceStatistics.rst
+++ b/docs/source/format/CDataInterfaceStatistics.rst
@@ -83,39 +83,36 @@ Arrow map array that uses this schema.

 Here is the outline of the schema for statistics::

-    map<
-      key: int32,
-      items: map<
-        key: dictionary<
-          indices: int32,
-          dictionary: utf8
-        >,
-        items: dense_union<...all needed types...>,
-      >
+    column: int32,
+    values: map<
+      key: dictionary<
+        indices: int32,
+        dictionary: utf8
+      >,
+      items: dense_union<...all needed types...>,
     >

-Here is the details of the top-level ``map``:
+Here is the details of top-level columns:

 .. list-table::
    :header-rows: 1

-   * - Key or items
+   * - Name
      - Data type
      - Nullable
      - Notes
-   * - key
+   * - ``column``
      - ``int32``
      - ``true``
      - The zero-based column index, or null if the statistics
        describe the whole table or record batch.
-   * - items
+   * - ``values``
      - ``map``
      - ``false``
      - Statistics for the target column, table or record batch. See
        the separate table below for details.

-Here is the details of the nested ``map`` as the items part of the
-above ``map``:
+Here is the details of the ``map`` of the ``values``:

 .. list-table::
    :header-rows: 1

From 7780f8ffb7ed71af408454ace24d8595b39775b1 Mon Sep 17 00:00:00 2001
From: Sutou Kouhei
Date: Mon, 9 Sep 2024 11:24:59 +0900
Subject: [PATCH 07/19] Add missing top-level struct

---
 .../source/format/CDataInterfaceStatistics.rst | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/docs/source/format/CDataInterfaceStatistics.rst b/docs/source/format/CDataInterfaceStatistics.rst
index 637c2c8b3f4..716e2dcfcef 100644
--- a/docs/source/format/CDataInterfaceStatistics.rst
+++ b/docs/source/format/CDataInterfaceStatistics.rst
@@ -83,16 +83,18 @@ Arrow map array that uses this schema.

 Here is the outline of the schema for statistics::

-    column: int32,
-    values: map<
-      key: dictionary<
-        indices: int32,
-        dictionary: utf8
-      >,
-      items: dense_union<...all needed types...>,
+    struct<
+      column: int32,
+      values: map<
+        key: dictionary<
+          indices: int32,
+          dictionary: utf8
+        >,
+        items: dense_union<...all needed types...>,
+      >
     >

-Here is the details of top-level columns:
+Here is the details of top-level ``struct``:

 .. list-table::
    :header-rows: 1

From eae18a6587b7033e1c6ef22eae1b72baa9539e56 Mon Sep 17 00:00:00 2001
From: Sutou Kouhei
Date: Mon, 30 Sep 2024 17:19:02 +0900
Subject: [PATCH 08/19] Add ARROW:null_count:{exact,approximate}

---
 docs/source/format/CDataInterfaceStatistics.rst | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/docs/source/format/CDataInterfaceStatistics.rst b/docs/source/format/CDataInterfaceStatistics.rst
index 716e2dcfcef..8a89160355a 100644
--- a/docs/source/format/CDataInterfaceStatistics.rst
+++ b/docs/source/format/CDataInterfaceStatistics.rst
@@ -200,6 +200,12 @@ Here are pre-defined statistics keys:
    * - ``ARROW:min_value:approximate``
      - Target dependent
      - The minimum value in the target. (approximate)
+   * - ``ARROW:null_count:exact``
+     - ``int64``
+     - The number of nulls in the target. (exact)
+   * - ``ARROW:null_count:approximate``
+     - ``float64``
+     - The number of nulls in the target. (approximate)
    * - ``ARROW:row_count:exact``
      - ``int64``
      - The number of rows in the target table or record batch. (exact)

From 189bda39f583d6c6dd0339e8731c312a9f2363b6 Mon Sep 17 00:00:00 2001
From: Sutou Kouhei
Date: Mon, 30 Sep 2024 17:20:07 +0900
Subject: [PATCH 09/19] Use "statistics" not "values"

---
 docs/source/format/CDataInterfaceStatistics.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/source/format/CDataInterfaceStatistics.rst b/docs/source/format/CDataInterfaceStatistics.rst
index 8a89160355a..bd2f9353ad7 100644
--- a/docs/source/format/CDataInterfaceStatistics.rst
+++ b/docs/source/format/CDataInterfaceStatistics.rst
@@ -85,7 +85,7 @@ Here is the outline of the schema for statistics::

     struct<
       column: int32,
-      values: map<
+      statistics: map<
         key: dictionary<
           indices: int32,
           dictionary: utf8
@@ -108,13 +108,13 @@ Here is the details of top-level ``struct``:
      - ``true``
      - The zero-based column index, or null if the statistics
        describe the whole table or record batch.
-   * - ``values``
+   * - ``statistics``
      - ``map``
      - ``false``
      - Statistics for the target column, table or record batch. See
        the separate table below for details.

-Here is the details of the ``map`` of the ``values``:
+Here is the details of the ``map`` of the ``statistics``:

 .. list-table::
    :header-rows: 1

From 9f2f0170beebad00ac16c1a24458bc9107b3234c Mon Sep 17 00:00:00 2001
From: Sutou Kouhei
Date: Wed, 4 Dec 2024 14:33:23 +0900
Subject: [PATCH 10/19] Add a C++ example

---
 .../parquet/parquet_dump_arrow_statistics.cc | 2 +
 .../format/CDataInterfaceStatistics.rst | 57 +++++++++++++++++--
 2 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/cpp/tools/parquet/parquet_dump_arrow_statistics.cc b/cpp/tools/parquet/parquet_dump_arrow_statistics.cc
index 8aeced94f6a..f121e2fe0b9 100644
--- a/cpp/tools/parquet/parquet_dump_arrow_statistics.cc
+++ b/cpp/tools/parquet/parquet_dump_arrow_statistics.cc
@@ -23,6 +23,7 @@
 #include

 namespace {
+// doc: start: print-arrow-statistics
 arrow::Status PrintArrowStatistics(const char* path) {
   ARROW_ASSIGN_OR_RAISE(
       auto input, arrow::io::MemoryMappedFile::Open(path, arrow::io::FileMode::READ));
@@ -39,6 +40,7 @@ arrow::Status PrintArrowStatistics(const char* path) {
   }
   return arrow::Status::OK();
 }
+// doc: end: print-arrow-statistics
 };  // namespace

 int main(int argc, char** argv) {
diff --git a/docs/source/format/CDataInterfaceStatistics.rst b/docs/source/format/CDataInterfaceStatistics.rst
index bd2f9353ad7..f52c0cbde64 100644
--- a/docs/source/format/CDataInterfaceStatistics.rst
+++ b/docs/source/format/CDataInterfaceStatistics.rst
@@ -66,9 +66,9 @@ Arrow data interface. But we don't recommend this specification for
 the Arrow IPC format for now. Because we may be able to define better
 specification for the Arrow IPC format. The Arrow IPC format has some
 different features compared with the Arrow C data interface. For
-example, the Arrow IPC format can have :ref:`ipc-message-format
-metadata for each message`. If you're interested in the specification
-for passing statistics through the Arrow IPC format, please start a
+example, the Arrow IPC format can have :ref:`metadata for each message
+`. If you're interested in the specification for
+passing statistics through the Arrow IPC format, please start a
 discussion on the `Arrow development mailing-list
 `__.

@@ -219,6 +219,51 @@ systems, please propose it on the `Arrow development mailing-list
 `__.

 Examples
---------
-
-TODO: Add at least C++ example.
+========
+
+Here are some examples to help you understand.
+
+C++
+---
+
+The C++ implementation provides convenience features to create a
+statistics array.
+
+You can attach statistics to an :cpp:class:`arrow::Array`. Statistics
+of an array is represented as :cpp:class:`arrow::ArrayStatistics`.
+
+If you build :cpp:class:`arrow::Array` s from a Parquet file, you
+don't need to attach statistics in a Parquet file
+explicitly. :cpp:class:`parquet::arrow::FileReader` attaches
+statistics in a Parquet file automatically.
+
+If you have a :cpp:class:`arrow::RecordBatch` that has
+:cpp:class:`arrow::Array` that has statistics, you can use
+:cpp:func:`arrow::RecordBatch::MakeStatisticsArray()`. It builds an
+:cpp:class:`arrow::Array` for statistics from attached statistics. The
+built statistics array uses the statistics schema defined in this
+documentation.
+
+Here is an example that reads record batches from a Parquet file and
+prints statistics array for each record batch. Each record batch has
+associated statistics when the Parquet file has statistics. The
+important part of this example is
+:cpp:func:`arrow::RecordBatch::MakeStatisticsArray` call. You can
+build a statistics :cpp:class:`arrow::Array` easily by it.
+
+.. literalinclude:: ../../../cpp/tools/parquet/parquet_dump_arrow_statistics.cc
+   :language: cpp
+   :start-after: doc: start: print-arrow-statistics
+   :end-before: doc: end: print-arrow-statistics
+
+You can pass a statistics :cpp:class:`arrow::Array` created by
+:cpp:func:`arrow::RecordBatch::MakeStatisticsArray` to another system
+in the same process with the normal C data interface. For example, you
+can use :cpp:func:`arrow::ExportArray` to export a statistics
+:cpp:class:`arrow::Array`:
+
+.. code-block:: cpp
+
+   ArrowArray exported_statistics_array;
+   arrow::Status status = arrow::ExportArray(*statistics_array, &exported_statistics_array);
+   // Pass exported_statistics_array to other system.

From 4644418bf5020871cf0f3e71c3b4824b7efb7eac Mon Sep 17 00:00:00 2001
From: Sutou Kouhei
Date: Thu, 5 Dec 2024 10:21:01 +0900
Subject: [PATCH 11/19] Add experimental warning

---
 docs/source/format/CDataInterfaceStatistics.rst | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/source/format/CDataInterfaceStatistics.rst b/docs/source/format/CDataInterfaceStatistics.rst
index f52c0cbde64..00793a6144a 100644
--- a/docs/source/format/CDataInterfaceStatistics.rst
+++ b/docs/source/format/CDataInterfaceStatistics.rst
@@ -21,6 +21,8 @@
 Passing statistics through the Arrow C data interface
 =====================================================

+.. warning:: This specification should be considered experimental.
+
 Rationale
 =========


From 595d916678cacdebb58bde2dfdb0d1ca43a5fb1f Mon Sep 17 00:00:00 2001
From: Sutou Kouhei
Date: Wed, 11 Dec 2024 15:08:03 +0900
Subject: [PATCH 12/19] index -> type code

---
 docs/source/format/CDataInterfaceStatistics.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/format/CDataInterfaceStatistics.rst b/docs/source/format/CDataInterfaceStatistics.rst
index 00793a6144a..a3b298ed029 100644
--- a/docs/source/format/CDataInterfaceStatistics.rst
+++ b/docs/source/format/CDataInterfaceStatistics.rst
@@ -143,7 +143,7 @@ Here is the details of the ``map`` of the ``statistics``:
        for statistics key.

        We don't standardize field names for the dense union because
-       consumers can access to proper field by index not name. So
+       consumers can access to proper field by type code not name. So
        producers can use any valid name for fields.

 .. _c-data-interface-statistics-key:

From 5fbddf4332e37644f07a9712f1789fb5d2dcb85b Mon Sep 17 00:00:00 2001
From: Sutou Kouhei
Date: Wed, 11 Dec 2024 15:15:40 +0900
Subject: [PATCH 13/19] Clarify wording

---
 docs/source/format/CDataInterfaceStatistics.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/format/CDataInterfaceStatistics.rst b/docs/source/format/CDataInterfaceStatistics.rst
index a3b298ed029..188959124c7 100644
--- a/docs/source/format/CDataInterfaceStatistics.rst
+++ b/docs/source/format/CDataInterfaceStatistics.rst
@@ -154,7 +154,7 @@ Statistics key
 Statistics key is string. ``dictionary`` is used for
 efficiency.

-We assign different statistics keys for variants instead of using
+We assign different statistics keys for individual statistics instead of using
 flags. For example, we assign different statistics keys for exact
 value and approximate value.


From 6815b998a826740ae481920fcecd968bcb2125d7 Mon Sep 17 00:00:00 2001
From: Sutou Kouhei
Date: Wed, 11 Dec 2024 15:16:34 +0900
Subject: [PATCH 14/19] Add user-defined statistics example

---
 docs/source/format/CDataInterfaceStatistics.rst | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/source/format/CDataInterfaceStatistics.rst b/docs/source/format/CDataInterfaceStatistics.rst
index 188959124c7..008bc0eb024 100644
--- a/docs/source/format/CDataInterfaceStatistics.rst
+++ b/docs/source/format/CDataInterfaceStatistics.rst
@@ -163,6 +163,8 @@ The colon symbol ``:`` is to be used as a namespace separator like
 :ref:`format_metadata`. It can be used multiple times in a key.

 The ``ARROW`` pattern is a reserved namespace for pre-defined
 statistics keys. User-defined statistics must not use it.
+For example, you can use your product name as namespace
+such as `MY_PRODUCT:my_statistics:exact`.

 Here are pre-defined statistics keys:


From f744c8040df5d25b766148e7eb94d79479bf672b Mon Sep 17 00:00:00 2001
From: Sutou Kouhei
Date: Wed, 11 Dec 2024 15:20:04 +0900
Subject: [PATCH 15/19] Fix a typo

---
 docs/source/format/CDataInterfaceStatistics.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/format/CDataInterfaceStatistics.rst b/docs/source/format/CDataInterfaceStatistics.rst
index 008bc0eb024..3b31409484e 100644
--- a/docs/source/format/CDataInterfaceStatistics.rst
+++ b/docs/source/format/CDataInterfaceStatistics.rst
@@ -175,7 +175,7 @@ Here are pre-defined statistics keys:
      - Data type
      - Notes
    * - ``ARROW:average_byte_width:exact``
-     - ``float``
+     - ``float64``
      - The average size in bytes of a row in the target. (exact)
    * - ``ARROW:average_byte_width:approximate``
      - ``float64``

From ed0cbe2bd3de8a21b2bb676ee8a9c0c708eb8b63 Mon Sep 17 00:00:00 2001
From: Sutou Kouhei
Date: Wed, 11 Dec 2024 16:55:24 +0900
Subject: [PATCH 16/19] Update

* Add the original DuckDB use case
* Add TODOs
* Clarify "column index": It uses the flattened index
* Clarify statistics target
* Use concrete data not C++ for examples
---
 .../format/CDataInterfaceStatistics.rst | 185 ++++++++++++------
 1 file changed, 123 insertions(+), 62 deletions(-)

diff --git a/docs/source/format/CDataInterfaceStatistics.rst b/docs/source/format/CDataInterfaceStatistics.rst
index 3b31409484e..0eebde8331c 100644
--- a/docs/source/format/CDataInterfaceStatistics.rst
+++ b/docs/source/format/CDataInterfaceStatistics.rst
@@ -34,20 +34,34 @@ be read as Apache Arrow data may have statistics. For example, the
 Apache Parquet C++ implementation can read an Apache Parquet file as
 Apache Arrow data and the Apache Parquet file may have statistics.

-One of :ref:`c-data-interface` use cases is the following:
+Use case
+--------
+
+One of :ref:`c-stream-interface` use cases is the following:

 1. Module A reads Apache Parquet file as Apache Arrow data.
 2. Module A passes the read Apache Arrow data to module B through the
-   Arrow C data interface.
+   Arrow C stream interface.
 3. Module B processes the passed Apache Arrow data.

 If module A can pass the statistics associated with the Apache Parquet
-file to module B through the Arrow C data interface, module B can use
-the statistics to optimize its query plan.
+file to module B through the Arrow C stream interface, module B can
+use the statistics to optimize its query plan.
+
+For example, DuckDB uses this approach but DuckDB couldn't use
+statistics because there wasn't the standardized way to pass
+statistics.
+
+.. seealso::
+
+   `duckdb::ArrowTableFunction::ArrowScanBind() in DuckDB 1.1.3
+   `_

 Goals
 -----

+TODO: Remove the C data interface limitation?
+
 * Establish a standard way to pass statistics through the Arrow C data
   interface.
 * Provide this in a manner that enables compatibility and ease of
@@ -56,6 +70,8 @@ Goals
 Non-goals
 ---------

+TODO: Remove the C data interface limitation?
+
 * Provide a common way to pass statistics that can be used for
   other interfaces such Arrow Flight too.

@@ -92,7 +108,7 @@ Here is the outline of the schema for statistics::
           indices: int32,
           dictionary: utf8
         >,
-        items: dense_union<...all needed types...>,
+        items: dense_union<...all needed types...>
       >
     >

@@ -110,6 +126,9 @@ Here is the details of top-level ``struct``:
      - ``true``
      - The zero-based column index, or null if the statistics
        describe the whole table or record batch.
+
+       The column index is computed as the same rule used by
+       :ref:`ipc-recordbatch-message`.
    * - ``statistics``
      - ``map``
      - ``false``
@@ -176,40 +195,40 @@ Here are pre-defined statistics keys:
      - Notes
    * - ``ARROW:average_byte_width:exact``
      - ``float64``
-     - The average size in bytes of a row in the target. (exact)
+     - The average size in bytes of a row in the target column. (exact)
    * - ``ARROW:average_byte_width:approximate``
-     - ``float64``
-     - The average size in bytes of a row in the target. (approximate)
+     - ``float64``: TODO: Should we use ``int64`` instead?
+     - The average size in bytes of a row in the target column. (approximate)
    * - ``ARROW:distinct_count:exact``
      - ``int64``
-     - The number of distinct values in the target. (exact)
+     - The number of distinct values in the target column. (exact)
    * - ``ARROW:distinct_count:approximate``
      - ``float64``
-     - The number of distinct values in the target. (approximate)
+     - The number of distinct values in the target column. (approximate)
    * - ``ARROW:max_byte_width:exact``
      - ``int64``
-     - The maximum size in bytes of a row in the target. (exact)
+     - The maximum size in bytes of a row in the target column. (exact)
    * - ``ARROW:max_byte_width:approximate``
      - ``float64``
-     - The maximum size in bytes of a row in the target. (approximate)
+     - The maximum size in bytes of a row in the target column. (approximate)
    * - ``ARROW:max_value:exact``
      - Target dependent
-     - The maximum value in the target. (exact)
+     - The maximum value in the target column. (exact)
    * - ``ARROW:max_value:approximate``
      - Target dependent
-     - The maximum value in the target. (approximate)
+     - The maximum value in the target column. (approximate)
    * - ``ARROW:min_value:exact``
      - Target dependent
-     - The minimum value in the target. (exact)
+     - The minimum value in the target column. (exact)
    * - ``ARROW:min_value:approximate``
      - Target dependent
-     - The minimum value in the target. (approximate)
+     - The minimum value in the target column. (approximate)
    * - ``ARROW:null_count:exact``
      - ``int64``
-     - The number of nulls in the target. (exact)
+     - The number of nulls in the target column. (exact)
    * - ``ARROW:null_count:approximate``
      - ``float64``
-     - The number of nulls in the target. (approximate)
+     - The number of nulls in the target column. (approximate)
    * - ``ARROW:row_count:exact``
      - ``int64``
      - The number of rows in the target table or record batch. (exact)
@@ -227,47 +246,89 @@ Examples

 Here are some examples to help you understand.

-C++
----
-
-The C++ implementation provides convenience features to create a
-statistics array.
-
-You can attach statistics to an :cpp:class:`arrow::Array`. Statistics
-of an array is represented as :cpp:class:`arrow::ArrayStatistics`.
-
-If you build :cpp:class:`arrow::Array` s from a Parquet file, you
-don't need to attach statistics in a Parquet file
-explicitly. :cpp:class:`parquet::arrow::FileReader` attaches
-statistics in a Parquet file automatically.
-
-If you have a :cpp:class:`arrow::RecordBatch` that has
-:cpp:class:`arrow::Array` that has statistics, you can use
-:cpp:func:`arrow::RecordBatch::MakeStatisticsArray()`. It builds an
-:cpp:class:`arrow::Array` for statistics from attached statistics. The
-built statistics array uses the statistics schema defined in this
-documentation.
-
-Here is an example that reads record batches from a Parquet file and
-prints statistics array for each record batch. Each record batch has
-associated statistics when the Parquet file has statistics. The
-important part of this example is
-:cpp:func:`arrow::RecordBatch::MakeStatisticsArray` call. You can
-build a statistics :cpp:class:`arrow::Array` easily by it.
-
-.. literalinclude:: ../../../cpp/tools/parquet/parquet_dump_arrow_statistics.cc
-   :language: cpp
-   :start-after: doc: start: print-arrow-statistics
-   :end-before: doc: end: print-arrow-statistics
-
-You can pass a statistics :cpp:class:`arrow::Array` created by
-:cpp:func:`arrow::RecordBatch::MakeStatisticsArray` to another system
-in the same process with the normal C data interface. For example, you
-can use :cpp:func:`arrow::ExportArray` to export a statistics
-:cpp:class:`arrow::Array`:
-
-.. code-block:: cpp
-
-   ArrowArray exported_statistics_array;
-   arrow::Status status = arrow::ExportArray(*statistics_array, &exported_statistics_array);
-   // Pass exported_statistics_array to other system.
+Simple record batch
+-------------------
+
+Schema::
+
+    vendor_id: int32
+    passenger_count: int64
+
+Data::
+
+    vendor_id: [5, 1, 5, 1, 5]
+    passenger_count: [1, 1, 2, 0, null]
+
+Statistics schema::
+
+    struct<
+      column: int32,
+      statistics: map<
+        key: dictionary<
+          indices: int32,
+          dictionary: utf8
+        >,
+        items: dense_union
+      >
+    >
+
+Statistics array::
+
+    column: [
+      null, # record batch
+      0, # vendor_id
+      0, # vendor_id
+      0, # vendor_id
+      0, # vendor_id
+      1, # passenger_count
+      1, # passenger_count
+      1, # passenger_count
+      1, # passenger_count
+    ]
+    statistics:
+      key:
+        indices: [
+          0, # "ARROW:row_count:exact"
+          1, # "ARROW:null_count:exact"
+          2, # "ARROW:distinct_count:exact"
+          3, # "ARROW:max_value:exact"
+          4, # "ARROW:min_value:exact"
+          1, # "ARROW:null_count:exact"
+          2, # "ARROW:distinct_count:exact"
+          3, # "ARROW:max_value:exact"
+          4, # "ARROW:min_value:exact"
+        ]
+        dictionary: [
+          "ARROW:row_count:exact",
+          "ARROW:null_count:exact",
+          "ARROW:distinct_count:exact",
+          "ARROW:max_value:exact",
+          "ARROW:min_value:exact",
+        ],
+      items: [
+        5, # record batch: "ARROW:row_count:exact"
+        0, # vendor_id: "ARROW:null_count:exact"
+        2, # vendor_id: "ARROW:distinct_count:exact"
+        5, # vendor_id: "ARROW:max_value:exact"
+        1, # vendor_id: "ARROW:min_value:exact"
+        1, # passenger_count: "ARROW:null_count:exact"
+        3, # passenger_count: "ARROW:distinct_count:exact"
+        2, # passenger_count: "ARROW:max_value:exact"
+        0, # passenger_count: "ARROW:min_value:exact"
+      ]
+
+Complex record batch
+--------------------
+
+TODO: It uses nested type.
+
+
+Simple array
+------------
+
+TODO
+
+Complex array
+-------------
+
+TODO: It uses nested type.
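
Reviewer sketch (not part of the patch): the per-column values in the "Simple record batch" example can be re-derived from the data with a few lines of plain Python. This is only an illustration under stated assumptions: it does not use any Arrow API, the helper names `compute_statistics` and `dictionary_encode` are hypothetical, and it assumes that nulls are excluded from distinct, min, and max values.

```python
# Plain-Python sketch (no Arrow dependency). Helper names are hypothetical.

def compute_statistics(columns):
    """Return (column_index, key, value) rows for a record batch.

    `columns` is a list of (name, values) pairs; None marks a null.
    A column index of None means the row describes the whole record batch.
    Assumption: nulls are ignored for distinct_count, max_value, min_value.
    """
    row_count = len(columns[0][1]) if columns else 0
    rows = [(None, "ARROW:row_count:exact", row_count)]
    for index, (_name, values) in enumerate(columns):
        non_null = [v for v in values if v is not None]
        rows.append((index, "ARROW:null_count:exact", len(values) - len(non_null)))
        rows.append((index, "ARROW:distinct_count:exact", len(set(non_null))))
        if non_null:  # min/max are undefined for an all-null column
            rows.append((index, "ARROW:max_value:exact", max(non_null)))
            rows.append((index, "ARROW:min_value:exact", min(non_null)))
    return rows


def dictionary_encode(keys):
    """Encode keys as (indices, dictionary), in first-seen order."""
    dictionary = []
    positions = {}
    indices = []
    for key in keys:
        if key not in positions:
            positions[key] = len(dictionary)
            dictionary.append(key)
        indices.append(positions[key])
    return indices, dictionary


batch = [
    ("vendor_id", [5, 1, 5, 1, 5]),
    ("passenger_count", [1, 1, 2, 0, None]),
]
rows = compute_statistics(batch)
column = [row[0] for row in rows]
indices, dictionary = dictionary_encode([row[1] for row in rows])
items = [row[2] for row in rows]

print(column)   # flattened column indices, None for the record batch
print(indices)  # dictionary indices of the statistics keys
print(items)    # statistics values, all int64 in this example
```

Each output list has one entry per statistic, so `column`, `indices`, and `items` line up position by position in the same way as the three parallel lists in the statistics array.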
From 50ca4c5714da7b2c5de5f60ddb10c52c8fe75615 Mon Sep 17 00:00:00 2001
From: Sutou Kouhei
Date: Wed, 11 Dec 2024 17:00:31 +0900
Subject: [PATCH 17/19] Add one more TODO

---
 docs/source/format/CDataInterfaceStatistics.rst | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/source/format/CDataInterfaceStatistics.rst b/docs/source/format/CDataInterfaceStatistics.rst
index 0eebde8331c..6d7782f54c7 100644
--- a/docs/source/format/CDataInterfaceStatistics.rst
+++ b/docs/source/format/CDataInterfaceStatistics.rst
@@ -165,6 +165,8 @@ Here is the details of the ``map`` of the ``statistics``:
        consumers can access to proper field by type code not name. So
        producers can use any valid name for fields.
 
+       TODO: Should we standardize field names?
+
 .. _c-data-interface-statistics-key:
 
 Statistics key

From 18255204bb1ff176c68ef6a4cd1cc814362c2642 Mon Sep 17 00:00:00 2001
From: Sutou Kouhei
Date: Wed, 11 Dec 2024 17:02:28 +0900
Subject: [PATCH 18/19] Add one more TODO

---
 docs/source/format/CDataInterfaceStatistics.rst | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/docs/source/format/CDataInterfaceStatistics.rst b/docs/source/format/CDataInterfaceStatistics.rst
index 6d7782f54c7..a2cae90b016 100644
--- a/docs/source/format/CDataInterfaceStatistics.rst
+++ b/docs/source/format/CDataInterfaceStatistics.rst
@@ -79,6 +79,9 @@ For example, ADBC has `the statistics related APIs
 `__.
 This specification doesn't replace them.
 
+TODO: Should we deprecate ADBC's current statistics API and
+redesign it with this specification?
+
 This specification may fit some use cases of :ref:`format-ipc` not the
 Arrow data interface. But we don't recommend this specification for
 the Arrow IPC format for now. Because we may be able to define better

From 2d707414036105920ff2bf5e49aee0128f7eb6a8 Mon Sep 17 00:00:00 2001
From: Sutou Kouhei
Date: Wed, 11 Dec 2024 17:10:23 +0900
Subject: [PATCH 19/19] Fix syntax

---
 docs/source/format/CDataInterfaceStatistics.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/format/CDataInterfaceStatistics.rst b/docs/source/format/CDataInterfaceStatistics.rst
index a2cae90b016..0d15381deb9 100644
--- a/docs/source/format/CDataInterfaceStatistics.rst
+++ b/docs/source/format/CDataInterfaceStatistics.rst
@@ -188,7 +188,7 @@ The colon symbol ``:`` is to be used as a namespace separator like
 The ``ARROW`` pattern is a reserved namespace for pre-defined
 statistics keys. User-defined statistics must not use it.
 For example, you can use your product name as namespace
-such as `MY_PRODUCT:my_statistics:exact`.
+such as ``MY_PRODUCT:my_statistics:exact``.
 
 Here are pre-defined statistics keys: