From 4afe27c5d1f7294d47a58af2b3c2bb36fbd77a2a Mon Sep 17 00:00:00 2001
From: Will Jones
Date: Mon, 23 Jan 2023 12:28:46 -0800
Subject: [PATCH 1/3] add file with NaN in statistics

---
 data/README.md | 1 +
 data/nan_in_stats.parquet | Bin 0 -> 329 bytes
 2 files changed, 1 insertion(+)
 create mode 100755 data/nan_in_stats.parquet

diff --git a/data/README.md b/data/README.md
index b2c5128..e82ead6 100644
--- a/data/README.md
+++ b/data/README.md
@@ -40,6 +40,7 @@
 | overflow_i16_page_cnt.parquet | row group with more than INT16_MAX pages |
 | bloom_filter.bin | deprecated bloom filter binary with binary header and murmur3 hashing |
 | bloom_filter.xxhash.bin | bloom filter binary with thrift header and xxhash hashing |
+| nan_in_stats.parquet | statistics contains nan in max, from PyArrow 0.8.0. See: https://github.com/apache/parquet-format/pull/185 |
 
 TODO: Document what each file is in the table above.
 
diff --git a/data/nan_in_stats.parquet b/data/nan_in_stats.parquet
new file mode 100755
index 0000000000000000000000000000000000000000..28b40443b2d207da2a73ea5f0f2b5abb8d5c7233
GIT binary patch
literal 329
zcmWG=3^EjD5mgXX@c~jSLJSN7HVk0!!5%{Ys261r6%rNG0m+N9iL%K^aKL0>tPl2L
zAR$f#CLqbe$jHp3wt-PbluOc-g@H{{g0VuBNsL8o0i)OoMl}yH1~CpCW@unB8EB#?
zlcbI*g9KY~az<)yq9_xCD3>Y|&{PI77D*XN8LHX^bfOpwgN9N;Vo_mfYKd-gL4iV9
kYEf}!ex8D%p0S>hZm^$YK(L2@h@^}R&~0;oH~<)k07d;VIsgCw

literal 0
HcmV?d00001

From 94c8d18c8b108585f1417411860b14a73ac5d41b Mon Sep 17 00:00:00 2001
From: Will Jones
Date: Wed, 25 Jan 2023 12:15:28 -0800
Subject: [PATCH 2/3] copy down relevant rules into readme

---
 data/README.md | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/data/README.md b/data/README.md
index e82ead6..dd4addd 100644
--- a/data/README.md
+++ b/data/README.md
@@ -38,9 +38,13 @@
 | datapage_v1-snappy-compressed-checksum.parquet | compressed INT32 columns in v1 data pages with a matching CRC |
 | datapage_v1-corrupt-checksum.parquet | uncompressed INT32 columns in v1 data pages with a mismatching CRC |
 | overflow_i16_page_cnt.parquet | row group with more than INT16_MAX pages |
+<<<<<<< HEAD
 | bloom_filter.bin | deprecated bloom filter binary with binary header and murmur3 hashing |
 | bloom_filter.xxhash.bin | bloom filter binary with thrift header and xxhash hashing |
 | nan_in_stats.parquet | statistics contains nan in max, from PyArrow 0.8.0. See: https://github.com/apache/parquet-format/pull/185 |
+=======
+| nan_in_stats.parquet | statistics contains NaN in max, from PyArrow 0.8.0. See note below on "NaN in stats". |
+>>>>>>> ab99cc1 (copy down relevant rules into readme)
 
 TODO: Document what each file is in the table above.
 
@@ -118,3 +122,17 @@ https://github.com/apache/parquet-format/commit/54839ad5e04314c944fed8aa4bc6cf15
 
 `bloom_filter.xxhash.bin` uses the newer xxHash-based bloom filter format as of
 https://github.com/apache/parquet-format/commit/3fb10e00c2204bf1c6cc91e094c59e84cefcee33.
+
+## NaN in stats
+
+Previous versions of the C++ Parquet writer would write NaN values in min and max
+statistics. It has been updated since to ignore NaN values when calculating
+statistics, but for backwards compatibility the following rules were established
+(in [PARQUET-1222](https://github.com/apache/parquet-format/pull/185)):
+
+> For backwards compatibility when reading files:
+> * If the min is a NaN, it should be ignored.
+> * If the max is a NaN, it should be ignored.
+> * If the min is +0, the row group may contain -0 values as well.
+> * If the max is -0, the row group may contain +0 values as well.
+> * When looking for NaN values, min and max should be ignored.

From a44b2d686183f50ddc96264544e703189d914daa Mon Sep 17 00:00:00 2001
From: Will Jones
Date: Thu, 26 Jan 2023 08:34:29 -0800
Subject: [PATCH 3/3] add more explination

---
 data/README.md | 51 ++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 45 insertions(+), 6 deletions(-)

diff --git a/data/README.md b/data/README.md
index dd4addd..072a9d5 100644
--- a/data/README.md
+++ b/data/README.md
@@ -38,13 +38,9 @@
 | datapage_v1-snappy-compressed-checksum.parquet | compressed INT32 columns in v1 data pages with a matching CRC |
 | datapage_v1-corrupt-checksum.parquet | uncompressed INT32 columns in v1 data pages with a mismatching CRC |
 | overflow_i16_page_cnt.parquet | row group with more than INT16_MAX pages |
-<<<<<<< HEAD
 | bloom_filter.bin | deprecated bloom filter binary with binary header and murmur3 hashing |
 | bloom_filter.xxhash.bin | bloom filter binary with thrift header and xxhash hashing |
-| nan_in_stats.parquet | statistics contains nan in max, from PyArrow 0.8.0. See: https://github.com/apache/parquet-format/pull/185 |
-=======
 | nan_in_stats.parquet | statistics contains NaN in max, from PyArrow 0.8.0. See note below on "NaN in stats". |
->>>>>>> ab99cc1 (copy down relevant rules into readme)
 
 TODO: Document what each file is in the table above.
 
@@ -125,8 +121,9 @@ https://github.com/apache/parquet-format/commit/3fb10e00c2204bf1c6cc91e094c59e84
 
 ## NaN in stats
 
-Previous versions of the C++ Parquet writer would write NaN values in min and max
-statistics. It has been updated since to ignore NaN values when calculating
+Prior to version 1.4.0, the C++ Parquet writer would write NaN values in min and
+max statistics. (Correction in [this issue](https://issues.apache.org/jira/browse/PARQUET-1225)).
+It has been updated since to ignore NaN values when calculating
 statistics, but for backwards compatibility the following rules were established
 (in [PARQUET-1222](https://github.com/apache/parquet-format/pull/185)):
 
@@ -136,3 +133,45 @@ statistics, but for backwards compatibility the following rules were established
 > * If the min is +0, the row group may contain -0 values as well.
 > * If the max is -0, the row group may contain +0 values as well.
 > * When looking for NaN values, min and max should be ignored.
+
+The file `nan_in_stats.parquet` was generated with:
+
+```python
+import pyarrow as pa # version 0.8.0
+import pyarrow.parquet as pq
+from numpy import NaN
+
+tab = pa.Table.from_arrays(
+    [pa.array([1.0, NaN])],
+    names="x"
+)
+
+pq.write_table(tab, "nan_in_stats.parquet")
+
+metadata = pq.read_metadata("nan_in_stats.parquet")
+metadata.row_group(0).column(0)
+#
+#   file_offset: 88
+#   file_path:
+#   type: DOUBLE
+#   num_values: 2
+#   path_in_schema: x
+#   is_stats_set: True
+#   statistics:
+#
+#       has_min_max: True
+#       min: 1
+#       max: nan
+#       null_count: 0
+#       distinct_count: 0
+#       num_values: 2
+#       physical_type: DOUBLE
+#   compression: 1
+#   encodings:
+#   has_dictionary_page: True
+#   dictionary_page_offset: 4
+#   data_page_offset: 36
+#   index_page_offset: 0
+#   total_compressed_size: 84
+#   total_uncompressed_size: 80
+```
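
Note for readers of this series: the backwards-compatibility rules quoted in the README could be applied on the read side roughly as in the sketch below. This is only an illustration under stated assumptions, not the logic of any particular Parquet reader. The helper names `usable_bounds` and `may_contain` are hypothetical, and the sketch uses plain Python floats rather than the PyArrow API.

```python
import math
from typing import Optional, Tuple


def usable_bounds(
    stat_min: Optional[float], stat_max: Optional[float]
) -> Tuple[Optional[float], Optional[float]]:
    """Return (lower, upper) bounds that are safe to prune with.

    Hypothetical helper. None means "no usable bound on that side".
    """
    # Rules 1 and 2: a NaN min or max carries no information, so ignore it.
    lower = None if stat_min is None or math.isnan(stat_min) else stat_min
    upper = None if stat_max is None or math.isnan(stat_max) else stat_max
    # Rules 3 and 4: a +0 min may hide -0 values and a -0 max may hide +0
    # values. IEEE comparison already treats -0.0 == 0.0, so this widening
    # only matters to code that compares bit patterns or uses a total order.
    if lower == 0.0:
        lower = -0.0
    if upper == 0.0:
        upper = 0.0
    return lower, upper


def may_contain(
    value: float, stat_min: Optional[float], stat_max: Optional[float]
) -> bool:
    """Return True if a row group with these statistics could hold `value`."""
    # Rule 5: when looking for NaN, min and max say nothing, so never prune.
    if math.isnan(value):
        return True
    lower, upper = usable_bounds(stat_min, stat_max)
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True
```

With the statistics recorded in `nan_in_stats.parquet` (min of 1, max of NaN), `may_contain(5.0, 1.0, float("nan"))` cannot rule the row group out because the NaN max is ignored, while `may_contain(0.5, 1.0, float("nan"))` can still prune on the valid min.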