From 4afe27c5d1f7294d47a58af2b3c2bb36fbd77a2a Mon Sep 17 00:00:00 2001
From: Will Jones <willjones127@gmail.com>
Date: Mon, 23 Jan 2023 12:28:46 -0800
Subject: [PATCH 1/3] add file with NaN in statistics

---
 data/README.md            |   1 +
 data/nan_in_stats.parquet | Bin 0 -> 329 bytes
 2 files changed, 1 insertion(+)
 create mode 100755 data/nan_in_stats.parquet

diff --git a/data/README.md b/data/README.md
index b2c5128..e82ead6 100644
--- a/data/README.md
+++ b/data/README.md
@@ -40,6 +40,7 @@
 | overflow_i16_page_cnt.parquet                  | row group with more than INT16_MAX pages                   |
 | bloom_filter.bin                               | deprecated bloom filter binary with binary header and murmur3 hashing |
 | bloom_filter.xxhash.bin                        | bloom filter binary with thrift header and xxhash hashing    |
+| nan_in_stats.parquet                           | statistics contains nan in max, from PyArrow 0.8.0. See: https://github.com/apache/parquet-format/pull/185 |
 
 TODO: Document what each file is in the table above.
 
diff --git a/data/nan_in_stats.parquet b/data/nan_in_stats.parquet
new file mode 100755
index 0000000000000000000000000000000000000000..28b40443b2d207da2a73ea5f0f2b5abb8d5c7233
GIT binary patch
literal 329
zcmWG=3^EjD5mgXX@c~jSLJSN7HVk0!!5%{Ys261r6%rNG0m+N9iL%K^aKL0>tPl2L
zAR$f#CLqbe$jHp3wt-PbluOc-g@H{{g0VuBNsL8o0i)OoMl}yH1~CpCW@unB8EB#?
zlcbI*g9KY~az<)yq9_xCD3>Y|&{PI77D*XN8LHX^bfOpwgN9N;Vo_mfYKd-gL4iV9
kYEf}!ex8D%p0S>hZm^$YK(L2@h@^}R&~0;oH~<)k07d;VIsgCw

literal 0
HcmV?d00001


From 94c8d18c8b108585f1417411860b14a73ac5d41b Mon Sep 17 00:00:00 2001
From: Will Jones <willjones127@gmail.com>
Date: Wed, 25 Jan 2023 12:15:28 -0800
Subject: [PATCH 2/3] copy down relevant rules into readme

---
 data/README.md | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/data/README.md b/data/README.md
index e82ead6..dd4addd 100644
--- a/data/README.md
+++ b/data/README.md
@@ -38,9 +38,13 @@
 | datapage_v1-snappy-compressed-checksum.parquet | compressed INT32 columns in v1 data pages with a matching CRC          |
 | datapage_v1-corrupt-checksum.parquet           | uncompressed INT32 columns in v1 data pages with a mismatching CRC     |
 | overflow_i16_page_cnt.parquet                  | row group with more than INT16_MAX pages                   |
+<<<<<<< HEAD
 | bloom_filter.bin                               | deprecated bloom filter binary with binary header and murmur3 hashing |
 | bloom_filter.xxhash.bin                        | bloom filter binary with thrift header and xxhash hashing    |
 | nan_in_stats.parquet                           | statistics contains nan in max, from PyArrow 0.8.0. See: https://github.com/apache/parquet-format/pull/185 |
+=======
+| nan_in_stats.parquet                           | statistics contains NaN in max, from PyArrow 0.8.0. See note below on "NaN in stats".  |
+>>>>>>> ab99cc1 (copy down relevant rules into readme)
 
 TODO: Document what each file is in the table above.
 
@@ -118,3 +122,17 @@ https://github.com/apache/parquet-format/commit/54839ad5e04314c944fed8aa4bc6cf15
 
 `bloom_filter.xxhash.bin` uses the newer xxHash-based bloom filter format as of
 https://github.com/apache/parquet-format/commit/3fb10e00c2204bf1c6cc91e094c59e84cefcee33.
+
+## NaN in stats
+
+Previous versions of the C++ Parquet writer would write NaN values in min and max
+statistics. It has been updated since to ignore NaN values when calculating
+statistics, but for backwards compatibility the following rules were established
+(in [PARQUET-1222](https://github.com/apache/parquet-format/pull/185)):
+
+> For backwards compatibility when reading files:
+> * If the min is a NaN, it should be ignored.
+> * If the max is a NaN, it should be ignored.
+> * If the min is +0, the row group may contain -0 values as well.
+> * If the max is -0, the row group may contain +0 values as well.
+> * When looking for NaN values, min and max should be ignored.

From a44b2d686183f50ddc96264544e703189d914daa Mon Sep 17 00:00:00 2001
From: Will Jones <willjones127@gmail.com>
Date: Thu, 26 Jan 2023 08:34:29 -0800
Subject: [PATCH 3/3] add more explanation

---
 data/README.md | 51 ++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 45 insertions(+), 6 deletions(-)

diff --git a/data/README.md b/data/README.md
index dd4addd..072a9d5 100644
--- a/data/README.md
+++ b/data/README.md
@@ -38,13 +38,9 @@
 | datapage_v1-snappy-compressed-checksum.parquet | compressed INT32 columns in v1 data pages with a matching CRC          |
 | datapage_v1-corrupt-checksum.parquet           | uncompressed INT32 columns in v1 data pages with a mismatching CRC     |
 | overflow_i16_page_cnt.parquet                  | row group with more than INT16_MAX pages                   |
-<<<<<<< HEAD
 | bloom_filter.bin                               | deprecated bloom filter binary with binary header and murmur3 hashing |
 | bloom_filter.xxhash.bin                        | bloom filter binary with thrift header and xxhash hashing    |
-| nan_in_stats.parquet                           | statistics contains nan in max, from PyArrow 0.8.0. See: https://github.com/apache/parquet-format/pull/185 |
-=======
 | nan_in_stats.parquet                           | statistics contains NaN in max, from PyArrow 0.8.0. See note below on "NaN in stats".  |
->>>>>>> ab99cc1 (copy down relevant rules into readme)
 
 TODO: Document what each file is in the table above.
 
@@ -125,8 +121,9 @@ https://github.com/apache/parquet-format/commit/3fb10e00c2204bf1c6cc91e094c59e84
 
 ## NaN in stats
 
-Previous versions of the C++ Parquet writer would write NaN values in min and max
-statistics. It has been updated since to ignore NaN values when calculating
+Prior to version 1.4.0, the C++ Parquet writer would write NaN values in min and
+max statistics (corrected in [PARQUET-1225](https://issues.apache.org/jira/browse/PARQUET-1225)).
+It has since been updated to ignore NaN values when calculating
 statistics, but for backwards compatibility the following rules were established
 (in [PARQUET-1222](https://github.com/apache/parquet-format/pull/185)):
 
@@ -136,3 +133,45 @@ statistics, but for backwards compatibility the following rules were established
 > * If the min is +0, the row group may contain -0 values as well.
 > * If the max is -0, the row group may contain +0 values as well.
 > * When looking for NaN values, min and max should be ignored.
+
+The file `nan_in_stats.parquet` was generated with:
+
+```python
+import pyarrow as pa # version 0.8.0
+import pyarrow.parquet as pq
+from numpy import nan
+
+tab = pa.Table.from_arrays(
+    [pa.array([1.0, nan])],
+    names=["x"]
+)
+
+pq.write_table(tab, "nan_in_stats.parquet")
+
+metadata = pq.read_metadata("nan_in_stats.parquet")
+metadata.row_group(0).column(0)
+# <pyarrow._parquet.ColumnChunkMetaData object at 0x7f28539e58f0>
+#   file_offset: 88
+#   file_path: 
+#   type: DOUBLE
+#   num_values: 2
+#   path_in_schema: x
+#   is_stats_set: True
+#   statistics:
+#     <pyarrow._parquet.RowGroupStatistics object at 0x7f28539e5738>
+#       has_min_max: True
+#       min: 1
+#       max: nan
+#       null_count: 0
+#       distinct_count: 0
+#       num_values: 2
+#       physical_type: DOUBLE
+#   compression: 1
+#   encodings: <map object at 0x7f28539eb4e0>
+#   has_dictionary_page: True
+#   dictionary_page_offset: 4
+#   data_page_offset: 36
+#   index_page_offset: 0
+#   total_compressed_size: 84
+#   total_uncompressed_size: 80
+```