Add file with NaN in statistics #35

wjones127 · 2023-01-23T20:29:47Z

File was generated with:

import pyarrow as pa # version 0.8.0
import pyarrow.parquet as pq
from numpy import NaN

tab = pa.Table.from_arrays(
    [pa.array([1.0, NaN])],
    names=["x"]
)

pq.write_table(tab, "nan_in_stats.parquet")

metadata = pq.read_metadata("nan_in_stats.parquet")
metadata.row_group(0).column(0)
# <pyarrow._parquet.ColumnChunkMetaData object at 0x7f28539e58f0>
#   file_offset: 88
#   file_path: 
#   type: DOUBLE
#   num_values: 2
#   path_in_schema: x
#   is_stats_set: True
#   statistics:
#     <pyarrow._parquet.RowGroupStatistics object at 0x7f28539e5738>
#       has_min_max: True
#       min: 1
#       max: nan
#       null_count: 0
#       distinct_count: 0
#       num_values: 2
#       physical_type: DOUBLE
#   compression: 1
#   encodings: <map object at 0x7f28539eb4e0>
#   has_dictionary_page: True
#   dictionary_page_offset: 4
#   data_page_offset: 36
#   index_page_offset: 0
#   total_compressed_size: 84
#   total_uncompressed_size: 80

pitrou · 2023-01-26T13:22:29Z

data/README.md

 | datapage_v1-snappy-compressed-checksum.parquet | compressed INT32 columns in v1 data pages with a matching CRC          |
 | datapage_v1-corrupt-checksum.parquet           | uncompressed INT32 columns in v1 data pages with a mismatching CRC     |
 | overflow_i16_page_cnt.parquet                  | row group with more than INT16_MAX pages                   |
+<<<<<<< HEAD


Hmm, can you remove the merge markers here?

pitrou · 2023-01-26T13:22:51Z

data/README.md

+
+## NaN in stats
+
+Previous versions of the C++ Parquet writer would write NaN values in min and max


"Previous versions" is quite unspecific, can you be more explicit?

I noted the Parquet version in which this changed.

pitrou · 2023-01-26T13:23:35Z

data/README.md

+> * If the max is a NaN, it should be ignored.
+> * If the min is +0, the row group may contain -0 values as well.
+> * If the max is -0, the row group may contain +0 values as well.
+> * When looking for NaN values, min and max should be ignored.


I think it would be useful to add the "File was generated with: [...]" snippet that is part of the PR description.

I was thinking it would always be in the PR history, but you are right that having it in the text would be more convenient.
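
For readers skimming this thread, here is a minimal sketch of how a reader might apply the rules quoted in the diff when deciding whether a row group can be skipped. `may_contain` is a hypothetical helper, not part of pyarrow or the Parquet libraries, and it takes plain Python floats rather than real statistics objects. Treating a NaN min the same way as a NaN max is an assumption, since the quoted excerpt starts at the max rule.

```python
import math
from typing import Optional


def may_contain(value: float,
                stat_min: Optional[float],
                stat_max: Optional[float]) -> bool:
    """Return False only when the statistics prove `value` is absent.

    Hypothetical helper for illustration; not a pyarrow or Parquet API.
    """
    # When looking for NaN values, min and max should be ignored.
    if math.isnan(value):
        return True

    # If the max is a NaN, it should be ignored. Doing the same for a
    # NaN min is an assumption made for symmetry.
    lower = None if stat_min is None or math.isnan(stat_min) else stat_min
    upper = None if stat_max is None or math.isnan(stat_max) else stat_max

    # Python compares floats numerically, so -0.0 == 0.0 and the two
    # signed-zero rules hold without extra code here. A reader that
    # compares raw bytes or sign bits would first need to widen a +0 min
    # to -0 and a -0 max to +0.
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True
```

With the statistics shown in the PR description (min 1, max nan), `may_contain(5.0, 1.0, float("nan"))` keeps the row group because the NaN max is ignored, while `may_contain(0.5, 1.0, float("nan"))` can still prune it using the min alone.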

pitrou

Thanks @wjones127 !

wjones127 mentioned this pull request Jan 23, 2023

GH-28074: [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down apache/arrow#15125

Merged

pitrou reviewed Jan 24, 2023

wjones127 added 2 commits January 25, 2023 12:16

add file with NaN in statistics

4afe27c

copy down relevant rules into readme

94c8d18

wjones127 force-pushed the nan-in-stats branch from f14becb to 94c8d18 January 25, 2023 20:18

pitrou requested changes Jan 26, 2023

add more explination

a44b2d6

wjones127 requested a review from pitrou January 26, 2023 16:36

pitrou approved these changes Jan 30, 2023

pitrou changed the title ~~add file with NaN in statistics~~ Add file with NaN in statistics Jan 30, 2023

pitrou merged commit 33b4e23 into apache:master Jan 30, 2023
