ARROW-3325: [Python][Parquet] Add "read_dictionary" argument to parquet.read_table, ParquetDataset to enable direct-to-DictionaryArray reads #4999
Conversation
Here's a benchmark showing a very common "worst case" for data that dictionary-encodes very well. Full notebook: https://gist.github.com/wesm/450d85e52844aee685c0680111cbb1d7

Summary: (benchmark chart omitted; see the notebook for the results)
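For context, here is a minimal sketch of the kind of comparison the notebook runs. The file name and column name are hypothetical stand-ins for a file with a heavily dictionary-encodable string column:

```python
import time

import pyarrow.parquet as pq

# 'strings.parquet' is a hypothetical file whose string column 'f0'
# dictionary-encodes very well (few unique values, many rows)
t0 = time.time()
pq.read_table('strings.parquet')
print('plain read: {:.2f}s'.format(time.time() - t0))

t0 = time.time()
pq.read_table('strings.parquet', read_dictionary=['f0'])
print('direct-to-dictionary read: {:.2f}s'.format(time.time() - t0))
```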
jorisvandenbossche
left a comment
Impressive benchmarks!
> also added support to pyarrow.table to invoke Table.from_arrays if a list or tuple of arrays is passed. This makes for more natural code IMHO.

I am fine with that; we just need to be aware that this means we can't really add support for a list of rows (there was recently a JIRA about this), and that it deviates from pandas.DataFrame(..) (which treats a list of arrays as a list of rows). But anyway, since Table is a columnar store, it makes total sense for a list of columns to be the prime use case in pa.table(..).
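For illustration, a minimal sketch of the column-oriented behavior being discussed (the array contents and column names are made up):

```python
import pyarrow as pa

# A list of arrays is interpreted as a list of *columns*, not rows
arrays = [pa.array([1, 2, 3]), pa.array(['a', 'b', 'c'])]
t = pa.table(arrays, names=['ints', 'strs'])
print(t.schema)  # ints: int64, strs: string
```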
python/pyarrow/parquet.py
Outdated
```diff
@@ -500,49 +502,28 @@ def __str__(self):

         return result

-    def get_metadata(self, open_file_func=None):
```
This was still being used in dask 2.1.0, released a month ago: https://github.com/dask/dask/blob/ed33fbe6ec47e361d1f6f45b84acfe0a98e511ca/dask/dataframe/io/parquet.py#L860. But it's fixed in the latest release, 2.2, released a few days ago. So it's probably fine to remove; just be aware that it was only fixed in dask very recently.
Noted. This was deprecated in 0.13.0, so I think it's OK to remove since we've had two major releases with the deprecated API and warning.
python/pyarrow/parquet.py
Outdated
```diff
-        self.metadata = ParquetFile(f, memory_map=memory_map).metadata
+        self.metadata = ParquetFile(
+            f, memory_map=memory_map,
+            read_dictionary=read_dictionary).metadata
```
Does the read_dictionary setting influence the metadata?
No, I can remove this. It does change the Arrow schema, though.
done
python/pyarrow/parquet.py
Outdated
```diff
@@ -1190,6 +1171,9 @@ def _make_manifest(path_or_paths, fs, pathsep='/', metadata_nthreads=1,
     memory_map : boolean, default True
         If the source is a file path, use a memory map to read file, which can
         improve performance in some environments
+    read_dictionary : list, default None
+        List of column paths to read directly as DictionaryArray. Only supported
```
The "paths" might be a bit confusing for people not familiar with that parquet terminology. Column "names" ?
Hm, I don't think you can use Parquet and hide from this detail. To give an example of what I mean: you have to say field_name.list.item to refer to the inner column of a type like list<string>. I'm open to improving the usability of this, but I don't want to spend a lot of energy on it while we have the Datasets C++ project pending in the near future.
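A sketch of the path syntax described above, assuming a hypothetical file whose schema contains a column f0 of type list<string>:

```python
import pyarrow.parquet as pq

# The inner string values of a list<string> column are addressed by the
# Parquet column path 'f0.list.item', not by the field name 'f0' alone
table = pq.read_table('nested.parquet', read_dictionary=['f0.list.item'])
```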
I think an example explaining this in the Parquet section would already clarify a lot (and be enough; I didn't mean to suggest changing the API itself).
"Column names or paths" would be a good alternative wording that neither hides the format details nor confuses new users.
OK. I'll expand the docstring and give a couple examples
python/pyarrow/tests/test_parquet.py
Outdated
```python
pq.write_to_dataset(table, root_path=str(path))
result = pq.ParquetDataset(path, read_dictionary=['f0']).read()
expected = pa.table([table[0].dictionary_encode()], names=['f0'])
assert result.equals(expected)
```
Does this already work for a partitioned dataset with multiple Parquet files, where the different files might have different sets of unique values?
Each chunk in the table will have a different dictionary, yeah. So there shouldn't be any problem
I'll expand the unit test to check explicitly
Done
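A sketch of what the expanded test exercises, assuming a recent pyarrow where Table.column accepts a column name (the dataset path and values are hypothetical):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Two files in one dataset directory, each with different unique values in 'f0'
pq.write_to_dataset(pa.table({'f0': ['a', 'b', 'a']}), root_path='dataset_root')
pq.write_to_dataset(pa.table({'f0': ['c', 'd']}), root_path='dataset_root')

result = pq.ParquetDataset('dataset_root', read_dictionary=['f0']).read()

# Each chunk carries its own dictionary; no global unification is required
for chunk in result.column('f0').chunks:
    print(chunk.dictionary)
```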
Yeah, I think a list-of-rows should be like a …

I'll fix the Windows DLL symbol visibility issues here shortly. @xhochy do you have any opinions about the API?
xhochy
left a comment
+1, I like this API-wise
```
The ``read_dictionary`` option in ``read_table`` and ``ParquetDataset`` will
cause columns to be read as ``DictionaryArray``, which will become
``pandas.Categorical`` when converted to pandas. This option is only valid for
string and binary column types.
```
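A minimal round-trip sketch of the documented behavior (file and column names are hypothetical):

```python
import pyarrow as pa
import pyarrow.parquet as pq

pq.write_table(pa.table({'f0': ['a', 'b', 'a', 'a']}), 'example.parquet')

table = pq.read_table('example.parquet', read_dictionary=['f0'])
print(table.schema)                    # f0: dictionary<values=string, ...>
print(table.to_pandas()['f0'].dtype)   # category
```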
Is the limitation intended or simply because we only have it implemented for binary columns?
It's only implemented for BYTE_ARRAY columns at the moment. We could expand that, but there is little benefit from a performance/memory-use point of view for the primitive types.
I've also used this (through pandas.Categorical) in the past on date and float types (e.g., in some datasets you can have thousands of products that only take one of 5 prices). This often gave a 4-6x improvement in memory usage for these columns.
(just dropping it here as FYI)
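As a rough illustration of that effect (the numbers here are illustrative, not from those datasets):

```python
import numpy as np
import pandas as pd

# A million rows drawn from only five distinct prices
prices = pd.Series(np.random.choice([9.99, 19.99, 29.99, 49.99, 99.99],
                                    size=1_000_000))
print(prices.memory_usage(deep=True))                     # ~8 MB as float64
print(prices.astype('category').memory_usage(deep=True))  # ~1 MB: int8 codes + 5 values
```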
I see. I'll open a JIRA as a follow-up.
I just pushed a docstring-only fix. Merging this.

Oops, my doc fix broke the Python 2.7 build. I will fix.
I also added support to `pyarrow.table` to invoke `Table.from_arrays` if a list or tuple of arrays is passed. This makes for more natural code IMHO.

Using this option with heavily compressed data results in far less memory use and much better performance. See example benchmarks: https://gist.github.com/wesm/450d85e52844aee685c0680111cbb1d7