
Conversation

@wesm (Member) commented Aug 2, 2019

I also added support to pyarrow.table to invoke Table.from_arrays if a list or tuple of arrays is passed. This makes for more natural code IMHO.
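For illustration, here is a minimal sketch of the convenience this enables (the array contents and column names below are made up):

import pyarrow as pa

# With this change, pa.table dispatches to Table.from_arrays
# when it receives a list or tuple of arrays.
arrays = [pa.array([1, 2, 3]), pa.array(['a', 'b', 'c'])]
table = pa.table(arrays, names=['ints', 'strs'])
assert table.num_columns == 2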

Using the read_dictionary option with heavily compressed data results in far less memory use and much better performance. See the example benchmarks:

https://gist.github.com/wesm/450d85e52844aee685c0680111cbb1d7

@wesm (Member Author) commented Aug 2, 2019

Here's a benchmark showing a very common "worst case" for data that dictionary-encodes very well:

Full notebook https://gist.github.com/wesm/450d85e52844aee685c0680111cbb1d7

  • 1000 unique strings of length 50
  • Total number of rows: 10 million
  • Parquet file is 1.1MB, small enough to fit on a floppy disk

Summary:

  • Using pq.read_table naively causes 516MB of memory consumption. That's almost 500x the size of the Parquet file on disk
  • Using pq.read_table(data, read_dictionary=['f0']) results in only 39MB memory consumption
  • The direct-dictionary read takes 106 ms on average compared with 1.8 seconds on average for the dense decoded case
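As a hedged sketch, the setup above can be reproduced roughly as follows (the file name and exact data generation are assumptions; timings and memory numbers will vary by machine):

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# 1000 unique strings of length 50, sampled out to 10 million rows.
unique = [''.join(np.random.choice(list('abcdefghij'), 50)) for _ in range(1000)]
values = np.random.choice(unique, 10_000_000)
pq.write_table(pa.table([pa.array(values)], names=['f0']), 'dict_bench.parquet')

# Dense decode: every value is materialized as a separate string.
dense = pq.read_table('dict_bench.parquet')

# Direct dictionary read: 'f0' comes back as a DictionaryArray.
encoded = pq.read_table('dict_bench.parquet', read_dictionary=['f0'])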

@jorisvandenbossche (Member) left a comment

Impressive benchmarks!

also added support to pyarrow.table to invoke Table.from_arrays if a list or tuple of arrays is passed. This makes for more natural code IMHO.

I am fine with that; we just need to be aware that this means we can't really add support for a list of rows (there was recently a JIRA about this), and that it deviates from pandas.DataFrame(..) (which treats a list of arrays as a list of rows). But anyway, since Table is a columnar store, it totally makes sense to have a list of columns as the prime use case in pa.table(..).
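To make the contrast concrete, a small sketch of the two behaviors (both are standard pandas/pyarrow APIs):

import pandas as pd
import pyarrow as pa

data = [[1, 2, 3], [4, 5, 6]]

# pandas treats a list of lists as rows: 2 rows x 3 columns.
assert pd.DataFrame(data).shape == (2, 3)

# pa.table treats a list of arrays as columns: 3 rows x 2 columns.
table = pa.table([pa.array(col) for col in data], names=['a', 'b'])
assert (table.num_rows, table.num_columns) == (3, 2)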

@@ -500,49 +502,28 @@ def __str__(self):

return result

def get_metadata(self, open_file_func=None):
Member:

This was still being used in dask 2.1.0, released a month ago: https://github.com/dask/dask/blob/ed33fbe6ec47e361d1f6f45b84acfe0a98e511ca/dask/dataframe/io/parquet.py#L860. But it's fixed in the latest release, 2.2, released a few days ago. So it might be fine to remove; just be aware that it was only fixed in dask very recently.

Member Author:

Noted. This was deprecated in 0.13.0 so I think it's OK to remove since we had 2 major releases with the deprecated API and warning

self.metadata = ParquetFile(f, memory_map=memory_map).metadata
self.metadata = ParquetFile(
f, memory_map=memory_map,
read_dictionary=read_dictionary).metadata
Member:

Does the read_dictionary setting influence the metadata?

Member Author:

No, I can remove this. It does change the Arrow schema, though.
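For context, a minimal sketch of the schema difference (the file and column names are assumptions):

import pyarrow.parquet as pq

# Without read_dictionary, a BYTE_ARRAY column decodes as plain strings.
print(pq.read_table('data.parquet').schema)   # f0: string

# With read_dictionary, the Arrow schema carries a dictionary type.
print(pq.read_table('data.parquet', read_dictionary=['f0']).schema)
# f0: dictionary<values=string, indices=int32, ordered=0>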

Member Author:

done

@@ -1190,6 +1171,9 @@ def _make_manifest(path_or_paths, fs, pathsep='/', metadata_nthreads=1,
memory_map : boolean, default True
If the source is a file path, use a memory map to read file, which can
improve performance in some environments
read_dictionary : list, default None
List of column paths to read directly as DictionaryArray. Only supported
Member:

The "paths" might be a bit confusing for people not familiar with that parquet terminology. Column "names" ?

@wesm (Member Author), Aug 2, 2019:

Hm. I don't think you can use Parquet and hide from this detail. To give an example of what I mean, you have to say field_name.list.item to refer to the inner column for a type like list<string>. I'm open to improving the usability of this, but I don't want to spend a lot of energy on it while we have the Datasets C++ project pending in the near future.
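A sketch of that path convention in practice (the file and column names here are made up; this assumes dictionary reads apply to the nested string values):

import pyarrow as pa
import pyarrow.parquet as pq

# A table with a list<string> column named 'tags'.
table = pa.table([pa.array([['a', 'b'], ['a']])], names=['tags'])
pq.write_table(table, 'nested.parquet')

# The inner strings are addressed by the Parquet column path
# 'tags.list.item', not by the top-level field name 'tags'.
result = pq.read_table('nested.parquet', read_dictionary=['tags.list.item'])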

Member:

I think an example explaining this in the Parquet section would already clarify a lot (and be enough; I didn't mean to suggest changing the API itself).

Member:

"Column names or paths" should be a good alternative that neither hides the format details nor confuses new users.

Member Author:

OK. I'll expand the docstring and give a couple examples

pq.write_to_dataset(table, root_path=str(path))
result = pq.ParquetDataset(path, read_dictionary=['f0']).read()
expected = pa.table([table[0].dictionary_encode()], names=['f0'])
assert result.equals(expected)
Member:

Does this already work for a partitioned dataset with multiple Parquet files, where the different files might have different sets of unique values?

Member Author:

Each chunk in the table will have a different dictionary, yeah. So there shouldn't be any problem
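A minimal sketch of that per-chunk behavior (file names are hypothetical):

import pyarrow as pa
import pyarrow.parquet as pq

# Two files whose 'f0' columns have disjoint sets of unique values.
pq.write_table(pa.table([pa.array(['a', 'b'])], names=['f0']), 'part0.parquet')
pq.write_table(pa.table([pa.array(['c', 'd'])], names=['f0']), 'part1.parquet')

result = pq.ParquetDataset(['part0.parquet', 'part1.parquet'],
                           read_dictionary=['f0']).read()

# Each chunk of the resulting ChunkedArray carries its own dictionary.
for chunk in result.column('f0').chunks:
    print(chunk.dictionary)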

Member Author:

I'll expand the unit test to check explicitly

Member Author:

Done

@wesm (Member Author) commented Aug 2, 2019

I am fine with that; we just need to be aware that this means we can't really add support for a list of rows (there was recently a JIRA about this), and that it deviates from pandas.DataFrame(..) (which treats a list of arrays as a list of rows). But anyway, since Table is a columnar store, it totally makes sense to have a list of columns as the prime use case in pa.table(..).

Yeah, I think a list-of-rows should be like a Table.from_records or similar.

@wesm (Member Author) commented Aug 5, 2019

I'll fix the Windows DLL symbol visibility issues here shortly. @xhochy do you have any opinions about the API?

@xhochy (Member) left a comment

+1, I like this API-wise


The ``read_dictionary`` option in ``read_table`` and ``ParquetDataset`` will
cause columns to be read as ``DictionaryArray``, which will become
``pandas.Categorical`` when converted to pandas. This option is only valid for
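A short sketch of the conversion that passage describes (the file and column names are assumptions):

import pyarrow.parquet as pq

table = pq.read_table('data.parquet', read_dictionary=['f0'])

# The DictionaryArray column converts to a pandas Categorical.
df = table.to_pandas()
print(df['f0'].dtype)   # category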
Member:

Is the limitation intended or simply because we only have it implemented for binary columns?

Member Author:

It's only implemented for BYTE_ARRAY columns at the moment. We could expand that but there is little benefit from a performance/memory use point of view for the primitive types

Member:

I've also used this (through pandas.Categorical) in the past on date and float types (e.g. in some datasets you can have thousands of products that only have one of 5 prices). This often gave a 4-6x improvement in memory usage for these columns.

Member:

(Just dropping it here as an FYI.)

Member Author:

I see. I'll open a JIRA as a follow-up.


@wesm (Member Author) commented Aug 5, 2019

I just pushed a docstring-only fix. Merging this.

@wesm closed this in 7aefa50 on Aug 5, 2019
@wesm deleted the ARROW-3325 branch on August 5, 2019 18:13
@wesm (Member Author) commented Aug 5, 2019

Oops, my doc fix broke the Python 2.7 build. I will fix.
