
Conversation

@wesm (Member) commented Aug 2, 2019

I also added support to pyarrow.table to invoke Table.from_arrays if a list or tuple of arrays is passed. This makes for more natural code IMHO.
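For illustration, here is a minimal sketch of the convenience this enables (the array contents and column names below are made up):

import pyarrow as pa

# With this change, pa.table dispatches to Table.from_arrays
# when it receives a list or tuple of arrays.
arrays = [pa.array([1, 2, 3]), pa.array(['a', 'b', 'c'])]
table = pa.table(arrays, names=['ints', 'strs'])
assert table.num_columns == 2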

Using the read_dictionary option with heavily compressed data results in far less memory use and much better performance. See the example benchmarks:

https://gist.github.com/wesm/450d85e52844aee685c0680111cbb1d7

@wesm (Member Author) commented Aug 2, 2019

Here's a benchmark showing a very common "worst case" for data that dictionary-encodes very well:

Full notebook https://gist.github.com/wesm/450d85e52844aee685c0680111cbb1d7

  • 1000 unique strings of length 50
  • Total number of rows: 10 million
  • Parquet file is 1.1MB, small enough to fit on a floppy disk

Summary:

  • Using pq.read_table naively causes 516MB of memory consumption. That's almost 500x the size of the Parquet file on disk
  • Using pq.read_table(data, read_dictionary=['f0']) results in only 39MB memory consumption
  • The direct-dictionary read takes 106 ms on average compared with 1.8 seconds on average for the dense decoded case
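As a hedged sketch, the setup above can be reproduced roughly as follows (the file name and exact data generation are assumptions; timings and memory numbers will vary by machine):

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# 1000 unique strings of length 50, sampled out to 10 million rows.
unique = [''.join(np.random.choice(list('abcdefghij'), 50)) for _ in range(1000)]
values = np.random.choice(unique, 10_000_000)
pq.write_table(pa.table([pa.array(values)], names=['f0']), 'dict_bench.parquet')

# Dense decode: every value is materialized as a separate string.
dense = pq.read_table('dict_bench.parquet')

# Direct dictionary read: 'f0' comes back as a DictionaryArray.
encoded = pq.read_table('dict_bench.parquet', read_dictionary=['f0'])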

@jorisvandenbossche (Member) left a comment

Impressive benchmarks!

also added support to pyarrow.table to invoke Table.from_arrays if a list or tuple of arrays is passed. This makes for more natural code IMHO.

I am fine with that; we just need to be aware that this means we can't really add support for a list of rows (there was recently a JIRA about this), and that it deviates from pandas.DataFrame(..) (which treats a list of arrays as a list of rows). But anyway, since Table is a columnar store, it totally makes sense to have a list of columns as the prime use case in pa.table(..).
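To make the contrast concrete, a small sketch of the two behaviors (both are standard pandas/pyarrow APIs):

import pandas as pd
import pyarrow as pa

data = [[1, 2, 3], [4, 5, 6]]

# pandas treats a list of lists as rows: 2 rows x 3 columns.
assert pd.DataFrame(data).shape == (2, 3)

# pa.table treats a list of arrays as columns: 3 rows x 2 columns.
table = pa.table([pa.array(col) for col in data], names=['a', 'b'])
assert (table.num_rows, table.num_columns) == (3, 2)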

@@ -500,49 +502,28 @@ def __str__(self):

return result

def get_metadata(self, open_file_func=None):
Member:

This was still being used in dask 2.1.0, released a month ago: https://github.com/dask/dask/blob/ed33fbe6ec47e361d1f6f45b84acfe0a98e511ca/dask/dataframe/io/parquet.py#L860. But it's fixed in the latest release, 2.2, released a few days ago. So it might be fine to remove; just be aware that it was only fixed in dask very recently.

Member Author:

Noted. This was deprecated in 0.13.0 so I think it's OK to remove since we had 2 major releases with the deprecated API and warning

self.metadata = ParquetFile(f, memory_map=memory_map).metadata
self.metadata = ParquetFile(
f, memory_map=memory_map,
read_dictionary=read_dictionary).metadata
Member:

Does the read_dictionary setting influence the metadata?

Member Author:

No, I can remove this. It does change the Arrow schema, though.
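For context, a minimal sketch of the schema difference (the file and column names are assumptions):

import pyarrow.parquet as pq

# Without read_dictionary, a BYTE_ARRAY column decodes as plain strings.
print(pq.read_table('data.parquet').schema)   # f0: string

# With read_dictionary, the Arrow schema carries a dictionary type.
print(pq.read_table('data.parquet', read_dictionary=['f0']).schema)
# f0: dictionary<values=string, indices=int32, ordered=0>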

Member Author:

done

@@ -1190,6 +1171,9 @@ def _make_manifest(path_or_paths, fs, pathsep='/', metadata_nthreads=1,
memory_map : boolean, default True
If the source is a file path, use a memory map to read file, which can
improve performance in some environments
read_dictionary : list, default None
List of column paths to read directly as DictionaryArray. Only supported
Member:

The "paths" might be a bit confusing for people not familiar with that parquet terminology. Column "names" ?

@wesm (Member Author), Aug 2, 2019:

Hm. I don't think you can use Parquet and hide from this detail. To give an example of what I mean, you have to say field_name.list.item to refer to the inner column for a type like list<string>. I'm open to improving the usability of this, but I don't want to spend a lot of energy on it while we have the Datasets C++ project pending in the near future.
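A sketch of that path convention in practice (the file and column names here are made up; this assumes dictionary reads apply to the nested string values):

import pyarrow as pa
import pyarrow.parquet as pq

# A table with a list<string> column named 'tags'.
table = pa.table([pa.array([['a', 'b'], ['a']])], names=['tags'])
pq.write_table(table, 'nested.parquet')

# The inner strings are addressed by the Parquet column path
# 'tags.list.item', not by the top-level field name 'tags'.
result = pq.read_table('nested.parquet', read_dictionary=['tags.list.item'])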

Member:

I think an example explaining this in the Parquet section would already clarify a lot (and be enough; I didn't mean to suggest changing the API itself).

Member:

"Column names or paths" should be a good alternative that neither hides the format details nor confuses new users.

Member Author:

OK. I'll expand the docstring and give a couple examples

pq.write_to_dataset(table, root_path=str(path))
result = pq.ParquetDataset(path, read_dictionary=['f0']).read()
expected = pa.table([table[0].dictionary_encode()], names=['f0'])
assert result.equals(expected)
Member:

Does this already work for a partitioned dataset with multiple Parquet files, where the different files might have different sets of unique values?

Member Author:

Each chunk in the table will have a different dictionary, yeah. So there shouldn't be any problem
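A minimal sketch of that per-chunk behavior (file names are hypothetical):

import pyarrow as pa
import pyarrow.parquet as pq

# Two files whose 'f0' columns have disjoint sets of unique values.
pq.write_table(pa.table([pa.array(['a', 'b'])], names=['f0']), 'part0.parquet')
pq.write_table(pa.table([pa.array(['c', 'd'])], names=['f0']), 'part1.parquet')

result = pq.ParquetDataset(['part0.parquet', 'part1.parquet'],
                           read_dictionary=['f0']).read()

# Each chunk of the resulting ChunkedArray carries its own dictionary.
for chunk in result.column('f0').chunks:
    print(chunk.dictionary)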

Member Author:

I'll expand the unit test to check explicitly

Member Author:

Done

@wesm (Member Author) commented Aug 2, 2019

I am fine with that; we just need to be aware that this means we can't really add support for a list of rows (there was recently a JIRA about this), and that it deviates from pandas.DataFrame(..) (which treats a list of arrays as a list of rows). But anyway, since Table is a columnar store, it totally makes sense to have a list of columns as the prime use case in pa.table(..).

Yeah, I think a list-of-rows should be like a Table.from_records or similar.

@wesm (Member Author) commented Aug 5, 2019

I'll fix the Windows DLL symbol visibility issues here shortly. @xhochy do you have any opinions about the API?

@xhochy (Member) left a comment

+1, I like this API-wise


The ``read_dictionary`` option in ``read_table`` and ``ParquetDataset`` will
cause columns to be read as ``DictionaryArray``, which will become
``pandas.Categorical`` when converted to pandas. This option is only valid for
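A short sketch of the conversion that passage describes (the file and column names are assumptions):

import pyarrow.parquet as pq

table = pq.read_table('data.parquet', read_dictionary=['f0'])

# The DictionaryArray column converts to a pandas Categorical.
df = table.to_pandas()
print(df['f0'].dtype)   # category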
Member:

Is the limitation intended or simply because we only have it implemented for binary columns?

Member Author:

It's only implemented for BYTE_ARRAY columns at the moment. We could expand that but there is little benefit from a performance/memory use point of view for the primitive types

Member:

I've also used this (through pandas.Categorical) in the past on date and float types (e.g. in some datasets you can have thousands of products that only have one of 5 prices). This often gave a 4-6x improvement in memory usage for these columns.

Member:

(Just dropping it here as an FYI.)

Member Author:

I see. I'll open a JIRA as a follow-up.


@wesm (Member Author) commented Aug 5, 2019

I just pushed a docstring-only fix. Merging this.

@wesm closed this in 7aefa50 on Aug 5, 2019
@wesm deleted the ARROW-3325 branch on August 5, 2019 18:13
@wesm (Member Author) commented Aug 5, 2019

Oops, my doc fix broke the Python 2.7 build. I will fix.
