ARROW-8039: [Python] Use dataset API in existing parquet readers and tests #6303
Conversation
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename the pull request title in the following format? See also:
bkietz left a comment
Looks great, a few typos/questions.
Personally, I'd prefer to flip the condition from use_dataset=False to use_legacy_dataset=True
python/pyarrow/tests/test_parquet.py
This seems like it could be fixed just for ParquetDatasetV2 by deduplicating the column names; is that unfavorable?
Yes, that should be easy to do.
But I would prefer that we first decide what we want for the new Datasets API (deduplicate the passed columns, or return the duplicated columns) and follow that here; otherwise it creates an inconsistency between this and the pyarrow.dataset API. So I left it as a TODO for now (the TODO being to bring this up and take a decision).
I think we will not deduplicate the column names in C++ @fsaintjacques
Fields within a schema may have duplicated field names so it seems unlikely that we'll move amenities like deduplication to the lowest level.
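(For illustration, a minimal sketch of the column-name deduplication being discussed here, assuming it were done on the Python side; this is not the behavior that was decided on.)

```python
def deduplicate_columns(columns):
    # Keep the first occurrence of each requested column name, preserving order.
    return list(dict.fromkeys(columns))

assert deduplicate_columns(["a", "b", "a", "c"]) == ["a", "b", "c"]
```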
python/pyarrow/tests/test_parquet.py
Is this critical? This would be fairly easy to recover with an option like discover_dictionaries in partition schema discovery
Similar to the above: I would prefer that we first decide what we want long term, rather than exactly mimicking the old API (certainly given that it is opt-in for now).
Given the variable dictionary support we have now, returning dictionary encoded fields shouldn't (in principle?) be too hard and also should be faster than duplicating a string many times. In the worst case, a record batch would have a field having all 0 indices and a dictionary with a single value.
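(As a small illustration of the point above, not part of the PR itself: dictionary encoding stores each distinct value once plus integer indices.)

```python
import pyarrow as pa

# A partition column materialized as plain strings repeats the value per row...
plain = pa.array(["2020-01"] * 5)

# ...while its dictionary-encoded form stores the value once plus small indices.
encoded = plain.dictionary_encode()
print(encoded.type)        # dictionary<values=string, indices=int32, ordered=0>
print(encoded.indices)     # all indices point at the single dictionary entry
print(encoded.dictionary)  # ["2020-01"]
```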
I like that suggestion.
Working on reviewing this, will get you feedback as soon as I can.
python/pyarrow/tests/test_parquet.py
We'll have to discuss how to expose metadata in pyarrow.dataset. IMO this should be a property of ParquetFileFragment and not of a Dataset
We'll have to discuss how to expose metadata in pyarrow.dataset. IMO this should be a property of ParquetFileFragment and not of a Dataset
Yes, indeed, this could be something specific on the ParquetFileFragment, although a ParquetDataset can still have a "metadata" that maps to all metadata of all files/row groups, if available (eg if the dataset was created from a _metadata file with all this information).
And with the future work on storing statistics on fragments, we might also get a parquet-independent access to part of the metadata.
_metadata serves as a consolidated mapping from paths or row groups to metadata, so I'd say that even in that case the metadata derived from it is a per-Fragment property.
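(For context, a hedged sketch of how per-file Parquet metadata is accessed through pyarrow.parquet today; the path is hypothetical, and how this would surface on ParquetFileFragment was still an open question at this point.)

```python
import pyarrow.parquet as pq

# Per-file Parquet metadata (row groups, statistics, schema) for a single file.
md = pq.read_metadata("dataset/part-0.parquet")  # hypothetical path
print(md.num_rows, md.num_row_groups)

# The same object is available from an opened file.
print(pq.ParquetFile("dataset/part-0.parquet").metadata)
```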
wesm left a comment
This is good progress, thank you all for working on this. I'm not sure what to do longer term about this test suite, which was a bit of a rat's nest already
Can we collect somewhere a list of what features of the old ParquetDataset are not supported, and then we can decide what must be implemented or what will be left on the chopping block?
python/pyarrow/parquet.py
Having all the arguments and their defaults duplicated here and then the additional checking in ParquetDatasetV2 is a bit unsatisfying. Is there a better way?
The __new__ was needed (as far as I remember and understand) to get pickling to work for the old ParquetDataset
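(A hedged sketch of the __new__-based dispatch being discussed, with illustrative names rather than the actual pyarrow source; keeping the constructor arguments optional also keeps cls.__new__(cls), which pickle calls, working.)

```python
class _ParquetDatasetV2:
    # Stand-in for the dataset-API-backed implementation.
    def __init__(self, path_or_paths=None, **kwargs):
        self.path = path_or_paths


class ParquetDataset:
    def __new__(cls, path_or_paths=None, use_legacy_dataset=True, **kwargs):
        if not use_legacy_dataset:
            # Returning a non-subclass instance means ParquetDataset.__init__
            # is never run; the V2 class fully constructs itself.
            return _ParquetDatasetV2(path_or_paths, **kwargs)
        return super().__new__(cls)

    def __init__(self, path_or_paths=None, use_legacy_dataset=True, **kwargs):
        self.path = path_or_paths
```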
python/pyarrow/parquet.py
May want to try to extract this and the same section from the legacy reader (can do so later)
python/pyarrow/parquet.py
Is discovery / metadata analysis multithreaded by default?
Thanks for the feedback! Some non-inline responses for easier visibility:
Splitting it into multiple files might also help (e.g. pure parquet tests (ParquetFile, metadata, statistics) vs parquet dataset tests). And at some point, if we can remove the old dataset code, that will also help of course ;)
Yes, will do.
Right now, this is not multithreaded (also not optional). There is a JIRA for this: https://issues.apache.org/jira/browse/ARROW-8137
In the datasets API, there is an option to control this. The main problem here is that in the new datasets API, no deterministic row order (order of the record batches) is guaranteed when using multithreading. So for testing equality of tables, we turned multithreading off for now (see e.g. https://issues.apache.org/jira/browse/ARROW-7719, and #6319 for a PR that fixed failing tests due to this in test_dataset.py); a small sketch follows below. Personally, I would prefer to see deterministic results in the Datasets API as well (IMO this is what users will expect, although this can also be handled to some extent by the library that uses this code, e.g. dask), but there was some disagreement about this. I don't think this discussion was captured in a JIRA (except the one about failing tests).
It indeed should be possible (@bkietz fixed a bug preventing this), so you can already manually specify the partitioning schema with dictionary types, and that works now. But we could also have an option to automatically do this?
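(Sketches of the two points above, assuming the pyarrow.dataset API; the use_threads and infer_dictionary option names and the path are assumptions on my part and may differ between versions.)

```python
import pyarrow.dataset as ds

# Disable multithreading to get a deterministic row order when comparing
# tables in tests (assumes a use_threads flag on to_table).
dataset = ds.dataset("dataset_root", format="parquet")  # hypothetical path
table = dataset.to_table(use_threads=False)

# Ask partition discovery to produce dictionary-encoded partition fields
# instead of plain strings (assumes an infer_dictionary discovery option).
part = ds.HivePartitioning.discover(infer_dictionary=True)
dataset = ds.dataset("dataset_root", format="parquet", partitioning=part)
```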
bkietz left a comment
nit
All green here.
python/pyarrow/parquet.py
This function also supports passing in as List[Tuple]. These predicates
are evaluated as a conjunction. To express OR in predicates, one must
use the (preferred) List[List[Tuple]] notation.
implements partition-level (hive) filtering, i.e., to prevent the
I find the hive mention a bit confusing here. Could you add a clearer note on what the user can filter on, depending on use_legacy_dataset?
👍
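(A hedged example of the List[List[Tuple]] filter notation from the docstring excerpt above; the path and column names are hypothetical.)

```python
import pyarrow.parquet as pq

# DNF filters: the outer list is OR, each inner list is AND.
# (year == 2020 AND month > 6) OR (year == 2021)
table = pq.read_table(
    "dataset_root",              # hypothetical path
    use_legacy_dataset=False,
    filters=[
        [("year", "=", 2020), ("month", ">", 6)],
        [("year", "=", 2021)],
    ],
)
```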
    "Dataset API".format(keyword))

# map old filesystems to new one
# TODO(dataset) deal with other file systems
Would it be hard to map HadoopFileSystem and S3FSWrapper as well?
Either way, please file a JIRA ticket for it.
The question might also be whether we actually want to add such a mapping (for LocalFileSystem I mainly did it to get the tests passing right now).
In the end, we also want people to switch to the new filesystems, so maybe we should rather give that message (certainly if they opt in with use_legacy_dataset=False, they can also use the new filesystem objects?).
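(A small sketch of that suggestion, assuming read_table accepts the new filesystem objects when opting in; the path is hypothetical.)

```python
import pyarrow.fs
import pyarrow.parquet as pq

# Pass one of the new pyarrow.fs filesystems directly instead of relying on
# a mapping from the legacy filesystem classes.
fs = pyarrow.fs.LocalFileSystem()
table = pq.read_table("dataset_root", filesystem=fs, use_legacy_dataset=False)
```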
kszucs left a comment
Just minor comments, nicely done @jorisvandenbossche!
bkietz left a comment
Just a few suggestions which might improve clarity
# map old filesystems to new one
# TODO(dataset) deal with other file systems
if isinstance(filesystem, LocalFileSystem):
    filesystem = pyarrow.fs.LocalFileSystem(use_mmap=memory_map)
For a follow up ticket, maybe memory_map should default to None so that we can reconcile it with an instance of pa.fs.LocalFileSystem
Suggested change:
filesystem = pyarrow.fs.LocalFileSystem(use_mmap=memory_map)
elif isinstance(filesystem, pyarrow.fs.LocalFileSystem):
    assert memory_map is None
?
maybe memory_map should default to None so that we can reconcile it with an instance of pa.fs.LocalFileSystem
Yeah, if we keep it as an option to the filesystem, that sounds like a good idea. But there has also been some discussion recently about not tying it to the filesystem.
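(A hedged sketch of the reconciliation idea from this thread, not the actual pyarrow code: with memory_map defaulting to None, it only applies when the filesystem is constructed here.)

```python
import pyarrow.fs


def _ensure_new_filesystem(filesystem=None, memory_map=None):
    # An explicitly passed new-style LocalFileSystem is used as-is; combining
    # it with memory_map would be ambiguous, so that is rejected.
    if isinstance(filesystem, pyarrow.fs.LocalFileSystem):
        if memory_map is not None:
            raise ValueError("pass either memory_map or a LocalFileSystem instance")
        return filesystem
    # Only when no filesystem is given do we construct one, honoring memory_map.
    if filesystem is None:
        return pyarrow.fs.LocalFileSystem(use_mmap=bool(memory_map))
    return filesystem
```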
bkietz left a comment
Thanks again @jorisvandenbossche!
merging
I finally listed the open TODO items from the discussions in this PR / the skipped tests, and opened JIRAs where this was not yet the case:
This is testing to optionally use the dataset API in the pyarrow parquet reader implementation (read_table and ParquetDataset().read()). Currently, it is enabled by passing use_legacy_dataset=False (the mechanism to opt in is to be discussed), which allows running our existing parquet tests with this (the approach I now took is to parametrize the existing tests for use_legacy_dataset True/False). This allows users to opt in through either read_table or ParquetDataset (see the sketch below), with the idea that at some point the default for use_legacy_dataset would switch from True to False.
Long term, I think we certainly want to keep pq.read_table (and I think we will also be able to support most of its keywords). The future for pq.ParquetDataset is less clear (it has a lot of API that is tied to the Python implementation, e.g. the ParquetDatasetPiece, PartitionSet, ParquetPartitions, ... classes). We probably want to move people towards a "new" ParquetDataset that is more consistent with the new general datasets API. Therefore, right now ParquetDataset(use_legacy_dataset=False) does not yet try to provide all those features, but just the read() method. We can later see which extra features we add and how advanced users of ParquetDataset can move to the new API.
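For reference, a minimal sketch of the opt-in described above (the "dataset_root" path is hypothetical):

```python
import pyarrow.parquet as pq

# Read through the new dataset API instead of the legacy Python implementation.
table = pq.read_table("dataset_root", use_legacy_dataset=False)

# Or, equivalently, through ParquetDataset:
table = pq.ParquetDataset("dataset_root", use_legacy_dataset=False).read()
```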