ARROW-8644: [Python] Restore ParquetDataset behaviour to always include partition column for dask compatibility #7096

jorisvandenbossche · 2020-05-04T12:14:41Z

Given that the original change (https://issues.apache.org/jira/browse/ARROW-3861 / #7050) breaks dask's reading of partitioned datasets (it doesn't add the partition column to the list of columns to read, but expects it will still be read automatically), it doesn't seem worth it to me to fix this in the "old" ParquetDataset implementation.

But we can keep the "correct" behaviour in the Datasets API - based implementation going forward.
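To make the difference between the two behaviours concrete, here is a minimal sketch in plain Python. It is a toy model only, not pyarrow's implementation: the helper names `parse_hive_partitions` and `columns_to_read`, the `always_include_partitions` flag, and the example path are all hypothetical and exist purely for illustration. It assumes a hive-style `key=value` directory layout.

```python
from typing import Dict, List, Optional


def parse_hive_partitions(path: str) -> Dict[str, str]:
    """Extract key=value directory segments from a hive-style file path."""
    parts: Dict[str, str] = {}
    for segment in path.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key] = value
    return parts


def columns_to_read(
    requested: Optional[List[str]],
    file_columns: List[str],
    partition_keys: List[str],
    always_include_partitions: bool,
) -> List[str]:
    """Toy model of column selection (hypothetical, not pyarrow code).

    always_include_partitions=True mimics the restored legacy behaviour:
    partition columns are returned even when the caller did not list them.
    always_include_partitions=False mimics the stricter behaviour: only the
    explicitly requested columns are returned.
    """
    if requested is None:
        # No selection given: return everything.
        return list(file_columns) + list(partition_keys)
    selected = [c for c in requested if c in file_columns or c in partition_keys]
    if always_include_partitions:
        selected += [k for k in partition_keys if k not in selected]
    return selected


keys = list(parse_hive_partitions("data/year=2020/month=5/part-0.parquet"))
legacy = columns_to_read(["value"], ["id", "value"], keys, True)
strict = columns_to_read(["value"], ["id", "value"], keys, False)
print(legacy)  # ['value', 'year', 'month']
print(strict)  # ['value']
```

A caller like dask that passes only `["value"]` and still expects `year` and `month` in the result works under the first mode and breaks under the second, which is the incompatibility described above.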

github-actions · 2020-05-04T12:16:33Z

https://issues.apache.org/jira/browse/ARROW-8644

jorisvandenbossche · 2020-05-04T12:25:38Z

@github-actions crossbow submit test-conda-python-3.7-dask-latest test-conda-python-3.8-dask-master

github-actions · 2020-05-04T12:26:53Z

Revision: 3e480a9

Submitted crossbow builds: ursa-labs/crossbow @ actions-200

Task	Status
test-conda-python-3.7-dask-latest
test-conda-python-3.8-dask-master

jorisvandenbossche · 2020-05-04T18:40:04Z

So the question comes up whether we should actually have the same behaviour in the case of use_legacy_dataset=False (the _ParquetDatasetV2 shim).

For me, that depends a bit on what we want to do long term with ParquetDataset. If we want to keep it as "the" ParquetDataset (maybe becoming a subclass of the actual Dataset class then), then I think it should have the "correct" behaviour. If we only see it as a temporary vehicle to get people to try it out and eventually move to the pyarrow.dataset API, then it is less important to fix it.

wesm · 2020-05-04T23:39:18Z

This is a regression? If so can you mark it with 0.17.1 (and 1.0.0)

jorisvandenbossche · 2020-05-05T06:28:59Z

No, the PR that caused the regression was only merged after 0.17

wesm · 2020-05-05T21:43:03Z

> For me, that depends a bit on what we want to do long term with ParquetDataset.

IMHO we should be trying to transition people off of this, so that the Parquet format isn't necessarily treated separately from general datasets (which may contain files in different formats). We can keep discussing this.

ARROW-8644: [Python] Restore ParquetDataset behaviour to always inclu…

3e480a9

…de partition column for dask compatibility

bkietz approved these changes May 4, 2020

add note about different behaviour

6ceaf67

wesm closed this in fb4d57a May 5, 2020

asfimport mentioned this pull request May 5, 2020

[Python] Dask integration tests failing due to change in not including partition columns #24805

Closed
