Skip to content

Conversation

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented May 4, 2020

Given that the original change (https://issues.apache.org/jira/browse/ARROW-3861 / #7050) breaks dask's reading of partitioned datasets (it doesn't add the partition column to the list of columns to read, but expects it will still be read automatically), it doesn't seem worth it to me to fix this in the "old" ParquetDataset implementation.

But we can keep the "correct" behaviour in the Datasets API - based implementation going forward.

@github-actions
Copy link

github-actions bot commented May 4, 2020

@jorisvandenbossche
Copy link
Member Author

@github-actions crossbow submit test-conda-python-3.7-dask-latest test-conda-python-3.8-dask-master

@github-actions
Copy link

github-actions bot commented May 4, 2020

Revision: 3e480a9

Submitted crossbow builds: ursa-labs/crossbow @ actions-200

Task Status
test-conda-python-3.7-dask-latest CircleCI
test-conda-python-3.8-dask-master CircleCI

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented May 4, 2020

So the question comes up if we actually should have the same behaviour in case of use_legacy_dataset=False (the _ParquetDatasetV2 shim).

For me, that depends a bit on what we want to do long term with ParquetDataset. If we want to keep it as "the" ParquetDataset (maybe becoming a subclass of the actual Dataset class then), then I think it should have the "correct" behaviour. If we only see it as a temporary vehicle to get people try it out / have poeple eventually use the pyarrow.dataset API, then it is less important to fix it.

@wesm
Copy link
Member

wesm commented May 4, 2020

This is a regression? If so can you mark it with 0.17.1 (and 1.0.0)

@jorisvandenbossche
Copy link
Member Author

No, the PR that caused the regression was only merged after 0.17

@wesm
Copy link
Member

wesm commented May 5, 2020

For me, that depends a bit on what we want to do long term with ParquetDataset.

IMHO we should be trying to transition people off of this, so that Parquet format isn't necessarily treated separately from general datasets (which may contain different format files). We can keep discussing this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants