Closed
Description
Assemble a minimal ParquetDataset shim backed by pyarrow.dataset.*. Replace the existing ParquetDataset with the shim by default, and allow an opt-out for users who need the current ParquetDataset.
This is mostly exploratory, to see which of the Python tests fail.
Reporter: Ben Kietzman / @bkietz
Assignee: Joris Van den Bossche / @jorisvandenbossche
Related issues:
- [Python] More graceful reading of empty String columns in ParquetDataset (is depended upon by)
- [Python][Parquet][C++] Null values in a single partition of Parquet dataset, results in invalid schema on read (is depended upon by)
- [Python] Datatypes not preserved for partition fields in roundtrip to partitioned parquet dataset (is depended upon by)
- [Python] ParquetDataset().read columns argument always returns partition column (is depended upon by)
- [Python] Underscores in partition (string) values are dropped when reading dataset (is depended upon by)
- [Python] better error message on creating ParquetDataset from empty directory (is depended upon by)
- [Python] raise error message when passing invalid filter in parquet reading (is depended upon by)
- [C++][Python] Support AWS Firehose partition_scheme implementation for Parquet datasets (is depended upon by)
- [C++][Dataset] Automatically detect boolean partition columns (is depended upon by)
- [Python][C++][Parquet] Support reading Parquet files having a permutation of column order (is depended upon by)
- [Python] Improved workflow for loading an arbitrary collection of Parquet files (is depended upon by)
- [Python] RowGroup filtering on file level (is depended upon by)
PRs and other links:
Note: This issue was originally created as ARROW-8039. Please see the migration documentation for further details.