Description
ARROW-8058 added options related to schema inference / validation for the Dataset factory. We should expose this in Python in the dataset(..) factory function:
- Add ability to pass a user-specified schema with a `schema` keyword, instead of inferring the schema from (one of) the files (to be passed to the factory finish method)
- Add `validate_schema` option to toggle whether the schema is validated against the actual files or not.
- Expose in some way the number of fragments to be inspected when inferring or validating the schema. Not sure yet what the best API for this would be.
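To make the three options concrete, here is a rough pure-Python sketch of the semantics being proposed. It is not pyarrow code: the `finish` function, the `inspect_fragments` parameter name, and the dict-based stand-in for a schema are all hypothetical, and the validation rule (fields present in both the fragment and the schema must have the same type) is an assumption made for illustration only.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for an Arrow schema: column name -> type name.
Schema = Dict[str, str]


def unify(schemas: List[Schema]) -> Schema:
    """Merge schemas field by field, failing on conflicting types."""
    merged: Schema = {}
    for s in schemas:
        for name, typ in s.items():
            if merged.setdefault(name, typ) != typ:
                raise ValueError(
                    f"conflicting types for field {name!r}: {merged[name]} vs {typ}"
                )
    return merged


def finish(
    fragment_schemas: List[Schema],
    schema: Optional[Schema] = None,
    validate_schema: bool = True,
    inspect_fragments: Optional[int] = 1,
) -> Schema:
    """Hypothetical factory 'finish' step (not the pyarrow API).

    schema: user-specified schema; if None, it is inferred.
    validate_schema: check the given schema against the inspected fragments.
    inspect_fragments: number of fragments to look at; None means all.
    """
    if inspect_fragments is None:
        inspected = fragment_schemas
    else:
        inspected = fragment_schemas[:inspect_fragments]

    if schema is None:
        # No schema given: infer it from the inspected fragments only.
        return unify(inspected)

    if validate_schema:
        # Assumed rule: shared fields must agree on type; missing fields are allowed.
        for i, frag in enumerate(inspected):
            for name, typ in frag.items():
                if name in schema and schema[name] != typ:
                    raise ValueError(
                        f"fragment {i}: field {name!r} is {typ}, expected {schema[name]}"
                    )
    return dict(schema)


fragments = [
    {"id": "int64"},
    {"id": "int64", "name": "string"},
    {"id": "string"},  # schema changed in a later file
]

inferred_first = finish(fragments)                       # looks at one fragment
inferred_two = finish(fragments, inspect_fragments=2)    # looks at two fragments
unchecked = finish(fragments, schema={"id": "int64"}, validate_schema=False)
```

In this sketch, inspecting only the first fragment misses both the extra `name` column and the later type change, which is the trade-off the third point is about: inspecting more fragments is more thorough but costs more I/O.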
Some relevant notes from the original PR: ARROW-8058: [Dataset] Relax DatasetFactory discovery validation #6687 (comment)
Reporter: Joris Van den Bossche / @jorisvandenbossche
Subtasks:
Related issues:
- [Python][Parquet] improve reading of partitioned parquet datasets whose schema changed (duplicates)
- [Python] add option for taking all columns from all files in pa.dataset (duplicates)
- ValueError: Keyword 'validate_schema' is not yet supported with the new Dataset API (is duplicated by)
PRs and other links:
Note: This issue was originally created as ARROW-8221. Please see the migration documentation for further details.