[Python][Dataset] Expose schema inference / validation options in the factory #24418

@asfimport

ARROW-8058 added options related to schema inference and validation to the Dataset factory. We should expose these in Python through the dataset(..) factory function:

  • Add a schema keyword for passing a user-specified schema, instead of inferring the schema from (one of) the files. The schema would be passed on to the factory's finish method.

  • Add a validate_schema option to toggle whether the schema is validated against the actual files.

  • Expose, in some way, the number of fragments to inspect when inferring or validating the schema. It is not yet clear what the best API for this would be.

    Some relevant notes from the original PR: ARROW-8058: [Dataset] Relax DatasetFactory discovery validation #6687 (comment)
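
To make the three proposed options concrete, here is a minimal pure-Python sketch of how they could interact. It is a toy model, not the pyarrow API: schemas are plain dicts mapping field names to type names, and the `finish` function, the `inspect_fragments` keyword, and the rule that a fragment may omit fields but may not disagree on a field's type are all assumptions made for illustration. Only the `schema` and `validate_schema` names come from the proposal above.

```python
from typing import Dict, List, Optional

# Stand-in for a real schema: field name -> type name (an assumption for this sketch).
Schema = Dict[str, str]


def _select(fragments: List[Schema], inspect_fragments: Optional[int]) -> List[Schema]:
    """Return the fragments to look at; None means inspect all of them."""
    return fragments if inspect_fragments is None else fragments[:inspect_fragments]


def infer_schema(fragments: List[Schema], inspect_fragments: Optional[int] = 1) -> Schema:
    """Unify the schemas of the inspected fragments, failing on type conflicts."""
    unified: Schema = {}
    for frag in _select(fragments, inspect_fragments):
        for name, typ in frag.items():
            # setdefault keeps the first type seen for a field name.
            if unified.setdefault(name, typ) != typ:
                raise ValueError(f"conflicting types for field {name!r}")
    return unified


def finish(
    fragments: List[Schema],
    schema: Optional[Schema] = None,
    validate_schema: bool = True,
    inspect_fragments: Optional[int] = 1,
) -> Schema:
    """Hypothetical analogue of the factory's finish step.

    - schema: user-specified schema; inferred from the fragments if None.
    - validate_schema: check the inspected fragments against the schema.
    - inspect_fragments: how many fragments to inspect (None = all).
    """
    if schema is None:
        schema = infer_schema(fragments, inspect_fragments)
    if validate_schema:
        for i, frag in enumerate(_select(fragments, inspect_fragments)):
            for name, typ in frag.items():
                # Assumed rule: fields absent from the schema are ignored,
                # but a shared field must have the same type.
                if name in schema and schema[name] != typ:
                    raise ValueError(
                        f"fragment {i}: field {name!r} has type {typ}, "
                        f"expected {schema[name]}"
                    )
    return schema
```

In this model, inspecting only the first fragment is cheap but can miss a mismatching file later in the dataset, while `inspect_fragments=None` catches every mismatch at the cost of touching every fragment. Passing an explicit `schema` with `validate_schema=False` skips inspection entirely.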

Reporter: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-8221. Please see the migration documentation for further details.
