[Python][Dataset] Expose schema inference / validation options in the factory #24418

ARROW-8058 added options related to schema inference / validation for the Dataset factory. We should expose these in Python through the `dataset(..)` factory function:

- Add the ability to pass a user-specified schema with a `schema` keyword, instead of inferring the schema from (one of) the files. The schema would be passed on to the factory's finish method.
- Add a `validate_schema` option to toggle whether the schema is validated against the actual files.
- Expose in some way the number of fragments to inspect when inferring or validating the schema. Not sure yet what the best API for this would be.

  Some relevant notes from the original PR: https://github.com/apache/arrow/pull/6687#issuecomment-604394407

**Reporter**: [Joris Van den Bossche](https://issues.apache.org/jira/browse/ARROW-8221) / @jorisvandenbossche
#### Subtasks:
- [X] [[Python][Dataset] Passthrough schema to Factory.finish() in dataset() function](https://github.com/apache/arrow/issues/24485)
#### Related issues:
- [[Python][Parquet] improve reading of partitioned parquet datasets whose schema changed](https://github.com/apache/arrow/issues/25089) (duplicates)
- [[Python] add option for taking all columns from all files in pa.dataset](https://github.com/apache/arrow/issues/25528) (duplicates)
- [ValueError: Keyword 'validate_schema' is not yet supported with the new Dataset API](https://github.com/apache/arrow/issues/32585) (is duplicated by)
#### PRs and other links:
- [GitHub Pull Request #8912](https://github.com/apache/arrow/pull/8912)

<sub>**Note**: *This issue was originally created as [ARROW-8221](https://issues.apache.org/jira/browse/ARROW-8221). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>
