-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-8058: [Dataset] Relax DatasetFactory discovery validation #6687
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-8058: [Dataset] Relax DatasetFactory discovery validation #6687
Conversation
15be905 to
3e70fb7
Compare
3e70fb7 to
9a8dcea
Compare
bkietz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General approach looks good, a few comments:
jorisvandenbossche
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just looked at the doc comments, the interaction between the different options is not yet fully clear to me
ca2a6cb to
4a621de
Compare
4a621de to
73310eb
Compare
|
I don't think the c/glib failure is related. |
bkietz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, merging
(the CI failure is https://issues.apache.org/jira/browse/ARROW-8215 )
| assert options.partition_base_dir == 'subdir' | ||
| assert options.ignore_prefixes == ['.', '_'] | ||
| assert options.exclude_invalid_files is True | ||
| assert options.exclude_invalid_files is False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the default changed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, validates require reading files.
|
This still needs more detailed Python (and R) bindings, right? |
|
That was my understanding from when we discussed Monday as well. |
|
Opened an issue for that https://issues.apache.org/jira/browse/ARROW-8221 (at least for the Python side, R probably needs a similar one?) |
|
Still some questions about this:
|
| /// Indicate if the given Schema (when specified), should be validated against | ||
| /// the fragments' schemas. `inspect_options` will control how many fragments | ||
| /// are checked. | ||
| bool validate_fragments = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would validate_schema be a better name for this?
You get no schema, or only the one in the partitioning if there's one.
For |
This PR aims to improve the latency of the discovery process. Notably, it selects "fast" defaults over "safe" defaults.
InspectOptionswhich limits the number of fragments inspected to infer the schema, it defaults to one fragment.FinishOptionswhich toggles if validation of the optional schema and also controls the number of fragments it validates with. It defaults to disabling validation.This gives a noticeable speedup when the fragments have a uniform schema.