ARROW-8221: [Python][Dataset] Expose schema inference/validation factory options through the validate_schema keyword #8912
python/pyarrow/tests/test_dataset.py
Do datasets guarantee that the first file in alphabetical order is used to infer the schema?
Yes, the file paths get sorted:
arrow/cpp/src/arrow/dataset/discovery.cc, line 205 in 48fee66:
    std::sort(files.begin(), files.end(), fs::FileInfo::ByPath());
(Whether this should rather be a "natural" sort is another issue.)
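To illustrate the aside about natural sorting, here is a minimal Python sketch. It is not the Arrow implementation; the file names and the `natural_key` helper are hypothetical and only show how a plain path sort and a "natural" sort can pick a different first file (and therefore a different file for schema inference).

```python
import re

# Hypothetical fragment paths, chosen only to show the ordering difference.
paths = ["data/part-10.parquet", "data/part-2.parquet", "data/part-1.parquet"]


def natural_key(path):
    """Split a path into text and digit runs so that digit runs compare as integers."""
    # re.split with a capturing group alternates text, digits, text, ...
    # so the element types line up position by position between keys.
    tokens = re.split(r"(\d+)", path)
    return [int(tok) if tok.isdigit() else tok for tok in tokens]


# Plain lexicographic order, analogous to sorting by path string.
lexicographic = sorted(paths)
# "Natural" order, where part-2 comes before part-10.
natural = sorted(paths, key=natural_key)

print(lexicographic)
print(natural)
```

With these names the first file is the same in both orderings, but the second and third swap, so which files are inspected depends on the sort as soon as more than one fragment is used.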
d4608a9 to 356c300
Is this still targeted for 4.0? It needs a rebase if so, but otherwise looks good.
…ctory options through the validate_schema keyword
0422334 to 1839de0
Rebased now. @lidavidm I am still a bit in doubt about the exact API.
Naming it
@jorisvandenbossche What is the status on this?
@jorisvandenbossche just a gentle nudge :)
Yeah, sorry for the slow follow-up here. It was on my to-do list to have a look at today.
But for this last case, you might still have the options of inferring from the first fragment, or reading the schema of all fragments and unifying them (or erroring when they can't be unified). So if we have e.g. a
That sounds reasonable to me.
@jorisvandenbossche Are you planning to push this forward?
@jorisvandenbossche shall we close this as stale?
Ping @jorisvandenbossche: can you make a decision on this?
Closing because it has been untouched for a while. In case it's still relevant, feel free to reopen and move it forward 👍
The C++ `FileSystemDatasetFactory::Finish` method handles the schema inference or validation with two options: `InspectOptions::fragments` to indicate the number of fragments to use when inferring or validating the schema (default of 1), and `FinishOptions::validate_fragments` to indicate whether to validate the specified schema (when not inferred).

For now, I decided to combine this in a single keyword on the Python side (`validate_schema`). This avoids adding 2 inter-dependent keywords for this, and makes it easier to express some typical use cases (e.g. validating the specified schema with all fragments is now `validate_schema=True` instead of `validate_schema=True, fragments=-1`). On the other hand, it gives a single keyword that accepts either a boolean or an int (which is not super clean). So this is certainly up for discussion.
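As a rough sketch of the trade-off described above, the following pure-Python helper shows one way a single bool-or-int keyword could be translated into the two underlying options. It does not reproduce the PR's code: the names `resolve_validate_schema` and `FactoryOptions` are hypothetical, and the mapping of `False` to "inspect one fragment, no validation" is an assumption based on the stated default of 1.

```python
from typing import NamedTuple, Union

# Assumption: -1 means "all fragments", following the fragments=-1 example above.
INSPECT_ALL_FRAGMENTS = -1


class FactoryOptions(NamedTuple):
    """Hypothetical container for the two C++-side options."""
    inspect_fragments: int    # stands in for InspectOptions::fragments
    validate_fragments: bool  # stands in for FinishOptions::validate_fragments


def resolve_validate_schema(validate_schema: Union[bool, int],
                            schema_specified: bool) -> FactoryOptions:
    """Map a combined bool-or-int keyword onto the two separate options."""
    # bool must be checked first because bool is a subclass of int in Python.
    if isinstance(validate_schema, bool):
        if validate_schema:
            # True: use all fragments; validation only applies to a given schema.
            return FactoryOptions(INSPECT_ALL_FRAGMENTS, schema_specified)
        # False (assumed): fall back to one fragment and skip validation.
        return FactoryOptions(1, False)
    if isinstance(validate_schema, int):
        if validate_schema < 1:
            raise ValueError("validate_schema must be a bool or a positive int")
        # int: use that many fragments for inference or validation.
        return FactoryOptions(validate_schema, schema_specified)
    raise TypeError("validate_schema must be a bool or an int")


print(resolve_validate_schema(True, schema_specified=True))
print(resolve_validate_schema(3, schema_specified=False))
```

The `isinstance(..., bool)` check ahead of the `int` check is the part that makes a combined keyword "not super clean": `True` and `1` are equal in Python but would have to mean different things here.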