-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-17784: [C++] Opening a dataset where partitioning variable is already in the dataset should error differently #14444
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cpp/src/arrow/dataset/discovery.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will catch a problem if the error is noticed during inspection but sometimes we don't use all fragments to inspect a dataset and other times we might not do dataset inspection at all (e.g. if the user provides the dataset schema and the partitioning schema). I wonder if we might want to check somewhere at scan time instead of during discovery.
Also, in some cases, maybe this is not always a bad thing. I seem to recall users would sometimes store the schema information in a column in the file in addition to the filename. Maybe a better behavior would be to silently ignore the column in the file if there is partitioning information that specifies a given column. Or at least to make it configurable (ignore vs error vs two columns with the same name). @thisisnic any preference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My preference would be either silently ignoring it or making it configurable (with one of the configurable options being to silently ignore it).
Users might inherit poorly-designed datasets that they have no control over how they've been written, and I'd hate for it to be impossible for them to be able to work with them in Arrow and just get an error.
I'm not sure how/if having a dataset with 2 columns with the same name would work in R as I think it'd trigger an error somewhere (if it didn't, I think we'd maybe want to make it do so?), but if it'd be useful to have that as a feature, we can just catch it in R and add our own custom error, so I don't see a problem in that.
Silently ignoring would be fine too, as we can always add our own error in R if we want to warn users that there's duplication, and it at least allows execution.
|
Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍 |
670e5c1 to
72ff24b
Compare
|
Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer. |
This PR adds the feature of raising invalid status if the fragment schema and partitioning schema have any common fields, i.e. provided partitioning schema for reading is not the one used for writing.