@sanjibansg
Contributor

This PR raises an Invalid status when the fragment schema and the partitioning schema have any fields in common, i.e. when the partitioning schema provided for reading is not the one that was used for writing.
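To make the check concrete, here is a minimal sketch in plain Python. It is not the PR's C++ code: schemas are modeled as lists of field names, and `overlapping_fields` and `check_partitioning` are hypothetical names used only for illustration.

```python
def overlapping_fields(fragment_fields, partitioning_fields):
    """Return the names that appear in both schemas, in fragment order."""
    partition_names = set(partitioning_fields)
    return [name for name in fragment_fields if name in partition_names]


def check_partitioning(fragment_fields, partitioning_fields):
    """Raise if a partition field is also stored inside the fragment.

    ValueError stands in for the Invalid status mentioned above.
    """
    common = overlapping_fields(fragment_fields, partitioning_fields)
    if common:
        raise ValueError(
            f"Partitioning schema overlaps fragment schema on: {common}"
        )


# A file that also stores the "year" partition column is rejected.
try:
    check_partitioning(["year", "value"], ["year", "month"])
except ValueError as exc:
    print(exc)
```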

@github-actions

Comment on lines 256 to 272
Member

This will catch the problem if it shows up during inspection, but sometimes we don't use all fragments to inspect a dataset, and other times we might not do dataset inspection at all (e.g. if the user provides both the dataset schema and the partitioning schema). I wonder if we might want to check somewhere at scan time instead of during discovery.
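As a rough illustration of what a scan-time check could look like (plain Python, not Arrow's API), the sketch below validates each fragment lazily as it is read, so it still fires when discovery inspected only some fragments or none. The `scan` generator and the `(path, field_names)` fragment representation are assumptions made for the example.

```python
def scan(fragments, partitioning_fields):
    """Yield (path, field_names) per fragment, validating each one lazily.

    `fragments` is a hypothetical stand-in: an iterable of
    (path, field_names) pairs rather than real dataset fragments.
    """
    partition_names = set(partitioning_fields)
    for path, fields in fragments:
        common = [name for name in fields if name in partition_names]
        if common:
            raise ValueError(
                f"{path}: fields {common} also appear in the partitioning schema"
            )
        yield path, fields


fragments = [
    ("year=2020/a.parquet", ["value"]),
    ("year=2021/b.parquet", ["year", "value"]),  # duplicates the partition field
]
scanner = scan(fragments, ["year"])
print(next(scanner))  # the first fragment passes the check
# Advancing to the second fragment would raise ValueError at scan time.
```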

Also, this may not always be a bad thing. I seem to recall that users would sometimes store the schema information in a column in the file in addition to the filename. Maybe a better behavior would be to silently ignore the column in the file if there is partitioning information that specifies a given column, or at least to make it configurable (ignore vs. error vs. two columns with the same name). @thisisnic any preference?
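A minimal sketch of the three options, in plain Python rather than Arrow's API. `DuplicatePolicy` and `resolve_columns` are hypothetical names, and the output column order (file columns first, then partition columns) is an assumption for the example.

```python
from enum import Enum


class DuplicatePolicy(Enum):
    IGNORE = "ignore"        # drop the file column, keep the partition value
    ERROR = "error"          # refuse to read the dataset
    KEEP_BOTH = "keep_both"  # expose two columns with the same name


def resolve_columns(file_fields, partition_fields, policy):
    """Return the output column names for one fragment under `policy`."""
    partition_names = set(partition_fields)
    duplicates = [name for name in file_fields if name in partition_names]
    if duplicates and policy is DuplicatePolicy.ERROR:
        raise ValueError(f"Columns {duplicates} are also partition fields")
    if policy is DuplicatePolicy.IGNORE:
        file_fields = [name for name in file_fields if name not in partition_names]
    return list(file_fields) + list(partition_fields)


print(resolve_columns(["year", "value"], ["year"], DuplicatePolicy.IGNORE))
print(resolve_columns(["year", "value"], ["year"], DuplicatePolicy.KEEP_BOTH))
```

With `IGNORE` the file's copy of `year` is dropped, while `KEEP_BOTH` leaves a duplicated name that a binding such as R could detect and turn into its own error or warning.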

Member

My preference would be either silently ignoring it or making it configurable (with one of the configurable options being to silently ignore it).

Users might inherit poorly designed datasets and have no control over how they were written, and I'd hate for them to be unable to work with those datasets in Arrow and just get an error.

I'm not sure how, or if, a dataset with two columns with the same name would work in R, as I think it'd trigger an error somewhere (and if it didn't, I think we'd maybe want to make it do so). But if it'd be useful to have that as a feature, we can just catch it in R and add our own custom error, so I don't see a problem with that.

Silently ignoring would be fine too, as we can always add our own error in R if we want to warn users that there's duplication, and it at least allows execution.

@amol-
Member

amol- commented Mar 30, 2023

Closing because it has been untouched for a while. If it's still relevant, feel free to reopen and move it forward 👍

@amol- closed this Mar 30, 2023
@westonpace reopened this May 16, 2023
@westonpace requested a review from AlenkaF as a code owner May 16, 2023 04:20
@github-actions added the awaiting review label Jun 5, 2023
@github-actions

Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer.

@github-actions added the Status: stale-warning label Nov 18, 2025
@github-actions closed this Dec 9, 2025

Labels

awaiting review, Component: C++, Component: Python, Status: stale-warning (Issues and PRs flagged as stale which are due to be closed if no indication otherwise)

4 participants