ARROW-17784: [C++] Opening a dataset where partitioning variable is already in the dataset should error differently #14444

sanjibansg · 2022-10-18T07:29:04Z

This PR raises an invalid status if the fragment schema and the partitioning schema have any fields in common, i.e. if the partitioning schema provided for reading is not the one that was used for writing.
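The check described above can be pictured with a small standalone sketch. It is not the code in this PR and does not use the Arrow API: `FieldSketch`, `CommonFieldNames`, and `CheckNoCommonFields` are hypothetical names, and schemas are modeled as plain vectors of name/type pairs.

```cpp
#include <optional>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a schema field; this is not the Arrow Field class.
struct FieldSketch {
  std::string name;
  std::string type;
};

// Returns the field names that appear in both schemas, in fragment order.
std::vector<std::string> CommonFieldNames(
    const std::vector<FieldSketch>& fragment_schema,
    const std::vector<FieldSketch>& partitioning_schema) {
  std::unordered_set<std::string> partition_names;
  for (const auto& field : partitioning_schema) {
    partition_names.insert(field.name);
  }
  std::vector<std::string> common;
  for (const auto& field : fragment_schema) {
    if (partition_names.count(field.name) > 0) {
      common.push_back(field.name);
    }
  }
  return common;
}

// Returns an error message when the schemas overlap, std::nullopt otherwise.
// A real implementation would presumably return a status object instead.
std::optional<std::string> CheckNoCommonFields(
    const std::vector<FieldSketch>& fragment_schema,
    const std::vector<FieldSketch>& partitioning_schema) {
  const auto common = CommonFieldNames(fragment_schema, partitioning_schema);
  if (common.empty()) {
    return std::nullopt;
  }
  std::string message =
      "Fields present in both fragment and partitioning schema:";
  for (const auto& name : common) {
    message += " " + name;
  }
  return message;
}
```

The sketch compares names only; whether the data types should also be compared is a separate design question that this example does not try to answer.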

github-actions · 2022-10-18T07:29:33Z

https://issues.apache.org/jira/browse/ARROW-17784

westonpace · 2022-11-08T00:49:17Z

cpp/src/arrow/dataset/discovery.cc

This will catch the problem if it is noticed during inspection, but sometimes we don't use all fragments to inspect a dataset, and other times we might not do dataset inspection at all (e.g. if the user provides both the dataset schema and the partitioning schema). I wonder if we might want to check somewhere at scan time instead of during discovery.
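To illustrate the gap, here is a rough standalone sketch (standard library only, not Arrow code). All names are hypothetical, and each fragment is modeled as just a list of column names. A check that inspects only the first few fragments can miss a conflict that a per-fragment check at scan time would find.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical model: a fragment is just the list of its column names.
using FragmentColumnsSketch = std::vector<std::string>;

// True if any column of the fragment is also a partition field.
bool FragmentConflicts(const FragmentColumnsSketch& fragment,
                       const std::unordered_set<std::string>& partition_names) {
  for (const auto& name : fragment) {
    if (partition_names.count(name) > 0) {
      return true;
    }
  }
  return false;
}

// Discovery-style check: looks only at the first `fragments_to_inspect`
// fragments, so later fragments are never examined.
bool ConflictSeenDuringInspection(
    const std::vector<FragmentColumnsSketch>& fragments,
    const std::unordered_set<std::string>& partition_names,
    std::size_t fragments_to_inspect) {
  const std::size_t limit = std::min(fragments_to_inspect, fragments.size());
  for (std::size_t i = 0; i < limit; ++i) {
    if (FragmentConflicts(fragments[i], partition_names)) {
      return true;
    }
  }
  return false;
}

// Scan-style check: visits every fragment and returns the indices of the
// ones whose columns collide with a partition field.
std::vector<std::size_t> ConflictingFragmentsAtScan(
    const std::vector<FragmentColumnsSketch>& fragments,
    const std::unordered_set<std::string>& partition_names) {
  std::vector<std::size_t> conflicting;
  for (std::size_t i = 0; i < fragments.size(); ++i) {
    if (FragmentConflicts(fragments[i], partition_names)) {
      conflicting.push_back(i);
    }
  }
  return conflicting;
}
```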

Also, maybe this is not always a bad thing. I seem to recall users would sometimes store the schema information in a column in the file in addition to the filename. Maybe a better behavior would be to silently ignore the column in the file if there is partitioning information that specifies that column. Or at least to make it configurable (ignore vs error vs two columns with the same name). @thisisnic any preference?

thisisnic · 2022-11-08

My preference would be either silently ignoring it or making it configurable (with one of the configurable options being to silently ignore it).

Users might inherit poorly designed datasets and have no control over how they were written, and I'd hate for it to be impossible for them to work with those datasets in Arrow because all they get is an error.

I'm not sure how/if having a dataset with 2 columns with the same name would work in R, as I think it'd trigger an error somewhere (if it didn't, I think we'd maybe want to make it do so?). But if it'd be useful to have that as a feature, we can just catch it in R and add our own custom error, so I don't see a problem with that.

Silently ignoring would be fine too, as we can always add our own error in R if we want to warn users that there's duplication, and it at least allows execution.
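The three behaviors discussed in this thread (ignore vs error vs two columns with the same name) could be sketched roughly as below. This is only an illustration using the standard library: `DuplicateFieldBehaviorSketch` and `ResolveColumnsSketch` are hypothetical names, not existing Arrow options, and columns are modeled as plain strings.

```cpp
#include <stdexcept>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical option; no such setting is claimed to exist in Arrow.
enum class DuplicateFieldBehaviorSketch {
  kIgnoreFileColumn,  // drop the file column, keep the partition field
  kError,             // refuse to read the dataset
  kKeepBoth           // expose two columns with the same name
};

// Returns the resulting column names: the file columns (possibly filtered)
// followed by the partition columns.
std::vector<std::string> ResolveColumnsSketch(
    const std::vector<std::string>& file_columns,
    const std::vector<std::string>& partition_columns,
    DuplicateFieldBehaviorSketch behavior) {
  const std::unordered_set<std::string> partition_names(
      partition_columns.begin(), partition_columns.end());
  std::vector<std::string> resolved;
  for (const auto& name : file_columns) {
    const bool duplicated = partition_names.count(name) > 0;
    if (duplicated && behavior == DuplicateFieldBehaviorSketch::kError) {
      throw std::invalid_argument("Column '" + name +
                                  "' is in both the file and the partitioning");
    }
    if (duplicated &&
        behavior == DuplicateFieldBehaviorSketch::kIgnoreFileColumn) {
      continue;  // silently skip the column stored in the file
    }
    resolved.push_back(name);
  }
  resolved.insert(resolved.end(), partition_columns.begin(),
                  partition_columns.end());
  return resolved;
}
```

With `kIgnoreFileColumn`, a binding such as R could still detect the duplication on its own side and emit a warning or a custom error, which matches the preference stated above.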

amol- · 2023-03-30T17:09:31Z

Closing because it has been untouched for a while. If it's still relevant, feel free to reopen and move it forward 👍

github-actions · 2025-11-18T11:23:13Z

Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer.

github-actions bot added Component: C++ Component: Python labels Oct 18, 2022

raulcd mentioned this pull request Oct 18, 2022

ARROW-18083: [C++] Bump vendored zlib version #14446

Merged

westonpace self-requested a review October 21, 2022 04:03

westonpace reviewed Nov 8, 2022

amol- closed this Mar 30, 2023

westonpace reopened this May 16, 2023

westonpace requested a review from AlenkaF as a code owner May 16, 2023 04:20

sanjibansg added 3 commits May 22, 2023 11:20

feat: raise error if partitioning field present in fragment

44c3ed5

feat: testing with SchemaBuilder::CONFLICT_APPEND

63a764b

feat: checking for data type while discovery

72ff24b

sanjibansg force-pushed the fix-partitioning branch from 670e5c1 to 72ff24b Compare June 5, 2023 04:42

github-actions bot added the awaiting review Awaiting review label Jun 5, 2023

github-actions bot added the Status: stale-warning Issues and PRs flagged as stale which are due to be closed if no indication otherwise label Nov 18, 2025

github-actions bot closed this Dec 9, 2025
