ARROW-17388: [C++][Python] Error on WriteTable if duplicate field names #13938
Conversation
@pitrou if this is the correct way to go about it, I'm not sure how to resolve the
So, it seems this is a capability that should be preserved. The problem is that the new dataset implementation doesn't allow reading the file back:

```python
>>> pq.read_table('file.parquet', use_legacy_dataset=False)
Traceback (most recent call last):
[...]
ArrowInvalid: Multiple matches for FieldRef.Name(a) in a: int64
a: int64
__fragment_index: int32
__batch_index: int32
__last_in_fragment: bool
__filename: string

>>> pq.read_table('file.parquet', use_legacy_dataset=True)
<ipython-input-12-6eeebe64658f>:1: FutureWarning: Passing 'use_legacy_dataset=True' to get the legacy behaviour is deprecated as of pyarrow 8.0.0, and the legacy implementation will be removed in a future version.
  pq.read_table('file.parquet', use_legacy_dataset=True)
pyarrow.Table
a: int64
a: int64
----
a: [[4,5,6]]
a: [[1,2,3]]
```
I've actually been poking around this area recently (#13782). I would say this is somewhat related to the problem of "schema evolution". The current behavior is undocumented, but it attempts to handle some potential variation in schema between files. As a result, field references need to be names, and we look up each name in the fragment schema to figure out which column to map it to in the dataset schema.

For example, if the fragments have schemas (a table of Fragment 1, Fragment 2, and the dataset schema) and the user asks for "b", then we look for column 1 in fragment 1 and column 2 in fragment 2 (see the sketch below). This approach breaks down pretty quickly when a fragment has duplicate columns with the same name.

Once #13782 merges, then perhaps we could add a "no evolution" option which would be the default if there is only a single fragment. This option would allow for duplicate columns. What should be returned if the user were to run...
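A minimal sketch of that name-based lookup (file names, column names, and values here are illustrative, not taken from the PR):

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Two fragments in which "b" sits at different column positions.
pq.write_table(pa.table({"a": [1], "b": [2]}), "frag1.parquet")
pq.write_table(pa.table({"a": [3], "c": [4], "b": [5]}), "frag2.parquet")

# Field references are resolved by *name* against each fragment's schema,
# so "b" maps to column 1 in frag1 but column 2 in frag2.
dataset = ds.dataset(["frag1.parquet", "frag2.parquet"])
print(dataset.to_table(columns=["b"]))
```

With duplicate field names, this by-name resolution is ambiguous, which is exactly the `Multiple matches for FieldRef.Name(a)` error shown above.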
Hmm, it would be nice if this could work even without disabling schema evolution. Perhaps a heuristic is possible?
Indeed, we shouldn't disallow this, since Parquet itself allows duplicate field names. And our Parquet reader actually can read such files as well (it's only the dataset code that fails):
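A sketch of the difference between the two read paths, assuming a file with duplicate field names produced by a pyarrow version that predates this patch:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A table with two fields named "a" (only writable by a pyarrow
# version without this patch).
table = pa.Table.from_arrays(
    [pa.array([1, 2, 3]), pa.array([4, 5, 6])],
    names=["a", "a"],
)
pq.write_table(table, "dup.parquet")

# The core Parquet reader handles the duplicate names fine:
print(pq.ParquetFile("dup.parquet").read())

# ...whereas the dataset-based path raises
# "ArrowInvalid: Multiple matches for FieldRef.Name(a) ...":
pq.read_table("dup.parquet")
```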
https://issues.apache.org/jira/browse/ARROW-8210 seems to be an existing issue about Dataset support for duplicate field names.
That would work. I'm not really opposed to it. Though it seems like it would be very rare that this is the correct behavior. I think we would just be hiding the corner case rather than really resolving it. Either a user is creating files with consistent column ordering, in which case duplicates are fine, or they are not creating files with consistent column ordering, in which case duplicates are a problem. It would be rather odd for a user to have inconsistent column ordering except in the case of duplicate column names.
Yes, it is not at all a problem when you only have one file, and I agree the datasets code should be updated to handle this single-file case correctly.
Without this patch, the following is possible:
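For example, a sketch of the pre-patch behavior (values illustrative; the original snippet from the PR description is not preserved here):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A table with two fields both named "a".
table = pa.Table.from_arrays(
    [pa.array([1, 2, 3]), pa.array([4, 5, 6])],
    names=["a", "a"],
)

# Before this patch, the write silently succeeds, producing a file
# that pq.read_table() can no longer read back.
pq.write_table(table, "file.parquet")
```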
This patch will prevent `pq.write_table(...)` from writing a table with duplicate field names:
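A sketch of the behavior with the patch applied (the exact error message below is an assumption, not copied from the PR):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_arrays(
    [pa.array([1, 2, 3]), pa.array([4, 5, 6])],
    names=["a", "a"],
)

# With this patch, the write is rejected up front, e.g. with
# something like "ArrowInvalid: Duplicate field name 'a'".
pq.write_table(table, "file.parquet")
```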