Conversation

@milesgranger
Contributor

Without this patch, the following is possible:

import pyarrow as pa
import pyarrow.parquet as pq

t = pa.Table.from_pydict({'a': [1,2,3]})
t = t.add_column(0, 'a', pa.array([4, 5, 6]))  # Adding column with same field name

pq.write_table(t, 'file.parquet')  # OK
pq.read_table('file.parquet')  # Error
...
ArrowInvalid: Multiple matches for FieldRef.Name(a) in a: int64
a: int64
__fragment_index: int32
__batch_index: int32
__last_in_fragment: bool
__filename: string

This patch will prevent pq.write_table(...) from writing a table with duplicate field names:

pq.write_table(t, 'file.parquet')
...
ArrowInvalid: Cannot write parquet table with duplicate field names: a
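
For reference, a minimal Python-level sketch of the duplicate-name check this patch describes. The real validation lives in the C++ writer; the helper name here is hypothetical and illustrative only:

import pyarrow as pa
from collections import Counter

def check_duplicate_field_names(schema: pa.Schema) -> None:
    # Hypothetical illustration of the patch's check, not an Arrow API.
    duplicates = [name for name, n in Counter(schema.names).items() if n > 1]
    if duplicates:
        raise pa.ArrowInvalid(
            "Cannot write parquet table with duplicate field names: "
            + ", ".join(duplicates))

check_duplicate_field_names(t.schema)  # raises ArrowInvalid for the table above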

@milesgranger changed the title from "ARROW-17388: [C++][Python] Prevent Fields with same name in Schema" to "ARROW-17388: [C++][Python] Error on WriteTable if duplicate field names" on Aug 22, 2022

@milesgranger
Contributor Author

@pitrou if this is the correct way to go about it, I'm not sure how to resolve the TestArrowReadWrite.TableWithDuplicateColumns unit test, as it verifies the behavior this PR removes.

@pitrou
Member

pitrou commented Aug 22, 2022

So, it seems this is a capability that should be preserved. The problem is the new dataset implementation doesn't allow reading the file back:

>>> pq.read_table('file.parquet', use_legacy_dataset=False)
Traceback (most recent call last):
  [...]
ArrowInvalid: Multiple matches for FieldRef.Name(a) in a: int64
a: int64
__fragment_index: int32
__batch_index: int32
__last_in_fragment: bool
__filename: string

>>> pq.read_table('file.parquet', use_legacy_dataset=True)
<ipython-input-12-6eeebe64658f>:1: FutureWarning: Passing 'use_legacy_dataset=True' to get the legacy behaviour is deprecated as of pyarrow 8.0.0, and the legacy implementation will be removed in a future version.
  pq.read_table('file.parquet', use_legacy_dataset=True)
pyarrow.Table
a: int64
a: int64
----
a: [[4,5,6]]
a: [[1,2,3]]

cc @westonpace @jorisvandenbossche

@westonpace
Member

I've actually been poking around this area recently (#13782). I would say this is somewhat related to the problem of "schema evolution". The current behavior is undocumented, but it attempts to handle some potential variation in schemas between files. As a result, field references need to be names, and we look up each name in the fragment schema to figure out which column it maps to in the dataset schema.

For example, if the fragments have schemas:

Fragment 1
a,b,c

Fragment 2
c,a,b

Dataset schema
b,c,a

If the user asks for "b", then we look for column 1 in fragment 1 and column 2 in fragment 2. This approach breaks down pretty quickly when a fragment has duplicate columns with the same name.
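
As a rough sketch of that name-based lookup (plain Python with hypothetical names, not the actual dataset internals):

def resolve_field(name, fragment_names):
    # Map a dataset field name to the matching column index in a fragment.
    matches = [i for i, n in enumerate(fragment_names) if n == name]
    if not matches:
        raise KeyError(name)  # field missing from this fragment
    if len(matches) > 1:
        # This is the case that surfaces as "ArrowInvalid: Multiple matches
        # for FieldRef.Name(...)" in the dataset code.
        raise ValueError(f"Multiple matches for field {name!r}")
    return matches[0]

resolve_field('b', ['a', 'b', 'c'])  # fragment 1 -> column 1
resolve_field('b', ['c', 'a', 'b'])  # fragment 2 -> column 2
resolve_field('a', ['a', 'a'])       # duplicate names -> error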

Once #13782 merges, perhaps we could add a "no evolution" option, which would be the default if there is only a single fragment. This option would allow duplicate columns.

What should be returned if the user were to run...

pq.read_table('file.parquet', use_legacy_dataset=False, columns=["a"])

@pitrou
Member

pitrou commented Aug 23, 2022

Hmm, it would be nice if this could work even without disabling schema evolution. Perhaps a heuristic is possible?

  • if a field name is unique, do as usual
  • if a field name is non-unique, require that it has the same number of occurrences in both schemas, and iterate on those pairs in order (sketched below)
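
A rough illustration of that pairing heuristic (hypothetical helper, not an Arrow API):

def pair_occurrences(name, dataset_names, fragment_names):
    # Pair the i-th occurrence of a duplicated name in the dataset schema
    # with the i-th occurrence in the fragment schema, in order.
    ds = [i for i, n in enumerate(dataset_names) if n == name]
    fr = [i for i, n in enumerate(fragment_names) if n == name]
    if len(ds) != len(fr):
        raise ValueError(
            f"{name!r} occurs {len(ds)} times in the dataset schema "
            f"but {len(fr)} times in this fragment")
    return list(zip(ds, fr))

pair_occurrences('a', ['a', 'a', 'b'], ['b', 'a', 'a'])  # -> [(0, 1), (1, 2)]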

@jorisvandenbossche
Member

Indeed, we shouldn't disallow this, since Parquet itself allows duplicate field names. And our Parquet reader can actually read such files (it's only the dataset code that fails):

>>> pq.read_table("file_duplicate.parquet")    # <-- defaults to use the dataset API
...
ArrowInvalid: Multiple matches for FieldRef.Name(a) in ...

>>> pq.ParquetFile("file_duplicate.parquet").read()   # <-- direct usage of Parquet reader works fine
pyarrow.Table
a: int64
a: int64
----
a: [[0,1,2,3,4]]
a: [[5,6,7,8,9]]

@jorisvandenbossche
Member

https://issues.apache.org/jira/browse/ARROW-8210 seems to be an existing issue about Dataset support for duplicate field names.

@milesgranger deleted the ARROW-17388_catch-multi-matches-FieldRef-name branch on August 23, 2022 at 08:44
@westonpace
Member

Hmm, it would be nice if this could work even without disabling schema evolution. Perhaps a heuristic is possible?

  • if a field name is unique, do as usual
  • if a field name is non-unique, require that it has the same number of occurrences in both schemas, and iterate on those pairs in order

That would work, and I'm not really opposed to it, though it seems like it would be very rare that this is the correct behavior. I think we would just be hiding the corner case rather than really resolving it. Either a user is creating files with consistent column ordering, in which case duplicates are fine, or they are not, in which case duplicates are a problem. It would be rather odd for a user to have inconsistent column ordering except in the case of duplicate column names.

Indeed, we shouldn't disallow this since Parquet itself allows duplicate field names. And our Parquet reader actually also can read such files (it's only the dataset code that fails):

Yes, it is not a problem at all when you only have one file, and I agree the datasets code should be updated to handle this single-file case correctly.
