Conversation

@milesgranger
Contributor

Without this patch, the following is possible:

import pyarrow as pa
import pyarrow.parquet as pq

t = pa.Table.from_pydict({'a': [1,2,3]})
t = t.add_column(0, 'a', pa.array([4, 5, 6]))  # Adding column with same field name

pq.write_table(t, 'file.parquet')  # OK
pq.read_table('file.parquet')  # Error
...
ArrowInvalid: Multiple matches for FieldRef.Name(a) in a: int64
a: int64
__fragment_index: int32
__batch_index: int32
__last_in_fragment: bool
__filename: string

This patch will prevent pq.write_table(...) from writing a table with duplicate field names:

pq.write_table(t, 'file.parquet')
...
ArrowInvalid: Cannot write parquet table with duplicate field names: a
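
For reference, a minimal Python-level sketch of the duplicate-name check this patch describes. The real validation lives in the C++ writer; the helper name here is hypothetical and illustrative only:

import pyarrow as pa
from collections import Counter

def check_duplicate_field_names(schema: pa.Schema) -> None:
    # Hypothetical illustration of the patch's check, not an Arrow API.
    duplicates = [name for name, n in Counter(schema.names).items() if n > 1]
    if duplicates:
        raise pa.ArrowInvalid(
            "Cannot write parquet table with duplicate field names: "
            + ", ".join(duplicates))

check_duplicate_field_names(t.schema)  # raises ArrowInvalid for the table above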

@milesgranger changed the title from "ARROW-17388: [C++][Python] Prevent Fields with same name in Schema" to "ARROW-17388: [C++][Python] Error on WriteTable if duplicate field names" on Aug 22, 2022

@milesgranger
Contributor Author

@pitrou if this is the correct way to go about it, I'm not sure how to resolve the TestArrowReadWrite.TableWithDuplicateColumns unit test, as it verifies the behavior this PR removes.

@pitrou
Member

pitrou commented Aug 22, 2022

So, it seems this is a capability that should be preserved. The problem is the new dataset implementation doesn't allow reading the file back:

>>> pq.read_table('file.parquet', use_legacy_dataset=False)
Traceback (most recent call last):
  [...]
ArrowInvalid: Multiple matches for FieldRef.Name(a) in a: int64
a: int64
__fragment_index: int32
__batch_index: int32
__last_in_fragment: bool
__filename: string

>>> pq.read_table('file.parquet', use_legacy_dataset=True)
<ipython-input-12-6eeebe64658f>:1: FutureWarning: Passing 'use_legacy_dataset=True' to get the legacy behaviour is deprecated as of pyarrow 8.0.0, and the legacy implementation will be removed in a future version.
  pq.read_table('file.parquet', use_legacy_dataset=True)
pyarrow.Table
a: int64
a: int64
----
a: [[4,5,6]]
a: [[1,2,3]]

cc @westonpace @jorisvandenbossche

@westonpace
Member

I've actually been poking around this area recently (#13782). I would say this is somewhat related to the problem of "schema evolution". The current behavior is undocumented, but it attempts to handle some potential variation in schemas between files. As a result, field references need to be names, and we look up each name in the fragment schema to figure out which column it maps to in the dataset schema.

For example, if the fragments have schemas:

Fragment 1
a,b,c

Fragment 2
c,a,b

Dataset schema
b,c,a

If the user asks for "b", then we look for column 1 in fragment 1 and column 2 in fragment 2. This approach breaks down pretty quickly when a fragment has duplicate columns with the same name.
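
As a rough sketch of that name-based lookup (plain Python with hypothetical names, not the actual dataset internals):

def resolve_field(name, fragment_names):
    # Map a dataset field name to the matching column index in a fragment.
    matches = [i for i, n in enumerate(fragment_names) if n == name]
    if not matches:
        raise KeyError(name)  # field missing from this fragment
    if len(matches) > 1:
        # This is the case that surfaces as "ArrowInvalid: Multiple matches
        # for FieldRef.Name(...)" in the dataset code.
        raise ValueError(f"Multiple matches for field {name!r}")
    return matches[0]

resolve_field('b', ['a', 'b', 'c'])  # fragment 1 -> column 1
resolve_field('b', ['c', 'a', 'b'])  # fragment 2 -> column 2
resolve_field('a', ['a', 'a'])       # duplicate names -> error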

Once #13782 merges, perhaps we could add a "no evolution" option, which would be the default if there is only a single fragment. This option would allow duplicate columns.

What should be returned if the user were to run...

pq.read_table('file.parquet', use_legacy_dataset=False, columns=["a"])

@pitrou
Member

pitrou commented Aug 23, 2022

Hmm, it would be nice if this could work even without disabling schema evolution. Perhaps a heuristic is possible?

  • if a field name is unique, do as usual
  • if a field name is non-unique, require that it has the same number of occurrences in both schemas, and iterate on those pairs in order (sketched below)
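
A rough illustration of that pairing heuristic (hypothetical helper, not an Arrow API):

def pair_occurrences(name, dataset_names, fragment_names):
    # Pair the i-th occurrence of a duplicated name in the dataset schema
    # with the i-th occurrence in the fragment schema, in order.
    ds = [i for i, n in enumerate(dataset_names) if n == name]
    fr = [i for i, n in enumerate(fragment_names) if n == name]
    if len(ds) != len(fr):
        raise ValueError(
            f"{name!r} occurs {len(ds)} times in the dataset schema "
            f"but {len(fr)} times in this fragment")
    return list(zip(ds, fr))

pair_occurrences('a', ['a', 'a', 'b'], ['b', 'a', 'a'])  # -> [(0, 1), (1, 2)]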

@jorisvandenbossche
Member

Indeed, we shouldn't disallow this, since Parquet itself allows duplicate field names. And our Parquet reader can actually read such files (it's only the dataset code that fails):

>>> pq.read_table("file_duplicate.parquet")    # <-- defaults to use the dataset API
...
ArrowInvalid: Multiple matches for FieldRef.Name(a) in ...

>>> pq.ParquetFile("file_duplicate.parquet").read()   # <-- direct usage of Parquet reader works fine
pyarrow.Table
a: int64
a: int64
----
a: [[0,1,2,3,4]]
a: [[5,6,7,8,9]]

@jorisvandenbossche
Member

https://issues.apache.org/jira/browse/ARROW-8210 seems to be an existing issue about Dataset support for duplicate field names.

@milesgranger deleted the ARROW-17388_catch-multi-matches-FieldRef-name branch on August 23, 2022 at 08:44
@westonpace
Member

Hmm, it would be nice if this could work even without disabling schema evolution. Perhaps a heuristic is possible?

  • if a field name is unique, do as usual
  • if a field name is non-unique, require that it has the same number of occurrences in both schemas, and iterate on those pairs in order

That would work, and I'm not really opposed to it, though it seems like it would be very rare that this is the correct behavior. I think we would just be hiding the corner case rather than really resolving it. Either a user is creating files with consistent column ordering, in which case duplicates are fine, or they are not, in which case duplicates are a problem. It would be rather odd for a user to have inconsistent column ordering except in the case of duplicate column names.

Indeed, we shouldn't disallow this since Parquet itself allows duplicate field names. And our Parquet reader actually also can read such files (it's only the dataset code that fails):

Yes, it is not a problem at all when you only have one file, and I agree the datasets code should be updated to handle this single-file case correctly.
