[C++][Dataset] Handling of duplicate columns in Dataset factory and scanning #24407

@asfimport

Description

While testing duplicate column names, I ran into multiple issues:

  • The factory fails if a file contains duplicate column names, even for a single file
  • In addition, we should fix and/or test that the factory works for duplicate columns when the schemas are equal
  • Once a Dataset with duplicate columns is created, scanning without any column projection fails

My Python reproducer:

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pyarrow.fs

# create single parquet file with duplicated column names
table = pa.table([pa.array([1, 2, 3]), pa.array([4, 5, 6]), pa.array([7, 8, 9])], names=['a', 'b', 'a'])
pq.write_table(table, "data_duplicate_columns.parquet")

Factory fails:

dataset = ds.dataset("data_duplicate_columns.parquet", format="parquet")
...
~/scipy/repos/arrow/python/pyarrow/dataset.py in dataset(paths_or_factories, filesystem, partitioning, format)
    346 
    347     factories = [_ensure_factory(f, **kwargs) for f in paths_or_factories]
--> 348     return UnionDatasetFactory(factories).finish()
    349 
    350 

ArrowInvalid: Can't unify schema with duplicate field names.
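Until the factory handles this case, it can help to check up front whether a file's schema contains repeated names. The following is a minimal sketch of such a check, not part of pyarrow: `duplicate_field_names` is a hypothetical helper that works on a plain list of names, such as the one returned by `table.schema.names`.

```python
from collections import Counter


def duplicate_field_names(names):
    """Return the names that occur more than once, in first-seen order.

    Hypothetical helper for illustration; `names` is any list of column
    names, for example `table.schema.names` from a pyarrow Table.
    """
    counts = Counter(names)
    return [name for name, n in counts.items() if n > 1]


# Column names from the reproducer above.
print(duplicate_field_names(["a", "b", "a"]))
```

An empty result means the names are unique, so schema unification should not hit this particular error.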

And when creating a Dataset manually:

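from pathlib import Path

# Assumption: basedir is the directory containing the Parquet file written above
basedir = Path(".")

schema = pa.schema([('a', 'int64'), ('b', 'int64'), ('a', 'int64')])
dataset = ds.FileSystemDataset(
    schema, None, ds.ParquetFileFormat(), pa.fs.LocalFileSystem(),
    [str(basedir / "data_duplicate_columns.parquet")], [ds.ScalarExpression(True)])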

then scanning fails:

>>> dataset.to_table()
...
ArrowInvalid: Multiple matches for FieldRef.Name(a) in a: int64
b: int64
a: int64
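The scan fails because a lookup by name cannot choose between the two `a` fields. One possible workaround, which is an assumption on my side and not the fix this issue asks for, is to make the names unique before writing the file. The sketch below uses a hypothetical helper `make_unique` that appends a numeric suffix to repeated names; the result could then be passed to `table.rename_columns(...)`.

```python
def make_unique(names):
    """Return a copy of `names` where repeated entries get a numeric suffix.

    Hypothetical helper for illustration. A repeated name becomes
    "<name>_1", "<name>_2", ... using the first suffix not already taken.
    """
    used = set()
    result = []
    for name in names:
        candidate = name
        suffix = 0
        # Keep incrementing the suffix until the candidate is unused.
        while candidate in used:
            suffix += 1
            candidate = f"{name}_{suffix}"
        used.add(candidate)
        result.append(candidate)
    return result


# Column names from the reproducer above.
print(make_unique(["a", "b", "a"]))
```

This only sidesteps the error for data you write yourself; files that already contain duplicate names still need the factory and scanner to handle them.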

Reporter: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-8210. Please see the migration documentation for further details.
