Description
While testing duplicate column names, I ran into multiple issues:
- Factory fails if there are duplicate columns, even for a single file
- In addition, the factory should be fixed and/or tested to work with duplicate columns when the schemas are equal
- Once a Dataset with duplicated columns is created, scanning without any column projection fails
—
My python reproducer:
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pyarrow.fs
# create single parquet file with duplicated column names
table = pa.table([pa.array([1, 2, 3]), pa.array([4, 5, 6]), pa.array([7, 8, 9])], names=['a', 'b', 'a'])
pq.write_table(table, "data_duplicate_columns.parquet")

Factory fails:
dataset = ds.dataset("data_duplicate_columns.parquet", format="parquet")
...
~/scipy/repos/arrow/python/pyarrow/dataset.py in dataset(paths_or_factories, filesystem, partitioning, format)
    346
    347     factories = [_ensure_factory(f, **kwargs) for f in paths_or_factories]
--> 348     return UnionDatasetFactory(factories).finish()
    349
    350
ArrowInvalid: Can't unify schema with duplicate field names.

And when creating a Dataset manually:
schema = pa.schema([('a', 'int64'), ('b', 'int64'), ('a', 'int64')])
dataset = ds.FileSystemDataset(
    schema, None, ds.ParquetFileFormat(), pa.fs.LocalFileSystem(),
    [str(basedir / "data_duplicate_columns.parquet")], [ds.ScalarExpression(True)])

then scanning fails:
>>> dataset.to_table()
...
ArrowInvalid: Multiple matches for FieldRef.Name(a) in a: int64
b: int64
a: int64

Reporter: Joris Van den Bossche / @jorisvandenbossche
Related issues:
- [Python] Prevent corrupting files with Multiple matches for FieldRef.Name (is related to)
- [C++][Dataset] Ensure that dataset code is robust to schemas with duplicate field names (is related to)
Note: This issue was originally created as ARROW-8210. Please see the migration documentation for further details.