[C++][Parquet] Cannot read Parquet file with map column generated by pyspark / parquet-mr < 1.12 #39540

@bama-chi

Describe the bug, including details regarding any error messages, version, and platform.

I'm trying to read a Parquet file with pandas using the 'pyarrow' engine, and the read fails.
The stack trace:

  File "<stdin>", line 1, in <module>
  File "/home/bama/.pyenv/versions/3.10.4/lib/python3.10/site-packages/pandas/io/parquet.py", line 501, in read_parquet
    return impl.read(
  File "/home/bama/.pyenv/versions/3.10.4/lib/python3.10/site-packages/pandas/io/parquet.py", line 249, in read
    result = self.api.parquet.read_table(
  File "/home/bama/.pyenv/versions/3.10.4/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 2956, in read_table
    dataset = _ParquetDatasetV2(
  File "/home/bama/.pyenv/versions/3.10.4/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 2496, in __init__
    [fragment], schema=schema or fragment.physical_schema,
  File "pyarrow/_dataset.pyx", line 1358, in pyarrow._dataset.Fragment.physical_schema.__get__
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: Could not open Parquet input source '<Buffer>': Logical type Null can not be applied to group node

Here is the schema of the Parquet file that I'm trying to read, as it appears in the file's footer metadata:

org.apache.spark.version: 2.4.7
org.apache.spark.sql.parquet.row.metadata: {"type":"struct","fields":[{"name":"id","type":"string","nullable":true,"metadata":{}},{"name":"uid","type":"string","nullable":true,"metadata":{}},{"name":"params","type":{"type":"map","keyType":"string","valueType":{"type":"array","elementType":"string","containsNull":true},"valueContainsNull":true},"nullable":true,"metadata":{}},{"name":"utc_date","type":"timestamp","nullable":true,"metadata":{}},{"name":"host","type":"string","nullable":true,"metadata":{}},{"name":"customer_id","type":"string","nullable":true,"metadata":{}}]}
created_by: parquet-mr version 1.10.99.7.1.7.0-550 (build 27a2f693f9b09573ead42e85bee2a649ac904119)
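
To make the Spark schema above easier to read, here is a small sketch that parses the JSON from the footer with the standard library and prints one line per column. The helper name `describe` and the `map<...>` / `array<...>` rendering are my own illustrative choices, not anything Spark or pyarrow produce. It only shows that `params` is the one nested (map) column in the file; it does not reproduce the error.

```python
import json

# Spark schema JSON copied verbatim from the footer metadata above,
# split across lines only for readability.
SPARK_SCHEMA = (
    '{"type":"struct","fields":['
    '{"name":"id","type":"string","nullable":true,"metadata":{}},'
    '{"name":"uid","type":"string","nullable":true,"metadata":{}},'
    '{"name":"params","type":{"type":"map","keyType":"string",'
    '"valueType":{"type":"array","elementType":"string","containsNull":true},'
    '"valueContainsNull":true},"nullable":true,"metadata":{}},'
    '{"name":"utc_date","type":"timestamp","nullable":true,"metadata":{}},'
    '{"name":"host","type":"string","nullable":true,"metadata":{}},'
    '{"name":"customer_id","type":"string","nullable":true,"metadata":{}}]}'
)


def describe(dtype):
    """Render a Spark JSON type as a short string (hypothetical helper)."""
    if isinstance(dtype, str):
        # Primitive types are stored as plain strings, e.g. "string".
        return dtype
    kind = dtype["type"]
    if kind == "map":
        return f"map<{describe(dtype['keyType'])}, {describe(dtype['valueType'])}>"
    if kind == "array":
        return f"array<{describe(dtype['elementType'])}>"
    if kind == "struct":
        inner = ", ".join(
            f"{field['name']}: {describe(field['type'])}" for field in dtype["fields"]
        )
        return f"struct<{inner}>"
    return kind


# Map each top-level column name to its rendered type.
columns = {
    field["name"]: describe(field["type"])
    for field in json.loads(SPARK_SCHEMA)["fields"]
}

for name, rendered in columns.items():
    print(f"{name}: {rendered}")
```

Every column except `params` is a flat string or timestamp; `params` is a map from string to an array of strings.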

By contrast, reading the same file with fastparquet works without any problem.
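
Since fastparquet reads the file while pyarrow raises `OSError`, one possible stopgap is to try the engines in order. This is only a sketch under that assumption: `read_parquet_with_fallback` and its `reader` parameter are hypothetical names I introduced (the `reader` hook exists so the logic can be exercised without a real file), and the only real API used is `pandas.read_parquet(path, engine=...)`.

```python
def read_parquet_with_fallback(path, engines=("pyarrow", "fastparquet"), reader=None):
    """Try each engine in order and return (result, engine_that_worked).

    Hypothetical helper: `reader` defaults to pandas.read_parquet and is
    injectable only so the fallback logic can be tested in isolation.
    """
    if reader is None:
        import pandas as pd  # imported lazily so the module loads without pandas

        reader = pd.read_parquet

    errors = {}
    for engine in engines:
        try:
            return reader(path, engine=engine), engine
        except OSError as exc:
            # The failure reported above surfaces as OSError from pyarrow.
            errors[engine] = exc
    raise OSError(f"all engines failed for {path!r}: {errors}")


# Usage, with a hypothetical file name:
#   df, engine = read_parquet_with_fallback("events.parquet")
```

This does not fix the underlying pyarrow problem; it only keeps a pipeline running while the file cannot be opened by that engine. If fastparquet is not installed, pandas raises `ImportError`, which this sketch deliberately does not catch.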

pandas version: 1.5.0
pyarrow version: 14.0.1

Component(s)

Python
