[Python] Map data types doesn't work from Arrow to Parquet #25855

@asfimport

Hi,

I'm having problems using 'map' data type in Arrow/parquet/pandas.

I'm able to convert a pandas data frame to Arrow with a map data type.

When I write Arrow to Parquet, it seems to work, but I'm not sure if the data type is written correctly.

When I read back Parquet to Arrow, it fails saying "reading list of structs" is not supported. It seems that map is stored as list of structs.
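To make the "map is stored as list of structs" observation concrete, here is a rough plain-Python sketch. It makes no pyarrow calls, and `map_to_entries` and `entries_to_map` are hypothetical helper names used only for illustration. It mirrors the `entries: struct<key, value>` layout that shows up in the schema output further down; it is not how Arrow or Parquet implement it internally.

```python
# Conceptual sketch only: plain Python, no pyarrow involved.
# `map_to_entries` and `entries_to_map` are hypothetical helpers.

def map_to_entries(mapping):
    """Encode one map value as a list of {'key', 'value'} structs."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


def entries_to_map(entries):
    """Decode a list of key/value structs back into a dict."""
    return {e["key"]: e["value"] for e in entries}


row = {"b": "2"}                 # one map<string, string> value
entries = map_to_entries(row)    # its list-of-structs form
print(entries)                   # [{'key': 'b', 'value': '2'}]
print(entries_to_map(entries) == row)  # True
```

A reader that cannot handle a list of structs therefore cannot rebuild the map either, which would be consistent with the read error shown below.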

There are two problems here:

  1. The map data type doesn't convert from Arrow -> Pandas. Fixed in ARROW-10151.

  2. The map data type doesn't round-trip through Parquet: writing Arrow -> Parquet appears to succeed, but reading Parquet -> Arrow fails.

    Questions:

    1. Am I doing something wrong? Is there a way to get these to work? 

    2. If these are unsupported features, will this be fixed in a future version? Do you have plans or an ETA?

    The following code example (followed by output) should demonstrate the issues:

    I'm using Arrow 1.0.0 and Pandas 1.0.5.

    Thanks!

    Mayur

    $ cat arrowtest.py
    
    import pyarrow as pa
    import pandas as pd
    import pyarrow.parquet as pq
    import traceback as tb
    import io
    
    print(f'PyArrow Version = {pa.__version__}')
    print(f'Pandas Version = {pd.__version__}')
    
    df1 = pd.DataFrame({'a': [[('b', '2')]]})
    print(f'df1')
    print(f'{df1}')
    
    print(f'Pandas -> Arrow')
    try:
        t1 = pa.Table.from_pandas(df1, schema=pa.schema([pa.field('a', pa.map_(pa.string(), pa.string()))]))
        print('PASSED')
        print(t1)
    except:
        print(f'FAILED')
        tb.print_exc()
    
    print(f'Arrow -> Pandas')
    try:
        t1.to_pandas()
        print('PASSED')
    except:
        print(f'FAILED')
        tb.print_exc()
    
    print(f'Arrow -> Parquet')
    
    fh = io.BytesIO()
    try:
        pq.write_table(t1, fh)
        print('PASSED')
    except:
        print('FAILED')
        tb.print_exc()
        
    print(f'Parquet -> Arrow')
    try:
        t2 = pq.read_table(source=fh)
        print('PASSED')
        print(t2)
    except:
        print('FAILED')
        tb.print_exc()
    $ python3.6 arrowtest.py
    PyArrow Version = 1.0.0
    Pandas Version = 1.0.5
    df1
              a
    0  [(b, 2)]

    Pandas -> Arrow
    PASSED
    pyarrow.Table
    a: map<string, string>
      child 0, entries: struct<key: string not null, value: string> not null
          child 0, key: string not null
          child 1, value: string

    Arrow -> Pandas
    FAILED
    Traceback (most recent call last):
      File "arrowtest.py", line 26, in <module>
        t1.to_pandas()
      File "pyarrow/array.pxi", line 715, in pyarrow.lib._PandasConvertible.to_pandas
      File "pyarrow/table.pxi", line 1565, in pyarrow.lib.Table._to_pandas
      File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 779, in table_to_blockmanager
        blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
      File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 1115, in _table_to_blocks
        list(extension_columns.keys()))
      File "pyarrow/table.pxi", line 1028, in pyarrow.lib.table_to_blocks
      File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
    pyarrow.lib.ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of type map<string, string> is known.

    Arrow -> Parquet
    PASSED

    Parquet -> Arrow
    FAILED
    Traceback (most recent call last):
      File "arrowtest.py", line 43, in <module>
        t2 = pq.read_table(source=fh)
      File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1586, in read_table
        use_pandas_metadata=use_pandas_metadata)
      File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1474, in read
        use_threads=use_threads
      File "pyarrow/_dataset.pyx", line 399, in pyarrow._dataset.Dataset.to_table
      File "pyarrow/_dataset.pyx", line 1994, in pyarrow._dataset.Scanner.to_table
      File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
      File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
    pyarrow.lib.ArrowNotImplementedError: Reading lists of structs from Parquet files not yet supported: key_value: list<key_value: struct<key: string not null, value: string> not null> not null

    Updated to indicate that the conversion to Pandas is now done, but the Parquet conversion is not yet.

Reporter: Mayur Srivastava / @mayuropensource

Note: This issue was originally created as ARROW-9812. Please see the migration documentation for further details.
