[Python] KeyError: '__index_level_0__' passing Table.from_pandas its own schema #23313

@asfimport

Description

Steps to reproduce:

  1. Generate any DataFrame's pyarrow Schema using Table.from_pandas

  2. Pass the generated schema as input into Table.from_pandas

  3. This raises KeyError: '__index_level_0__'

    We did not have this issue with pyarrow==0.11.0, which we used to write many partitions across years. Our goal now is to use pyarrow==0.15.0 and produce schemas going forward that are backwards compatible (i.e. that also contain '__index_level_0__'), so that we do not need to regenerate all prior years' partitions when we migrate to 0.15.0.

    We cannot set preserve_index=False, since that effectively drops '__index_level_0__', which would make the schema inconsistent with the earlier partitions written using pyarrow==0.11.0.

     

    import pandas as pd
    import pyarrow as pa
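    df = pd.DataFrame()
    # The generated schema contains the index field '__index_level_0__'.
    schema = pa.Table.from_pandas(df).schema
    # Passing that same schema back raises the KeyError shown below.
    pa_table = pa.Table.from_pandas(df, schema=schema)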
    
    Traceback (most recent call last):
      File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
        return self._engine.get_loc(key)
      File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
      File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
    KeyError: '__index_level_0__'
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
      File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 408, in _get_columns_to_convert_given_schema
        col = df[name]
      File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", line 2688, in __getitem__
        return self._getitem_column(key)
      File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", line 2695, in _getitem_column
        return self._get_item_cache(key)
      File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/generic.py", line 2489, in _get_item_cache
        values = self._data.get(item)
      File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/internals.py", line 4115, in get
        loc = self.items.get_loc(item)
      File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
        return self._engine.get_loc(self._maybe_cast_indexer(key))
      File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
      File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
    KeyError: '__index_level_0__'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
        exec(code_obj, self.user_global_ns, self.user_ns)
      File "<ipython-input-36-6711a2fcec96>", line 5, in <module>
        pa_table = pa.Table.from_pandas(df, schema=pa.Table.from_pandas(df).schema)
      File "pyarrow/table.pxi", line 1057, in pyarrow.lib.Table.from_pandas
      File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 517, in dataframe_to_arrays
        columns)
      File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 337, in _get_columns_to_convert
        return _get_columns_to_convert_given_schema(df, schema, preserve_index)
      File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 426, in _get_columns_to_convert_given_schema
        "in the columns or index".format(name))
    KeyError: "name '__index_level_0__' present in the specified schema is not found in the columns or index"
    

Environment: pandas==0.23.4
pyarrow==0.15.0 (the issue also occurs with 0.14.0, 0.13.0 and 0.12.0, but not with 0.11.0)

Reporter: Tom Goodman
Assignee: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-6999. Please see the migration documentation for further details.
