
Possible to read categoricals back into Pandas from Parquet using Pyarrow? #1688


Description

@kylebarron

Apologies if this is answered somewhere in the documentation; I've looked through it and through the issues here and on JIRA without luck. If so, feel free to close this issue.

Is it possible to have a Pandas dataset with categorical variables, save it to Parquet, and read in those variables as categorical again?

Setup:

In [1]: import fastparquet as fp
In [2]: import pyarrow as pa
In [3]: import pyarrow.parquet as pq
In [4]: import pandas as pd
In [5]: df = pd.DataFrame({'A':['a','b','c','a']})
In [6]: df['B'] = df['A'].astype('category')
In [7]: df.dtypes
Out[7]: 
A      object
B    category
dtype: object

Round-tripping through pandas, B comes back as categorical only with the fastparquet engine; interestingly, fastparquet recovers it even when the file was written by pyarrow:

In [11]: df.to_parquet('pa.parquet', engine='pyarrow')
    ...: df.to_parquet('fp.parquet', engine='fastparquet')

In [12]: pd.read_parquet('pa.parquet', engine='pyarrow').dtypes
Out[12]: 
A    object
B    object
dtype: object

In [13]: pd.read_parquet('pa.parquet', engine='fastparquet').dtypes
Out[13]: 
A      object
B    category
dtype: object

In [14]: pd.read_parquet('fp.parquet', engine='pyarrow').dtypes
Out[14]: 
A    object
B    object
dtype: object

In [15]: pd.read_parquet('fp.parquet', engine='fastparquet').dtypes
Out[15]: 
A      object
B    category
dtype: object

I'm guessing that fastparquet reads the pandas_type field in the pandas metadata and sees "categorical".

In [16]: pq.ParquetFile('pa.parquet').read(use_pandas_metadata=True)
Out[16]: 
pyarrow.Table
A: string
B: string
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "A", "field_name": "A", "pandas_type": "unicode", "nu'
            b'mpy_type": "object", "metadata": null}, {"name": "B", "field_nam'
            b'e": "B", "pandas_type": "categorical", "numpy_type": "int8", "me'
            b'tadata": {"num_categories": 3, "ordered": false}}, {"name": null'
            b', "field_name": "__index_level_0__", "pandas_type": "int64", "nu'
            b'mpy_type": "int64", "metadata": null}], "pandas_version": "0.22.'
            b'0"}'}

It seems that the strings_to_categorical option of pa.Table.to_pandas() doesn't work in this situation either (maybe I'm using it wrong; in any case I'd prefer to read back as categorical only the columns that were originally categorical in pandas):

In [17]: table = pq.ParquetFile('pa.parquet').read(use_pandas_metadata=True)
    ...: table.to_pandas(strings_to_categorical=True)
    ...: 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-42387e24e864> in <module>()
      1 table = pq.ParquetFile('pa.parquet').read(use_pandas_metadata=True)
----> 2 table.to_pandas(strings_to_categorical=True)

~/local/anaconda3/lib/python3.6/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.to_pandas (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:46331)()

~/local/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, memory_pool, nthreads, categoricals)
    526             )
    527 
--> 528     blocks = _table_to_blocks(options, block_table, nthreads, memory_pool)
    529 
    530     # Construct the row index

~/local/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py in _table_to_blocks(options, block_table, nthreads, memory_pool)
    620 
    621     # Defined above
--> 622     return [_reconstruct_block(item) for item in result]
    623 
    624 

~/local/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py in <listcomp>(.0)
    620 
    621     # Defined above
--> 622     return [_reconstruct_block(item) for item in result]
    623 
    624 

~/local/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py in _reconstruct_block(item)
    434         cat = pd.Categorical.from_codes(block_arr,
    435                                         categories=item['dictionary'],
--> 436                                         ordered=item['ordered'])
    437         block = _int.make_block(cat, placement=placement,
    438                                 klass=_int.CategoricalBlock,

~/local/anaconda3/lib/python3.6/site-packages/pandas/core/categorical.py in from_codes(cls, codes, categories, ordered)
    619 
    620         if len(codes) and (codes.max() >= len(categories) or codes.min() < -1):
--> 621             raise ValueError("codes need to be between -1 and "
    622                              "len(categories)-1")
    623 

ValueError: codes need to be between -1 and len(categories)-1
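
As a stopgap I can cast the column back by hand after reading. A workaround sketch, not a real fix, since the categories get re-inferred from the values rather than restored from the metadata:

import pyarrow.parquet as pq

# Read B back as plain strings, then rebuild the categorical in pandas.
# Unused categories and any ordering from the original frame are lost.
df = pq.ParquetFile('pa.parquet').read(use_pandas_metadata=True).to_pandas()
df['B'] = df['B'].astype('category')
df.dtypes  # B is category again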

It seems that the pd.DataFrame > pa.Table > pd.DataFrame conversion keeps categoricals as an Arrow dictionary type, but that the pa.Table > .parquet > pa.Table round trip loses them.

In [22]: table = pa.Table.from_pandas(df)

In [23]: table.schema
Out[23]: 
A: string
B: dictionary<values=string, indices=int8, ordered=0>
  dictionary: ["a", "b", "c"]
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "A", "field_name": "A", "pandas_type": "unicode", "nu'
            b'mpy_type": "object", "metadata": null}, {"name": "B", "field_nam'
            b'e": "B", "pandas_type": "categorical", "numpy_type": "int8", "me'
            b'tadata": {"num_categories": 3, "ordered": false}}, {"name": null'
            b', "field_name": "__index_level_0__", "pandas_type": "int64", "nu'
            b'mpy_type": "int64", "metadata": null}], "pandas_version": "0.22.'
            b'0"}'}

In [24]: table.to_pandas().dtypes
Out[24]: 
A      object
B    category
dtype: object

In [25]: pq.write_table(table, 'table.parquet')

In [26]: pq.ParquetFile('table.parquet').read()
Out[26]: 
pyarrow.Table
A: string
B: string
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "A", "field_name": "A", "pandas_type": "unicode", "nu'
            b'mpy_type": "object", "metadata": null}, {"name": "B", "field_nam'
            b'e": "B", "pandas_type": "categorical", "numpy_type": "int8", "me'
            b'tadata": {"num_categories": 3, "ordered": false}}, {"name": null'
            b', "field_name": "__index_level_0__", "pandas_type": "int64", "nu'
            b'mpy_type": "int64", "metadata": null}], "pandas_version": "0.22.'
            b'0"}'}

Using fastparquet works well at the moment, but

  1. Some files written by fastparquet can't be read by pyarrow. I get the trace
    In [3]: df = pf.read()
    ---------------------------------------------------------------------------
    ArrowIOError                              Traceback (most recent call last)
    <ipython-input-3-910db268c9c9> in <module>()
    ----> 1 df = pf.read()
    
    ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, use_pandas_metadata)
        140             columns, use_pandas_metadata=use_pandas_metadata)
        141         return self.reader.read_all(column_indices=column_indices,
    --> 142                                     nthreads=nthreads)
        143 
        144     def scan_contents(self, columns=None, batch_size=65536):
    
    ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.read_all (/arrow/python/build/temp.linux-x86_64-3.6/_parquet.cxx:12865)()
    
    ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:8345)()
    
    ArrowIOError: Unexpected end of stream.
    I'm working with restricted data, but I could try to reproduce this error with sample data.
  2. I'm writing a wrapper around the Parquet C++ library to read/write Parquet files from Stata, so I need all my files to be readable by pyarrow.
  3. Fastparquet is single-threaded, so I suppose pyarrow could be faster when using multiple cores (quick sketch below).
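
On point 3, the read path in this pyarrow version already takes an nthreads argument (it shows up in the ParquetFile.read signature in the traceback above), so a multi-core read should be as simple as this sketch:

import pyarrow.parquet as pq

# nthreads appears in the ParquetFile.read signature shown in the
# traceback above; this asks pyarrow to decode columns in parallel.
df = pq.ParquetFile('pa.parquet').read(nthreads=4).to_pandas()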

This was done using versions:

In [18]: pa.__version__
Out[18]: '0.8.0'

In [20]: fp.__version__
Out[20]: '0.1.4'

In [21]: pd.__version__
Out[21]: '0.22.0'
