Description
Apologies if this is already covered somewhere in the documentation; I've tried to look through it and through the issues here and on JIRA. If it is, feel free to close this issue.
Is it possible to have a Pandas dataset with categorical variables, save it to Parquet, and read in those variables as categorical again?
Setup:
In [1]: import fastparquet as fp
In [2]: import pyarrow as pa
In [3]: import pyarrow.parquet as pq
In [4]: import pandas as pd
In [5]: df = pd.DataFrame({'A':['a','b','c','a']})
In [6]: df['B'] = df['A'].astype('category')
In [7]: df.dtypes
Out[7]:
A object
B category
dtype: object

Using pandas to read/write, only the fastparquet engine reads B back as categorical, and interestingly it does so even when the file was written by pyarrow:
In [11]: df.to_parquet('pa.parquet', engine='pyarrow')
...: df.to_parquet('fp.parquet', engine='fastparquet')
In [12]: pd.read_parquet('pa.parquet', engine='pyarrow').dtypes
Out[12]:
A object
B object
dtype: object
In [13]: pd.read_parquet('pa.parquet', engine='fastparquet').dtypes
Out[13]:
A object
B category
dtype: object
In [14]: pd.read_parquet('fp.parquet', engine='pyarrow').dtypes
Out[14]:
A object
B object
dtype: object
In [15]: pd.read_parquet('fp.parquet', engine='fastparquet').dtypes
Out[15]:
A object
B category
dtype: object

I'm guessing that fastparquet reads the pandas_type entry in the metadata and sees "categorical":
In [16]: pq.ParquetFile('pa.parquet').read(use_pandas_metadata=True)
Out[16]:
pyarrow.Table
A: string
B: string
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
b' [{"name": "A", "field_name": "A", "pandas_type": "unicode", "nu'
b'mpy_type": "object", "metadata": null}, {"name": "B", "field_nam'
b'e": "B", "pandas_type": "categorical", "numpy_type": "int8", "me'
b'tadata": {"num_categories": 3, "ordered": false}}, {"name": null'
b', "field_name": "__index_level_0__", "pandas_type": "int64", "nu'
b'mpy_type": "int64", "metadata": null}], "pandas_version": "0.22.'
b'0"}'}It seems that the strings_to_categorical option of pa.Table.to_pandas() doesn't work in this situation either (maybe I'm using it wrong; also I'd prefer to only read as categorical the columns that were originally categorical in Pandas):
It seems that the strings_to_categorical option of pa.Table.to_pandas() doesn't work in this situation either (maybe I'm using it wrong; also, I'd prefer to read as categorical only the columns that were originally categorical in pandas):

In [17]: table = pq.ParquetFile('pa.parquet').read(use_pandas_metadata=True)
...: table.to_pandas(strings_to_categorical=True)
...:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-42387e24e864> in <module>()
1 table = pq.ParquetFile('pa.parquet').read(use_pandas_metadata=True)
----> 2 table.to_pandas(strings_to_categorical=True)
~/local/anaconda3/lib/python3.6/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.to_pandas (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:46331)()
~/local/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, memory_pool, nthreads, categoricals)
526 )
527
--> 528 blocks = _table_to_blocks(options, block_table, nthreads, memory_pool)
529
530 # Construct the row index
~/local/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py in _table_to_blocks(options, block_table, nthreads, memory_pool)
620
621 # Defined above
--> 622 return [_reconstruct_block(item) for item in result]
623
624
~/local/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py in <listcomp>(.0)
620
621 # Defined above
--> 622 return [_reconstruct_block(item) for item in result]
623
624
~/local/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py in _reconstruct_block(item)
434 cat = pd.Categorical.from_codes(block_arr,
435 categories=item['dictionary'],
--> 436 ordered=item['ordered'])
437 block = _int.make_block(cat, placement=placement,
438 klass=_int.CategoricalBlock,
~/local/anaconda3/lib/python3.6/site-packages/pandas/core/categorical.py in from_codes(cls, codes, categories, ordered)
619
620 if len(codes) and (codes.max() >= len(categories) or codes.min() < -1):
--> 621 raise ValueError("codes need to be between -1 and "
622 "len(categories)-1")
623
ValueError: codes need to be between -1 and len(categories)-1
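As a workaround, the categoricals could be restored by hand after reading, casting only the columns that the footer metadata marks as categorical. This is just a sketch building on the snippet above, not built-in pyarrow functionality:

import json
import pandas as pd
import pyarrow.parquet as pq

pf = pq.ParquetFile('pa.parquet')
df = pf.read(use_pandas_metadata=True).to_pandas()

# Re-apply the 'category' dtype only where the pandas metadata in the
# footer says the column was categorical in the original DataFrame
pandas_meta = json.loads(pf.metadata.metadata[b'pandas'].decode('utf-8'))
for c in pandas_meta['columns']:
    if c['pandas_type'] == 'categorical' and c['name'] in df.columns:
        df[c['name']] = df[c['name']].astype('category')

df.dtypes  # A: object, B: category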
It seems that the pd.DataFrame > pa.Table > pd.DataFrame conversion keeps categoricals as an Arrow dictionary type, but that the pa.Table > .parquet > pa.Table round trip loses them:

In [22]: table = pa.Table.from_pandas(df)
In [23]: table.schema
Out[23]:
A: string
B: dictionary<values=string, indices=int8, ordered=0>
dictionary: ["a", "b", "c"]
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
b' [{"name": "A", "field_name": "A", "pandas_type": "unicode", "nu'
b'mpy_type": "object", "metadata": null}, {"name": "B", "field_nam'
b'e": "B", "pandas_type": "categorical", "numpy_type": "int8", "me'
b'tadata": {"num_categories": 3, "ordered": false}}, {"name": null'
b', "field_name": "__index_level_0__", "pandas_type": "int64", "nu'
b'mpy_type": "int64", "metadata": null}], "pandas_version": "0.22.'
b'0"}'}
In [24]: table.to_pandas().dtypes
Out[24]:
A object
B category
dtype: object
In [25]: pq.write_table(table, 'table.parquet')
In [26]: pq.ParquetFile('table.parquet').read()
Out[26]:
pyarrow.Table
A: string
B: string
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
b' [{"name": "A", "field_name": "A", "pandas_type": "unicode", "nu'
b'mpy_type": "object", "metadata": null}, {"name": "B", "field_nam'
b'e": "B", "pandas_type": "categorical", "numpy_type": "int8", "me'
b'tadata": {"num_categories": 3, "ordered": false}}, {"name": null'
b', "field_name": "__index_level_0__", "pandas_type": "int64", "nu'
b'mpy_type": "int64", "metadata": null}], "pandas_version": "0.22.'
b'0"}'}Using fastparquet works well at the moment, but
- Some files written by fastparquet cannot be read by pyarrow; I get the trace below. I'm working with restricted data, but I could try to reproduce this error with sample data (see the sketch after this list).
In [3]: df = pf.read()
---------------------------------------------------------------------------
ArrowIOError                              Traceback (most recent call last)
<ipython-input-3-910db268c9c9> in <module>()
----> 1 df = pf.read()
~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, use_pandas_metadata)
    140                 columns, use_pandas_metadata=use_pandas_metadata)
    141         return self.reader.read_all(column_indices=column_indices,
--> 142                                     nthreads=nthreads)
    143
    144     def scan_contents(self, columns=None, batch_size=65536):
~/local/anaconda3/lib/python3.6/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.read_all (/arrow/python/build/temp.linux-x86_64-3.6/_parquet.cxx:12865)()
~/local/anaconda3/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:8345)()
ArrowIOError: Unexpected end of stream.
- I'm trying to write a wrapper around the Parquet C++ library to read/write Parquet files from Stata, so I'd need all my files to be readable by pyarrow.
- fastparquet is single-threaded, so I suppose pyarrow could be faster when using multiple cores.
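For the first point, a reproduction attempt might look like the sketch below. I haven't confirmed that this exact snippet triggers the ArrowIOError, since the files that actually fail contain restricted data (the file name fp_repro.parquet is only illustrative):

import fastparquet as fp
import pandas as pd
import pyarrow.parquet as pq

# Round trip: write a categorical frame with fastparquet,
# then read it back with pyarrow
df = pd.DataFrame({'A': ['a', 'b', 'c', 'a']})
df['B'] = df['A'].astype('category')
fp.write('fp_repro.parquet', df)

pf = pq.ParquetFile('fp_repro.parquet')
df2 = pf.read().to_pandas()  # the failing files raise ArrowIOError here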
This was all done with the following versions:
In [18]: pa.__version__
Out[18]: '0.8.0'
In [20]: fp.__version__
Out[20]: '0.1.4'
In [21]: pd.__version__
Out[21]: '0.22.0'