pyarrow / pandas support for tensors (multi-dimensional arrays) #4802

@pwais

pyarrow appears to have a tensor type: https://arrow.apache.org/docs/python/generated/pyarrow.Tensor.html#pyarrow.Tensor

But multi-dimensional numpy arrays do not get translated to Tensors when they are stored as values in a pandas DataFrame column:

if __name__ == '__main__':
  import numpy as np
  import pandas as pd
  a = np.array([
                [[1, 2],[1, 2],[1, 2]],
                [[1, 2],[1, 2],[1, 2]],
                [[1, 2],[1, 2],[1, 2]]])
  df = pd.DataFrame({'x':range(3), 'y':['a','b','b'], 'z':[a, a, a]})
  print(df)

  import pyarrow as pa
  import pyarrow.parquet as pq
  print(pa.__version__)

  table = pa.Table.from_pandas(df)
  pq.write_to_dataset(
        table,
        '/tmp/rows')

results in:

# python3 yay.py 
   x  y                                                  z
0  0  a  [[[1, 2], [1, 2], [1, 2]], [[1, 2], [1, 2], [1...
1  1  b  [[[1, 2], [1, 2], [1, 2]], [[1, 2], [1, 2], [1...
2  2  b  [[[1, 2], [1, 2], [1, 2]], [[1, 2], [1, 2], [1...
0.14.0
Traceback (most recent call last):
  File "yay.py", line 15, in <module>
    table = pa.Table.from_pandas(df)
  File "pyarrow/table.pxi", line 1174, in pyarrow.lib.Table.from_pandas
  File "/usr/local/lib/python3.6/dist-packages/pyarrow/pandas_compat.py", line 496, in dataframe_to_arrays
    for c, f in zip(columns_to_convert, convert_fields)]
  File "/usr/local/lib/python3.6/dist-packages/pyarrow/pandas_compat.py", line 496, in <listcomp>
    for c, f in zip(columns_to_convert, convert_fields)]
  File "/usr/local/lib/python3.6/dist-packages/pyarrow/pandas_compat.py", line 487, in convert_column
    raise e
  File "/usr/local/lib/python3.6/dist-packages/pyarrow/pandas_compat.py", line 481, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 191, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 78, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ('Can only convert 1-dimensional array values', 'Conversion failed for column z with type object')

Might we be able to simply patch pandas_compat.py to do Tensor conversions?
