
[Python] Add from_numpy_ndarray and to_numpy_ndarray to ListArray types #35747

@spenczar

Describe the enhancement requested

Interoperation between numpy ndarrays and Arrow's list array types (ListArray, LargeListArray, FixedSizeListArray) is awkward in both directions.

It's hard to construct values: one must convert to a Python list-of-lists first, which is unnecessarily expensive:

>>> import numpy as np
>>> import pyarrow as pa
>>> np_values = np.ones((3, 2), np.float64)
>>> pa_dtype = pa.list_(pa.float64())
>>> pa_values = pa.array(np_values, type=pa_dtype)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 323, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: only handle 1-dimensional arrays
>>> pa_values = pa.array(np_values.tolist(), type=pa_dtype)
>>> pa_values
<pyarrow.lib.ListArray object at 0x11ba433a0>
[
  [
    1,
    1
  ],
  [
    1,
    1
  ],
  [
    1,
    1
  ]
]

Likewise, converting a PyArrow list array to a numpy ndarray is tricky, as described in #35622. That issue focuses on FixedSizeListArray, but the same applies to ListArray, whose entries often all have the same length, making them amenable to presentation as an ndarray.

I'd like to propose the following 6 new methods:

  • FixedSizeListArray.from_numpy_ndarray(values, type):
    Constructs a new FixedSizeListArray from values, which must be a numpy ndarray with ndim == 2.
    type is optional; it will be looked up from the ndarray's dtype if unset.
    If type is set, values of the ndarray's dtype must be convertible to the provided type.

  • FixedSizeListArray.to_numpy_ndarray(self):
    Returns the FixedSizeListArray's values as a numpy ndarray with a shape of (len(self), self.type.list_size).

    If any entry of the FixedSizeListArray is itself null (a null list), raises an error.

    If any list in the FixedSizeListArray contains a null element, returns an ndarray with nan in the null positions and dtype float64, or with None in the null positions and dtype object if a conversion to float64 is not possible. This matches the behavior of Array.to_numpy for primitive types.

  • ListArray.from_numpy_ndarray(values, type):
    Works just like FixedSizeListArray.from_numpy_ndarray.

  • ListArray.to_numpy_ndarray(self):
    Works like FixedSizeListArray.to_numpy_ndarray, with an additional check that all list elements are of equal length. If any are different, then raises an error.

The same two methods would be added to LargeListArray, mirroring ListArray, bringing the total to 6.

The FixedSizeListArray methods already have an implementation in the FixedShapeTensor extension type. Those implementations are actually a bit more complicated because of tensors' support for permutations:

def to_numpy_ndarray(self):

def from_numpy_ndarray(obj):

Component(s)

Python
