[Python] ListArray.from_arrays does not check validity of input arrays #22527

@asfimport

Description

From #4979 (comment).

When creating a ListArray from offsets and values in Python, the offsets are not validated: nothing checks that they start with 0 and end with the length of the values array. Is that required? The docs seem to indicate so: https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type ("The first value in the offsets array is 0, and the last element is the length of the values array.").
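
To make the invariant concrete, here is a minimal pure-Python sketch of the kind of check being discussed. The helper name `check_list_offsets` is hypothetical and not part of pyarrow. It encodes the two conditions quoted from the layout docs, plus the assumption that offsets must never decrease.

```python
def check_list_offsets(offsets, values_length):
    """Raise ValueError if `offsets` cannot describe a list array over
    `values_length` values (hypothetical helper, not a pyarrow API)."""
    offsets = list(offsets)
    if not offsets:
        raise ValueError("offsets must contain at least one element")
    # Condition quoted from the layout docs: the first offset is 0.
    if offsets[0] != 0:
        raise ValueError(f"first offset must be 0, got {offsets[0]}")
    # Assumption beyond the quoted sentence: offsets never decrease.
    for prev, cur in zip(offsets, offsets[1:]):
        if cur < prev:
            raise ValueError(f"offsets decrease from {prev} to {cur}")
    # Condition quoted from the layout docs: the last offset equals
    # the length of the values array.
    if offsets[-1] != values_length:
        raise ValueError(
            f"final offset {offsets[-1]} != values length {values_length}"
        )


# Well-formed offsets pass silently.
check_list_offsets([0, 2, 5], 5)

# The offsets used in the example in this issue are rejected.
try:
    check_list_offsets([1, 3, 10], 5)
except ValueError as exc:
    print("rejected:", exc)
```

With the offsets `[1, 3, 10]` and five values, the first check already fails, so the final-offset mismatch is never reached.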

The array you get looks fine at first glance (the repr), but things go wrong when converting it to Python objects or flattening it:

In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 

In [62]: a
Out[62]: 
<pyarrow.lib.ListArray object at 0x7fdd9c468678>
[
  [
    1,
    2
  ],
  [
    3,
    4
  ]
]

In [63]: a.flatten()
Out[63]: 
<pyarrow.lib.Int64Array object at 0x7fdd9cbfe9e8>
[
  0,   # <--- includes the 0
  1,
  2,
  3,
  4
]

In [64]: a.to_pylist()
Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <--includes more elements as garbage
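
A small pure-Python sketch of why the second list picks up garbage: list `i` spans `values[offsets[i]:offsets[i+1]]`, so an end offset of 10 over 5 values points past the end of the values buffer. The helper names below are hypothetical and only illustrate the arithmetic; they do not touch pyarrow.

```python
def list_slices(offsets):
    """Return the (start, end) range in the values array for each list."""
    return list(zip(offsets[:-1], offsets[1:]))


def out_of_bounds_lists(offsets, values_length):
    """Return the indices of lists whose end offset exceeds the values length."""
    return [
        i
        for i, (_start, end) in enumerate(list_slices(offsets))
        if end > values_length
    ]


offsets = [1, 3, 10]
values_length = 5  # len(np.arange(5))

print(list_slices(offsets))                         # (start, end) per list
print(out_of_bounds_lists(offsets, values_length))  # lists reading past the end
```

Here the second list covers the range 3 to 10, which is 7 elements and matches the length of the second list in the `to_pylist()` output above.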

Calling validate() manually raises, as expected:

In [65]: a.validate()
...
ArrowInvalid: Final offset invariant not equal to values length: 10!=5

In C++ the main constructors are not safe: as the caller, you need to ensure that the data is correct or call a safe (slower) constructor. But do we want to use the unsafe / fast constructors without validation by default in Python as well? Or should we call validate here?

A quick search seems to indicate that pa.Array.from_buffers does validation, but the other from_arrays methods don't seem to do this explicitly.

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-6132. Please see the migration documentation for further details.
