Related to #14946 on the C++ side, and this recently came up in #14781 (comment).
A StructArray has child arrays that make up its "fields", but in addition it can also have a top-level validity bitmap. So when accessing a field of a StructArray that has such top-level nulls, you can retrieve the "raw" child array or you can get the "logical" field array that combines the child array with the top-level bitmap.
To illustrate:
In [1]: arr = pa.StructArray.from_arrays([pa.array([5, 3, 4, 2, 1]), pa.array([1, 2, 3, 4, 5])], names=['a', 'b'], mask=pa.array([False, True, False, False, False]))
In [2]: arr.to_pandas()
Out[2]:
0 {'a': 5, 'b': 1}
1 None
2 {'a': 4, 'b': 3}
3 {'a': 2, 'b': 4}
4 {'a': 1, 'b': 5}
dtype: object
In [3]: arr.field('a')
Out[3]:
<pyarrow.lib.Int64Array object at 0x7f9db84cdd20>
[
5,
3,
4,
2,
1
]
In [4]: arr.flatten()[0]
Out[4]:
<pyarrow.lib.Int64Array object at 0x7f9db855f400>
[
5,
null,
4,
2,
1
]
Currently, the field() method on a StructArray gives you the raw child array, and there is a flatten() method that returns those "logical" field arrays for all the fields as a list of arrays.
We should have a method with which you can get the field array for a single field instead of having to use flatten(), and in #14781, @amol- added a _flattened_field (private for now, but we needed it to get the correct values to sort by):
In [5]: arr._flattened_field('a')
Out[5]:
<pyarrow.lib.Int64Array object at 0x7f9db85d9780>
[
5,
null,
4,
2,
1
]
We could just make that a public method instead. However, I have some questions/concerns about this:
- I personally don't like the "flattened" term. I know we already use this in C++ as well (this basically just exposes the C++ StructArray::GetFlattenedField), but I don't find it very clear that it means this distinction.
- We could also change field() instead? I personally think this is what people typically will want when they currently call field (like @amol- was doing in the sort PR, to get the values of a certain field of the struct). The value in the raw child that is being masked by the top-level bitmap is kind of an implementation detail, and IMO a user should not necessarily get that so easily.
- If we would change field() to default to the "flattened" field, we need an alternative to access the raw child. We could add a keyword for this? (but what name?) Or a separate method like child()?