[Python] Change StructArray.field(..) to return "flattened" field? #14970

@jorisvandenbossche

Related to #14946 on the C++ side, and this recently came up in #14781 (comment).

A StructArray has child arrays that make up its "fields", but in addition it can also have a top-level validity bitmap. So when accessing a field of a StructArray that has such top-level nulls, you can retrieve the "raw" child array or you can get the "logical" field array that combines the child array with the top-level bitmap.

To illustrate:

In [1]: import pyarrow as pa
   ...: arr = pa.StructArray.from_arrays(
   ...:     [pa.array([5, 3, 4, 2, 1]), pa.array([1, 2, 3, 4, 5])],
   ...:     names=['a', 'b'],
   ...:     mask=pa.array([False, True, False, False, False]),
   ...: )

In [2]: arr.to_pandas()
Out[2]: 
0    {'a': 5, 'b': 1}
1                None
2    {'a': 4, 'b': 3}
3    {'a': 2, 'b': 4}
4    {'a': 1, 'b': 5}
dtype: object

In [3]: arr.field('a')
Out[3]: 
<pyarrow.lib.Int64Array object at 0x7f9db84cdd20>
[
  5,
  3,
  4,
  2,
  1
]

In [4]: arr.flatten()[0]
Out[4]: 
<pyarrow.lib.Int64Array object at 0x7f9db855f400>
[
  5,
  null,
  4,
  2,
  1
]

Currently, the field() method on a StructArray returns the raw child array, while the flatten() method returns the "logical" field arrays for all fields as a list of arrays.
We should have a way to get the logical array for a single field without going through flatten(). In #14781, @amol- added _flattened_field for this (private for now, but we needed it to get the correct values to sort by):

In [5]: arr._flattened_field('a')
Out[5]: 
<pyarrow.lib.Int64Array object at 0x7f9db85d9780>
[
  5,
  null,
  4,
  2,
  1
]

We could simply make that method public. However, I have some questions/concerns about this:

  • I personally don't like the "flattened" term. I know we already use it in C++ as well (this basically just exposes the C++ StructArray::GetFlattenedField), but I don't find it very clear that it refers to this raw vs. logical distinction.
  • We could also change field() instead? I personally think the logical field is what people will typically want when they currently call field() (like @amol- was doing in the sort PR, to get the values of a certain field of the struct). The value in the raw child that is masked by the top-level bitmap is kind of an implementation detail, and IMO a user should not necessarily get that so easily.
  • If we change field() to default to the "flattened" field, we need an alternative way to access the raw child. We could add a keyword for this? (but what name?) Or a separate method like child()?
