Description
There are some data types where the nulls are not stored "physically" using a validity bitmap on the parent ArrayData, but through nulls in child data:
- UnionArrays don't have a top-level validity bitmap, but still have "logical" nulls: "the nullness of each slot is determined exclusively by the child arrays which are composed to create the union" (https://arrow.apache.org/docs/dev/format/Columnar.html#union-layout)
- Run-End Encoded arrays similarly don't have a top-level validity, but the values child array can have nulls (related: GH-33830: Clarify handling of Null values in REE encoding #33831 explicitly added "The REE parent has no validity bitmap, and it's null count field should always be 0. Null values are encoded as runs with the value null." to the spec)
(sidenote: Dictionary arrays could be considered here as well, but they are a bit of a mixed bag: there is a top-level null count through nulls in the indices, but the dictionary itself can also contain nulls. So nulls can be encoded in two different ways)
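To make the run-end encoded case concrete, here is a minimal sketch of the difference between the two counts. It deliberately does not use the Arrow C++ classes: ToyRunEndEncoded, ToyPhysicalNullCount and ToyReeLogicalNullCount are hypothetical names made up for illustration, and the values child is modelled as a vector of std::optional rather than a real child array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical toy stand-in for a run-end encoded array (NOT the Arrow classes).
struct ToyRunEndEncoded {
  std::vector<int64_t> run_ends;               // cumulative, exclusive end of each run
  std::vector<std::optional<int32_t>> values;  // one value per run; nullopt = null run
};

// "Physical" null count: the REE parent has no validity bitmap, so this is 0.
inline int64_t ToyPhysicalNullCount(const ToyRunEndEncoded&) { return 0; }

// "Logical" null count: total length of all runs whose value is null.
inline int64_t ToyReeLogicalNullCount(const ToyRunEndEncoded& ree) {
  assert(ree.run_ends.size() == ree.values.size());
  int64_t count = 0;
  int64_t prev_end = 0;
  for (std::size_t i = 0; i < ree.values.size(); ++i) {
    if (!ree.values[i].has_value()) {
      count += ree.run_ends[i] - prev_end;  // length of this null run
    }
    prev_end = ree.run_ends[i];
  }
  return count;
}
```

For run ends {2, 5, 6} with values {1, null, 3}, the physical count is 0 while the logical count is 3, because the null run covers slots 2 through 4.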
The format specification has a "null_count" (in the IPC FieldNode in the RecordBatch message, and in the C Data Interface), and in those cases this refers to the "physical" null count. This is followed by the C++ implementation, where the base Array::null_count() (implemented by ArrayData::GetNullCount()) looks at the validity buffer (typically the first buffer) to count the unset bits, or directly returns 0 if there is no validity buffer.
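As a rough illustration of what "counting the unset bits" means, the sketch below walks a validity bitmap in the least-significant-bit order used by the Arrow columnar format. ToyCountPhysicalNulls is a hypothetical helper, not the actual ArrayData::GetNullCount() code; it ignores array offsets and does no caching, both of which a real implementation would need to handle.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: count null slots among the first `length` slots of a
// validity bitmap (bit set = valid, bit unset = null, LSB-first within a byte).
// A nullptr bitmap models "no validity buffer", which means no physical nulls.
inline int64_t ToyCountPhysicalNulls(const uint8_t* validity, int64_t length) {
  assert(length >= 0);
  if (validity == nullptr) {
    return 0;
  }
  int64_t nulls = 0;
  for (int64_t i = 0; i < length; ++i) {
    const bool is_valid = ((validity[i / 8] >> (i % 8)) & 1) != 0;
    if (!is_valid) {
      ++nulls;
    }
  }
  return nulls;
}
```

With a single bitmap byte 0x0D (bits 0, 2 and 3 set) and a length of 5, slots 1 and 4 are null, so the helper reports 2.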
However, in practice you often want to know if there are actual "logical" nulls (not considering those leads to bugs, for example #34315).
@felipecrv @westonpace and I had some discussion about this on Zulip (https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Null.20count.20of.20a.20UnionArray/near/336538844), and I think our current idea would be:
- Add a GetLogicalNullCount to complement the existing Array::null_count() / ArrayData::GetNullCount() (changing null_count() itself might be too much of a breaking change? And would also create an inconsistency with where this is used in the specs)
- Change Array::IsNull(i) to consider logical nulls instead of just physical nulls
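A minimal sketch of how the two proposed behaviours could fit together for a sparse union follows. Everything in it is an assumption for illustration: ToySparseUnion, ToyIsLogicallyNull and ToyUnionLogicalNullCount are hypothetical names, children are modelled as vectors of std::optional, and type ids are used directly as child indices (in Arrow the mapping goes through the union type's type codes). It is not the Arrow implementation, nor a proposed signature.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical toy model of a sparse union whose children are all int32.
using ToyChild = std::vector<std::optional<int32_t>>;

struct ToySparseUnion {
  std::vector<int8_t> type_ids;    // which child each slot reads from
  std::vector<ToyChild> children;  // sparse union: each child has full length
};

// Logical IsNull: a slot is null iff the selected child is null at that slot.
inline bool ToyIsLogicallyNull(const ToySparseUnion& u, std::size_t i) {
  assert(i < u.type_ids.size());
  const ToyChild& child = u.children[static_cast<std::size_t>(u.type_ids[i])];
  return !child[i].has_value();
}

// Logical null count: number of slots for which the logical IsNull is true.
inline int64_t ToyUnionLogicalNullCount(const ToySparseUnion& u) {
  int64_t count = 0;
  for (std::size_t i = 0; i < u.type_ids.size(); ++i) {
    if (ToyIsLogicallyNull(u, i)) {
      ++count;
    }
  }
  return count;
}
```

Nulls in a child only count when that child is the one selected for the slot, which is why the result cannot be obtained by simply summing the children's null counts.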