-
Notifications
You must be signed in to change notification settings - Fork 4k
Description
Describe the enhancement requested
The spec says that inner values of a Fixed-Size List are unspecified [1] if the list value is null. This exists (I suppose) to preserve the invariant that we can always set a value on any array to null by flipping a single bit on the bitmap [2]. So there was never a guarantee that a null fixed-size list value is aligned to a block of corresponding nulls at the child values array.
And yet, when FixedSizeListBuilder::AppendNull() is called, it populates the inner values array with nulls.
This means that an array that wouldn't necessarily require an internal validity bitmap like [[1, 2], null, [5, 6]] ends up getting 1 extra validity bitmap. The problem is even worse if the fixed-size lists are nested deeply — [[[1, 2], [3, 4]], null, [[9, 10], [11, 12]]] would require 3 validity bitmaps, but only the top-level one is necessary.
I propose that we ask the values builder to append empty values to the child array instead of nulls.
In summary, this is my proposal:
Status FixedSizeListBuilder::AppendNull() {
RETURN_NOT_OK(Reserve(1));
UnsafeAppendToBitmap(false);
- return value_builder_->AppendNulls(list_size_);
+ return value_builder_->AppendEmptyValues(list_size_);
}
Status FixedSizeListBuilder::AppendNulls(int64_t length) {
RETURN_NOT_OK(Reserve(length));
UnsafeAppendToBitmap(length, false);
- return value_builder_->AppendNulls(list_size_ * length);
+ return value_builder_->AppendEmptyValues(list_size_ * length);
}Risks: there is a risk that previously written code assumes that scalar values in the child array of a fixed-size list are null if the list value itself is null, but that's a very fragile assumption to begin with, because (1) the spec doesn't promise this, and more importantly (2) other ways of building fixed-size list arrays don't behave like this.
[1] https://arrow.apache.org/docs/format/Columnar.html#fixed-size-list-layout
[2] on the types that support validity bitmaps, which are almost all types
Component(s)
C++