GH-34315: [C++] Correct is_null kernel for Union and RunEndEncoded logical nulls #35036
Conversation
cc @felipecrv (this is very much a draft: it still needs more tests, doc comments, and applying the same changes to is_valid, but I would already like to get your thoughts on the implementation approach)
const auto t = arr.type->id();
if (t == Type::SPARSE_UNION) {
  union_util::SetLogicalSparseUnionNullBits(arr, out_bitmap, out_span->offset);
I followed the same pattern we had for SetNanBits (used just below), i.e. updating the previously created bitmap; I'm not sure this is necessarily the best approach, though.
// ASSERT_OK(ree_array->ValidateFull());
// auto expected =
//     ArrayFromJSON(boolean(), "[false, false, false, true, true, false, false]");
// CheckScalarUnary("is_null", ree_array, expected);
This test is passing, except for the variant that runs it with scalars, because ArraySpan::FillFromScalar is not implemented for RunEndEncoded. cc @felipecrv
I thought I had implemented that 🤔
Thanks for adding it! I'm gonna use it for something I'm doing as well.
cpp/src/arrow/array/data.cc
Outdated
// Populate null count and validity bitmap (only for non-union/null types)
this->null_count = value.is_valid ? 0 : 1;
if (!is_union(type_id) && type_id != Type::NA) {
// Populate null count and validity bitmap (only for non-union/ree/null types)
This comment can now be removed, as the if checks wrapping the null_count assignments document the context of each assignment.
cpp/src/arrow/array/data.cc
Outdated
} else if (type_id == Type::RUN_END_ENCODED) {
  const auto& scalar = checked_cast<const RunEndEncodedScalar&>(value);
  this->child_data.resize(2);
  auto run_end_type = scalar.run_end_type();
auto& to elide one refcount increment
const auto t = arr.type->id();
if (t == Type::SPARSE_UNION) {
  SetSparseUnionLogicalNullBits(arr, out_bitmap, out_span->offset);
} else if (t == Type::DENSE_UNION) {
  SetDenseUnionLogicalNullBits(arr, out_bitmap, out_span->offset);
} else if (t == Type::RUN_END_ENCODED) {
  SetREELogicalNullBits(arr, out_bitmap, out_span->offset);
}
This whole block can be moved into the else above to make it clearer that it only gets called after a complete zeroing of out_bitmap.
As it stands, we have to know about a logical implication that is not explicit in the code: t \in {UNION, REE} \implies arr.GetNullCount() == 0. That is what in turn guarantees out_bitmap is zeroed before the Set*LogicalNullBits calls. Too much implicit logic. 😅
The comment // Input has no nulls => output is entirely false. also needs to be made more nuanced after that move.
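For reference, the logical-null rule these kernels implement can be sketched in plain Python (hypothetical helper names; a sparse union reads the selected child at the same position, a dense union reads it through the offsets buffer):

```python
def sparse_union_is_null(type_ids, children_is_null):
    # Element i is null iff the child selected by type_ids[i] is null at i.
    return [children_is_null[t][i] for i, t in enumerate(type_ids)]


def dense_union_is_null(type_ids, offsets, children_is_null):
    # Element i is null iff the selected child is null at offsets[i].
    return [children_is_null[t][o] for t, o in zip(type_ids, offsets)]


# Two children: child 0 = [1, None], child 1 = [None, "a"]
kids = {0: [False, True], 1: [True, False]}
print(dense_union_is_null([0, 1, 0], [0, 1, 1], kids))
# [False, False, True]
```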
cpp/src/arrow/array/data.cc
Outdated
  span->buffers[buffer_index].size = 2 * sizeof(offset_type);
}

template <typename ArrowType>
nitpick: To be consistent with REE code elsewhere: RunEndType.
cpp/src/arrow/array/data.cc
Outdated
void SetRunForScalar(ArraySpan* span, std::shared_ptr<DataType> type,
                     uint64_t* scratch_space) {
  // Create a lenght-1 array with the value 1 for a REE scalar
  using run_type = typename ArrowType::c_type;
run_end_type (run_type is ambiguous -- could be the type of the value in the run).
cpp/src/arrow/array/data.cc
Outdated
template <typename ArrowType>
void SetRunForScalar(ArraySpan* span, std::shared_ptr<DataType> type,
                     uint64_t* scratch_space) {
  // Create a lenght-1 array with the value 1 for a REE scalar
length typo
I think the comment can be removed after the other changes that clarify the code are made.
cpp/src/arrow/array/data.cc
Outdated
}

template <typename ArrowType>
void SetRunForScalar(ArraySpan* span, std::shared_ptr<DataType> type,
FillRunEndsArrayFromScalar? s/type/run_end_type
cpp/src/arrow/array/data.cc
Outdated
buffer[0] = static_cast<run_type>(1);
auto data_buffer =
    std::make_shared<Buffer>(reinterpret_cast<uint8_t*>(buffer), sizeof(run_type));
auto data = ArrayData::Make(type, 1, {nullptr, data_buffer}, 0);
std::move(type) and std::move(data_buffer)
cpp/src/arrow/array/data.cc
Outdated
auto data_buffer =
    std::make_shared<Buffer>(reinterpret_cast<uint8_t*>(buffer), sizeof(run_type));
auto data = ArrayData::Make(type, 1, {nullptr, data_buffer}, 0);
span->child_data[0].SetMembers(*data);
I think it's better if the caller passes this->child_data[0] to this function instead of the span, making it obvious at the call site how both child_data entries are filled.
You could derive the scratch_space parameter from that as well, so that you use child_data[0].scratch_space instead of the parent's scratch_space.
cpp/src/arrow/array/data.cc
Outdated
    break;
  default:
    DCHECK_EQ(run_end_type->id(), Type::INT64);
    SetRunForScalar<Int64Type>(this, run_end_type, this->scratch_space);
You're getting that stack trace you posted above from this new implementation?
I think this will be less risky if you use the this->child_data[0].scratch_space instead of this->scratch_space for the run-ends array buffer. I think that's the contract: an ArraySpan is free to use the scratch_space it directly owns.
You're getting that stack trace you posted above from this new implementation?
No, that was with the previous implementation. This last commit was to fix that
But thanks for the comments! Will clean up.
I don't know if it's related or not, but there is some danger with the way we handle scratch space (#35581).
cpp/src/arrow/array/data.cc
Outdated
std::make_shared<Buffer>(reinterpret_cast<uint8_t*>(buffer), sizeof(RunEndCType));
auto data =
    ArrayData::Make(std::move(run_end_type), 1, {nullptr, std::move(data_buffer)}, 0);
span->SetMembers(*data);
Hmmmm... ArraySpan is non-owning, so it won't keep this data available. Sorry for not noticing this before.
Feel free to set the members directly, one by one, here. That skips allocating and deallocating an ArrayData, which is the point of using ArraySpans in the first place.
The run_end_type should be a pointer coming directly from the parent REE type, so don't pass it as a shared_ptr copy.
The same is done for Dictionary though:
// Populate dictionary data
const auto& dict_scalar = checked_cast<const DictionaryScalar&>(value);
this->child_data.resize(1);
this->child_data[0].SetMembers(*dict_scalar.value.dictionary->data());
Or is that fine because that is data owned by the scalar?
Or is that fine because that is data owned by the scalar?
Yes.
void SetREELogicalNullBits(const ArraySpan& span, uint8_t* out_bitmap,
                           int64_t out_offset) {
  const auto& values = arrow::ree_util::ValuesArray(span);
  DCHECK(!is_nested(values.type->id()));
AFAIK, it is not forbidden to have nested REE values. So this should be turned into an error return (perhaps Status::NotImplemented) at a higher level.
However, instead of refusing to implement this, you could use another strategy:
- call IsNull on the values array
- REE-decode the REE array comprised of (run ends, run values IsNull)
I've been meaning to send a PR forbidding nested REEs, because it's pointless to run-end encode a values array: all the runs already have length 1 and the run-ends array contains a strictly increasing sequence of indexes.
@felipecrv I think you're talking about something else? "Nested" here means any type with child fields (e.g. list, struct...).
you could use another strategy:
That would indeed be a good alternative and would be more robust for whatever type is used for the REE values. I did a quick benchmark comparing both strategies in Python:
In [2]: run_lengths = np.random.randint(1, 10, 100_000)
In [3]: run_values = [1, 2, 3, 4, None] * 20000
In [4]: arr = pa.RunEndEncodedArray.from_arrays(run_lengths.cumsum(), run_values)
In [5]: res1 = pc.is_null(arr)
In [6]: res2 = pc.run_end_decode(pa.RunEndEncodedArray.from_arrays(np.asarray(arr.run_ends), pc.is_null(arr.values)))
In [7]: res1.equals(res2)
Out[7]: True
In [8]: %timeit pc.is_null(arr)
309 µs ± 843 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [9]: %timeit pc.run_end_decode(pa.RunEndEncodedArray.from_arrays(np.asarray(arr.run_ends), pc.is_null(arr.values)))
1.07 ms ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
This is running with this branch (in release mode), so pc.is_null is using this PR's implementation, and the other is the python equivalent of what you propose (IIUC).
The alternative seems significantly slower, although I don't know how much of that is due to overhead of going through python several times.
No idea, but the proposal here is to make things correct first ;-) It's also possible that run-end-decoding could be optimized...
Based on a quick profile, a significant part of the time is spent in the actual RunEndDecode impl, indicating it's not just Python overhead. But it's also true that this implementation is not necessarily optimized ;)
Now maybe the DCHECK(!is_nested(values.type->id())); isn't actually correct. What it is protecting against is that I am getting the first buffer of the data and assuming it is a validity bitmap. That assumption only fails for REE/union, not for "nested" types in general.
Although for supporting nan_is_null, that will still be easier to do through recursively calling the is_null kernel. Will take a look at doing it that way.
@pitrou I misunderstood "nested REE" as a REE inside another REE.
auto expected =
    ArrayFromJSON(boolean(), "[false, false, false, true, true, false, false]");
CheckScalarUnary("is_null", ree_array, expected);
}
Can you add a test with floating-point run end values and clearing/setting the nan_is_null option?
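For context, the semantics the nan_is_null option adds can be sketched in plain Python (hypothetical helper, not the Arrow kernel): a missing value is always null, and NaN additionally counts as null only when the option is set.

```python
import math


def is_null_with_nan(values, nan_is_null=False):
    # values: Python floats or None. None is always null; NaN is null
    # only when nan_is_null is set.
    return [
        v is None or (nan_is_null and isinstance(v, float) and math.isnan(v))
        for v in values
    ]


print(is_null_with_nan([1.0, float("nan"), None]))                    # [False, False, True]
print(is_null_with_nan([1.0, float("nan"), None], nan_is_null=True))  # [False, True, True]
```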
DenseUnionArray::Make(*type_ids, *value_offsets,
                      {dense_field_i64, dense_field_str}));
CheckScalarUnary("is_null", arr2, expected);
}
Can you add floating-point child union values and test with/without the nan_is_null option set?
assert result.equals(expected)


def test_is_null_union():
Is it necessary to add these tests on the Python side? You have similar tests in C++ already.
}

template <typename RunEndType>
void FillRunEndsArrayForScalar(ArraySpan* span, const DataType* run_end_type) {
Are these changes related to this PR?
If so, can you perhaps add a test for FillFromScalar on a run-end-encoded scalar? For example in arrow/array/array_run_end_test.cc.
Yes, the tests for scalar kernels automatically run the kernel on actual scalars as well, and then this ends up creating an array from the scalar. So those changes to FillFromScalar are needed to be able to run any kernel on a REE scalar.
At the moment we don't actually have any test for FillFromScalar directly (also not for other types), only through testing the kernels on scalars.
// Populate null count and validity bitmap
if (type_id == Type::NA) {
  this->null_count = 1;
} else if (is_union(type_id) || type_id == Type::RUN_END_ENCODED) {
Suggested change:
- } else if (is_union(type_id) || type_id == Type::RUN_END_ENCODED) {
+ } else if (!HasValidityBitmap(type_id)) {
} else if (is_union(type_id) || type_id == Type::RUN_END_ENCODED) {
  this->null_count = 0;
} else {
  this->null_count = value.is_valid ? 0 : 1;
We do, but they have the same semantics as dictionary arrays: the top-level validity refers to the indices, regardless of the underlying dictionary values.
} else if (type_id == Type::RUN_END_ENCODED) {
  const auto& scalar = checked_cast<const RunEndEncodedScalar&>(value);
  this->child_data.resize(2);
  auto& run_end_type = scalar.run_end_type();
Nit: style
Suggested change:
- auto& run_end_type = scalar.run_end_type();
+ const auto& run_end_type = scalar.run_end_type();
Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer.
Rationale for this change
Currently the is_null kernel always returns all-False for union and run-end-encoded types, since those don't have a top-level validity buffer. Update the kernel to take the logical nulls into account.

Are these changes tested?
Yes, both in Python and C++
Are there any user-facing changes?
Yes, this changes (corrects) the behaviour of the is_null kernel.

Related issue: pa.compute.is_null() returns incorrect answer for dense union arrays and segfaults for dense union scalars (#34315)
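To make the corrected run-end-encoded semantics concrete, here is a plain-Python sketch (hypothetical helper, independent of pyarrow) that expands run ends plus per-run validity into the per-element is_null mask the tests in this PR expect:

```python
def ree_is_null(run_ends, run_is_valid):
    # Run i covers elements [previous_end, run_ends[i]); every element in
    # a run with an invalid value is logically null.
    out, prev_end = [], 0
    for end, valid in zip(run_ends, run_is_valid):
        out.extend([not valid] * (end - prev_end))
        prev_end = end
    return out


# Runs of values [x, None, y] with run ends [3, 5, 7]:
print(ree_is_null([3, 5, 7], [True, False, True]))
# [False, False, False, True, True, False, False]
```

This matches the expected array used in the C++ test above, "[false, false, false, true, true, false, false]".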