ARROW-9603: [C++] Fix parquet write to not assume leaf-array validity bitmaps have the same values as parent structs #8219
There is a better solution. I'll update the PR.
Never mind, I think this is likely the only reasonable approach. We might consider pushing bitmap building up the stack at some point.
I'm not sure I have enough mental context to review this PR carefully.
    @@ -838,10 +841,13 @@ class PathBuilder {
    #undef NOT_IMPLEMENTED_VISIT
      std::vector<PathInfo>& paths() { return paths_; }

      bool root_is_nullable() const { return root_is_nullable_; }
this is unused now.
@xhochy might be the only one. I can do my best to provide some comments.
wesm left a comment
I reviewed only the new parts -- overall seemed pretty reasonable. Can you update the PR title to explain the issue?
It's regrettable that this change has to touch so much code -- makes me think there could be some code restructurings possible in column_writer.cc, but not sure it's worth the expense right now
    ::arrow::Table::Make(
        ::arrow::schema({struct_field}),
        {std::make_shared<::arrow::ChunkedArray>(::arrow::MakeArray(struct_data))}),
    /*row_group_size=*/8);
Seems like there might be a helper function opportunity if this pattern is repeated in other test functions
it turns out this could be simplified as well, so I don't think a helper function is necessary.
    auto struct_data = std::make_shared<ArrayData>(
        struct_field->type(), /*length=*/8,
        std::vector<std::shared_ptr<Buffer>>{validity_bitmap},
        std::vector<std::shared_ptr<ArrayData>>{int_array->data()});
You can use ArrayData::Make for nicer syntax (don't have to write out std::vector<std::shared_ptr<Buffer>>)
thanks, I somehow keep forgetting this.
    @@ -871,6 +877,8 @@ class MultipathLevelBuilderImpl : public MultipathLevelBuilder {
          std::move(write_leaf_callback));
      }

      bool Nested() const override { return !data_->child_data.empty(); }
IsNested?
done.
cpp/src/parquet/arrow/writer.cc
Outdated
        ctx);
    PARQUET_CATCH_AND_RETURN(column_writer->WriteArrow(
        result.def_levels, result.rep_levels, result.def_rep_level_count,
        *values_array, ctx, level_builder->Nested(), result.leaf_is_nullable));
Since WriteArrow returns Status, should we adopt the convention that APIs either return Status or throw an exception, but not both? (FWIW I regret that we chose to allow exceptions in the Parquet C++ project back in 2016)
done. I suppose it is too late to revisit this? Perhaps provide Status/Result-returning methods in one PR and then deprecate the exception-throwing ones?
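One way to read the suggestion above is a thin wrapper that turns exceptions into a status value, so that each API exposes a single error channel. The sketch below is illustrative only: `SketchStatus`, `CatchAndReturnStatus`, `ThrowingWrite`, and `WriteNoThrow` are hypothetical names, not the Arrow or Parquet C++ API.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a status type; this is NOT arrow::Status.
struct SketchStatus {
  bool ok;
  std::string message;
  static SketchStatus OK() { return {true, ""}; }
  static SketchStatus Error(std::string msg) { return {false, std::move(msg)}; }
};

// Runs `fn` and converts any thrown std::exception into an error status,
// so callers only ever have to check the returned value.
template <typename Fn>
SketchStatus CatchAndReturnStatus(Fn&& fn) {
  try {
    std::forward<Fn>(fn)();
    return SketchStatus::OK();
  } catch (const std::exception& e) {
    return SketchStatus::Error(e.what());
  }
}

// Hypothetical exception-throwing API, used only for illustration.
inline void ThrowingWrite(int num_values) {
  if (num_values < 0) throw std::invalid_argument("negative value count");
}

// Status-returning variant that could be offered next to the throwing one
// before the throwing one is deprecated.
inline SketchStatus WriteNoThrow(int num_values) {
  return CatchAndReturnStatus([&] { ThrowingWrite(num_values); });
}

// Usage: success and failure are both reported through the return value.
inline void ExampleUsage() {
  assert(WriteNoThrow(3).ok);
  assert(!WriteNoThrow(-1).ok);
}
```

Keeping the throwing function and adding the wrapper alongside it is what would allow the two-step migration floated in the reply: callers move to the status-returning variant first, and the throwing one is deprecated afterwards.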
cpp/src/parquet/column_writer.cc
Outdated
    bool leaf_nulls_are_canonical =
        (level_info_.def_level == level_info_.repeated_ancestor_def_level + 1) &&
        array_nullable;
    bool maybe_has_nulls = nested && !(leaf_is_not_nullable || leaf_nulls_are_canonical);
The fact that maybe_has_nulls is false whenever nested is false seems odd
yeah, I renamed maybe_has_nulls to maybe_has_parent_nulls which is hopefully clearer? Happy to pick another name that makes sense.
cpp/src/parquet/column_writer.cc
Outdated
    buffers[0] = bits_buffer_;
    DCHECK(array->num_fields() == 0);
    return arrow::MakeArray(std::make_shared<ArrayData>(
        array->type(), array->length(), std::move(buffers), new_null_count));
Might be useful someday to have a helper function to make an array copy with a particular buffer replaced, I seem to recall a JIRA issue about this
Agree, it looks like https://issues.apache.org/jira/browse/ARROW-7071 might be it?
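A helper of the kind discussed here might look roughly like the following. This is a sketch over simplified stand-in types (`SketchBuffer`, `SketchArrayData`), and `CopyWithReplacedBuffer` is a hypothetical name; it does not reflect what ARROW-7071 proposes or any existing Arrow function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Simplified stand-ins for illustration only; these are NOT the Arrow classes.
using SketchBuffer = std::vector<uint8_t>;

struct SketchArrayData {
  int64_t length = 0;
  int64_t null_count = 0;
  std::vector<std::shared_ptr<SketchBuffer>> buffers;
};

// Returns a shallow copy of `data` whose buffer at `index` is `replacement`.
// All other buffers are shared with the original rather than copied.
inline std::shared_ptr<SketchArrayData> CopyWithReplacedBuffer(
    const SketchArrayData& data, std::size_t index,
    std::shared_ptr<SketchBuffer> replacement, int64_t new_null_count) {
  assert(index < data.buffers.size());
  auto copy = std::make_shared<SketchArrayData>(data);
  copy->buffers[index] = std::move(replacement);
  copy->null_count = new_null_count;
  return copy;
}
```

Because only shared pointers are copied, swapping in a new validity bitmap this way leaves the original array untouched and does not duplicate the value buffer.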
cpp/src/parquet/column_writer.cc
Outdated
      *null_count = io.null_count;
    }

    std::shared_ptr<Array> MaybeUpdateArray(std::shared_ptr<Array> array,
MaybeReplaceValidity?
done.
26745a9 to 96d2ad5
emkornfield left a comment
@wesm thanks for the review. I addressed comments and rebased off of master to remove the first commit.
@xhochy did you want to review?
cpp/src/parquet/column_writer.cc
Outdated
        ArrowWriteContext* ctx, bool nested, bool array_nullable) override {
      BEGIN_PARQUET_CATCH_EXCEPTIONS
      bool leaf_is_not_nullable = !level_info_.HasNullableValues();
      // Leaf nulls are canonical when there is only a single null element and it is at the
"single nullable element" perhaps?
will do.
done.
cpp/src/parquet/column_writer.cc
Outdated
    // leaf.
    bool leaf_nulls_are_canonical =
        (level_info_.def_level == level_info_.repeated_ancestor_def_level + 1) &&
        array_nullable;
Does array_nullable refer to the parent, the root, or the leaf? This is difficult to follow.
Perhaps rename to parent_nullable or root_nullable or...
it is the leaf, will do some renaming to make this clearer.
cpp/src/parquet/column_writer.cc
Outdated
        ArrowWriteContext* ctx) override {
        ArrowWriteContext* ctx, bool nested, bool array_nullable) override {
      BEGIN_PARQUET_CATCH_EXCEPTIONS
      bool leaf_is_not_nullable = !level_info_.HasNullableValues();
Maybe avoid double negatives?
will rename.
cpp/src/parquet/column_writer.cc
Outdated
        (level_info_.def_level == level_info_.repeated_ancestor_def_level + 1) &&
        array_nullable;
    bool maybe_parent_nulls =
        nested && !(leaf_is_not_nullable || leaf_nulls_are_canonical);
Wait, if nested is false, is all this complicated dance required?
it is not.
nested is actually unnecessary. I've removed it. The only thing that matters is whether the column is nullable according to columninfo and isn't the only nullable column.
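The condition described in this thread, with `nested` dropped, can be sketched as below. `SketchLevelInfo` and `MaybeParentNulls` are hypothetical stand-ins that mirror the snippets quoted in the review, not the merged code, and the body of `HasNullableValues` is an assumption about what the real method checks.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the level information used by the column writer.
struct SketchLevelInfo {
  int16_t def_level;
  int16_t repeated_ancestor_def_level;

  // Assumption: values can be null when at least one nullable level sits
  // below the nearest repeated ancestor.
  bool HasNullableValues() const {
    return repeated_ancestor_def_level < def_level;
  }
};

// True when nulls may come from a parent rather than only from the leaf's
// own validity bitmap, following the expressions quoted in the review.
inline bool MaybeParentNulls(const SketchLevelInfo& info, bool leaf_is_nullable) {
  assert(info.def_level >= info.repeated_ancestor_def_level);
  const bool can_have_nulls = info.HasNullableValues();
  // "Canonical" case: exactly one nullable element, and it is the leaf.
  const bool leaf_nulls_are_canonical =
      (info.def_level == info.repeated_ancestor_def_level + 1) && leaf_is_nullable;
  return can_have_nulls && !leaf_nulls_are_canonical;
}
```

Under these assumptions, a top-level nullable leaf (`def_level` 1, ancestor level 0) takes the cheap path, while a required leaf inside a nullable struct, or a nullable leaf inside a nullable struct, is flagged as possibly carrying parent nulls.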
cpp/src/parquet/column_writer.cc
Outdated
    arrow::AllocateResizableBuffer(
        BitUtil::BytesForBits(properties_->write_batch_size()), ctx->memory_pool));
    bits_buffer_->ZeroPadding();
    std::static_pointer_cast<ResizableBuffer>(AllocateBuffer(allocator_, 0));
Is this allocating a new (temporary?) validity buffer for each write batch?
this line should be removed. But above, yes, we do allocate a new buffer for each WriteArrow call. I think this object might only live for one WriteArrow call. Internally there is a concept of batching, and the allocation should only happen once for each of those batches.
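To make the allocation question concrete, here is a sketch of a scratch validity bitmap that is sized from a bit count and reused across batches. `BytesForBitsSketch` and `ScratchBitmap` are hypothetical names for illustration; the quoted code uses Arrow's own buffer and allocation utilities instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of bytes needed to hold `bits` validity bits, rounded up.
inline int64_t BytesForBitsSketch(int64_t bits) {
  assert(bits >= 0);
  return (bits + 7) / 8;
}

// Hypothetical scratch bitmap that grows on demand and is reused per batch.
class ScratchBitmap {
 public:
  // Ensures room for `num_bits` bits and zeroes the bytes that will be used.
  uint8_t* Prepare(int64_t num_bits) {
    const auto needed = static_cast<std::size_t>(BytesForBitsSketch(num_bits));
    if (bytes_.size() < needed) {
      bytes_.resize(needed);
      ++growth_count_;
    }
    std::fill(bytes_.begin(),
              bytes_.begin() + static_cast<std::ptrdiff_t>(needed), uint8_t{0});
    return bytes_.data();
  }

  // How many times the backing storage had to grow.
  int growth_count() const { return growth_count_; }

 private:
  std::vector<uint8_t> bytes_;
  int growth_count_ = 0;
};
```

If the scratch object lives only for one WriteArrow call, as the reply suggests, this amounts to one allocation per call, with every internal batch of at most the write batch size reusing the same bytes.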
I reserved myself an hour tomorrow to review this. I haven't touched this code for over a year, but this is the code path that actually got me into the Arrow/Parquet project, so I'm happy to carve out time for it.
          3);
    }

    TEST(ArrowReadWrite, NestedRequiredField) {
The test cases look very very similar, just the name and the used values differ. I would have expected that we also would have set nullable=false somewhere in this one.
Are you looking at the latest version?
Right below this comment is:
auto int_field = ::arrow::field("int_array", ::arrow::int32(), /*nullable=*/false);
(note the last parameter?)
Ah, didn't see that when reviewing this code. Now this makes sense!
pitrou left a comment
+1 from me.
Don't rely on nullability values of leaf nodes matching their parents.
In general it feels like the WriteArrow code path in column_writer.cc could use some cleanup to remove duplicated code, but while ugly I think this fix works.
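The underlying issue can be illustrated with a small sketch: a child array may mark a slot as valid even where its parent struct is null, so the validity that matters for writing is the combination of both. The code below uses `std::vector<bool>` as a stand-in for Arrow validity bitmaps, ignores offsets and list nesting, and assumes a single struct level; `EffectiveLeafValidity` and `CountNulls` are hypothetical helpers, not functions from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Combines a parent struct's validity with its child's validity: a leaf slot
// counts as present only if both the parent and the leaf mark it valid.
inline std::vector<bool> EffectiveLeafValidity(const std::vector<bool>& parent_valid,
                                               const std::vector<bool>& leaf_valid) {
  assert(parent_valid.size() == leaf_valid.size());
  std::vector<bool> effective(parent_valid.size());
  for (std::size_t i = 0; i < effective.size(); ++i) {
    effective[i] = parent_valid[i] && leaf_valid[i];
  }
  return effective;
}

// Counts the slots marked as null (false) in a validity vector.
inline std::size_t CountNulls(const std::vector<bool>& valid) {
  std::size_t nulls = 0;
  for (bool v : valid) {
    if (!v) ++nulls;
  }
  return nulls;
}
```

For example, with a parent validity of `{1, 0, 1, 0}` and a leaf validity of `{1, 1, 1, 1}`, the leaf bitmap alone reports zero nulls while the combined validity reports two, which is the mismatch a writer hits if it trusts the leaf bitmap on its own.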