Skip to content

Conversation

@emkornfield
Copy link
Contributor

@emkornfield emkornfield commented Sep 18, 2020

Don't rely on nullability values of leaf nodes matching their parents.

In general it feels like the WriteArrow code path in column_writer.cc could use some cleanup to remove duplicated code, but while ugly I think this fix works.

@emkornfield emkornfield requested a review from wesm September 18, 2020 09:21
@github-actions
Copy link

@emkornfield
Copy link
Contributor Author

There is a better solution. I'll update the PR

@emkornfield
Copy link
Contributor Author

Nm, I think this is likely the only reasonable approach. We might consider pushing bitmap building up the stack at some point.

@wesm wesm changed the title ARROW-9603: Fix parquet write ARROW-9603: [C++] Fix parquet write Sep 21, 2020
@wesm
Copy link
Member

wesm commented Sep 21, 2020

I'm not sure I have enough mental context to review this PR carefully

@emkornfield
Copy link
Contributor Author

@wesm hmm, I think you might be the only once familiar with the write path. But maybe @pitrou could give a more general review? Anyone else you would suggest?

@@ -838,10 +841,13 @@ class PathBuilder {
#undef NOT_IMPLEMENTED_VISIT
std::vector<PathInfo>& paths() { return paths_; }

bool root_is_nullable() const { return root_is_nullable_; }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is unused now.

@wesm
Copy link
Member

wesm commented Sep 21, 2020

@xhochy might be the only one. I can do my best to provide some comments

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed only the new parts -- overall seemed pretty reasonable. Can you update the PR title to explain the issue?

It's regrettable that this change has to touch so much code -- makes me think there could be some code restructurings possible in column_writer.cc, but not sure it's worth the expense right now

::arrow::Table::Make(
::arrow::schema({struct_field}),
{std::make_shared<::arrow::ChunkedArray>(::arrow::MakeArray(struct_data))}),
/*row_group_size=*/8);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like there might be a helper function opportunity if this pattern is repeated in other test functions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it turns out this could be simplified as well, so I don't think a helper function is necessary.

auto struct_data = std::make_shared<ArrayData>(
struct_field->type(), /*length=*/8,
std::vector<std::shared_ptr<Buffer>>{validity_bitmap},
std::vector<std::shared_ptr<ArrayData>>{int_array->data()});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use ArrayData::Make for nicer syntax (don't have to write out std::vector<std::shared_ptr<Buffer>>)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, I somehow keep forgetting this.

@@ -871,6 +877,8 @@ class MultipathLevelBuilderImpl : public MultipathLevelBuilder {
std::move(write_leaf_callback));
}

bool Nested() const override { return !data_->child_data.empty(); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IsNested?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

ctx);
PARQUET_CATCH_AND_RETURN(column_writer->WriteArrow(
result.def_levels, result.rep_levels, result.def_rep_level_count,
*values_array, ctx, level_builder->Nested(), result.leaf_is_nullable));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since WriteArrow returns Status, should we adopt that APIs must either return Status or throw an exception, but not both? (FWIW I regret that we chose to allow exceptions in the Parquet C++ project back in 2016)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. I suppose it is too late to revisit this? Perhaps provide status/result returning methods in one PR and then deprecated exception throwing ones?

bool leaf_nulls_are_canonical =
(level_info_.def_level == level_info_.repeated_ancestor_def_level + 1) &&
array_nullable;
bool maybe_has_nulls = nested && !(leaf_is_not_nullable || leaf_nulls_are_canonical);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that maybe_has_nulls is false whenever nested is false seems odd

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I renamed maybe_has_nulls to maybe_has_parent_nulls which is hopefully clearer? Happy to pick another name that makes sense.

buffers[0] = bits_buffer_;
DCHECK(array->num_fields() == 0);
return arrow::MakeArray(std::make_shared<ArrayData>(
array->type(), array->length(), std::move(buffers), new_null_count));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be useful someday to have a helper function to make an array copy with a particular buffer replaced, I seem to recall a JIRA issue about this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, looks like:https://issues.apache.org/jira/browse/ARROW-7071 might be it?

*null_count = io.null_count;
}

std::shared_ptr<Array> MaybeUpdateArray(std::shared_ptr<Array> array,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MaybeReplaceValidity?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@emkornfield emkornfield changed the title ARROW-9603: [C++] Fix parquet write ARROW-9603: [C++] Fix parquet write to not assume leaf-array validity bitmaps have the same values as parent structs Sep 22, 2020
Copy link
Contributor Author

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wesm thanks for the review. I addressed comments and rebased off of master to remove the first commit.

@@ -871,6 +877,8 @@ class MultipathLevelBuilderImpl : public MultipathLevelBuilder {
std::move(write_leaf_callback));
}

bool Nested() const override { return !data_->child_data.empty(); }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

ctx);
PARQUET_CATCH_AND_RETURN(column_writer->WriteArrow(
result.def_levels, result.rep_levels, result.def_rep_level_count,
*values_array, ctx, level_builder->Nested(), result.leaf_is_nullable));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. I suppose it is too late to revisit this? Perhaps provide status/result returning methods in one PR and then deprecated exception throwing ones?

bool leaf_nulls_are_canonical =
(level_info_.def_level == level_info_.repeated_ancestor_def_level + 1) &&
array_nullable;
bool maybe_has_nulls = nested && !(leaf_is_not_nullable || leaf_nulls_are_canonical);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I renamed maybe_has_nulls to maybe_has_parent_nulls which is hopefully clearer? Happy to pick another name that makes sense.

*null_count = io.null_count;
}

std::shared_ptr<Array> MaybeUpdateArray(std::shared_ptr<Array> array,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

buffers[0] = bits_buffer_;
DCHECK(array->num_fields() == 0);
return arrow::MakeArray(std::make_shared<ArrayData>(
array->type(), array->length(), std::move(buffers), new_null_count));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, looks like:https://issues.apache.org/jira/browse/ARROW-7071 might be it?

@emkornfield
Copy link
Contributor Author

@xhochy did you want to review?

ArrowWriteContext* ctx, bool nested, bool array_nullable) override {
BEGIN_PARQUET_CATCH_EXCEPTIONS
bool leaf_is_not_nullable = !level_info_.HasNullableValues();
// Leaf nulls are canonical when there is only a single null element and it is at the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"single nullable element" perhaps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

// leaf.
bool leaf_nulls_are_canonical =
(level_info_.def_level == level_info_.repeated_ancestor_def_level + 1) &&
array_nullable;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

array_nullable refers to the parent, the root, the leaf? This is difficult to follow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps rename to parent_nullable or root_nullable or...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is the leaf, will do some renaming to make this clearer.

ArrowWriteContext* ctx) override {
ArrowWriteContext* ctx, bool nested, bool array_nullable) override {
BEGIN_PARQUET_CATCH_EXCEPTIONS
bool leaf_is_not_nullable = !level_info_.HasNullableValues();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe avoid double negatives?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will rename.

(level_info_.def_level == level_info_.repeated_ancestor_def_level + 1) &&
array_nullable;
bool maybe_parent_nulls =
nested && !(leaf_is_not_nullable || leaf_nulls_are_canonical);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, if nested is false, is all this complicated dance required?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nested is actually unncessary. i've removed it. The only thing that matters is if the column is nullable according to columninfo and it isn't the only nullable column.

arrow::AllocateResizableBuffer(
BitUtil::BytesForBits(properties_->write_batch_size()), ctx->memory_pool));
bits_buffer_->ZeroPadding();
std::static_pointer_cast<ResizableBuffer>(AllocateBuffer(allocator_, 0));
Copy link
Member

@pitrou pitrou Sep 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this allocating a new (temporary?) validity buffer for each write batch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line should be removed. but above, yes, we do allocate a new buffer for each WriteArrow call. I think the lifecycle of this object might only be used for one WriteArrow call. internally there is a concept of batching, and the allocation should only happen once for here for each of those batches.

@xhochy
Copy link
Member

xhochy commented Sep 22, 2020

I reserved my self an hour tomorrow to review this. I haven't touched this code for over a year but this is the code path that actually got me into Arrow/Parquet project, so I'm happy to carve out time for it.

3);
}

TEST(ArrowReadWrite, NestedRequiredField) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test cases look very very similar, just the name and the used values differ. I would have expected that we also would have set nullable=false somewhere in this one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you looking at the latest version?

Right below this comment is:

auto int_field = ::arrow::field("int_array", ::arrow::int32(), /*nullable=*/false);

(note the last parameter?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, didn't see that when reviewing this code. Now this makes sense!

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 from me.

@pitrou pitrou closed this in 04eb733 Sep 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants