GH-45690: [C++][Parquet] Consolidate Arrow write functions under TypedColumnWriterImpl
#45688
cpp/src/parquet/column_writer.cc
Outdated
| @@ -1209,28 +1209,31 @@ Status ConvertDictionaryToDense(const ::arrow::Array& array, MemoryPool* pool, | |||
| return Status::OK(); | |||
| } | |||
|
|
|||
| template <typename DType> | |||
| class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<DType> { | |||
| template <typename ParquetType> | |||
Since the implementation uses both parquet and arrow types, I thought it is better to be explicit. I can restore the previous name though.
This makes sense and already used by some functions in this file.
Since the implementation uses both parquet and arrow types, I thought it is better to be explicit.
+1
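To illustrate the naming point above, here is a minimal sketch of a class template parameterized on a Parquet type with a method template parameterized on an Arrow type. It is not the code in column_writer.cc: `SketchColumnWriter`, `WriteSerialized`, and the three type tags are hypothetical stand-ins invented for this example.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical type tags standing in for a Parquet physical type and
// Arrow logical types; the real classes carry much more information.
struct Int32ParquetType { using c_type = int32_t; };
struct UInt8ArrowType { using c_type = uint8_t; };
struct Int16ArrowType { using c_type = int16_t; };

// The class is parameterized on the Parquet type it stores...
template <typename ParquetType>
class SketchColumnWriter {
 public:
  using T = typename ParquetType::c_type;

  // ...while each method is parameterized on the Arrow type it reads.
  // Naming both parameters explicitly keeps the two type systems apart.
  template <typename ArrowType>
  void WriteSerialized(const std::vector<typename ArrowType::c_type>& values) {
    for (auto v : values) {
      buffer_.push_back(static_cast<T>(v));  // widen to the Parquet C type
    }
  }

  const std::vector<T>& buffer() const { return buffer_; }

 private:
  std::vector<T> buffer_;
};
```

With a generic name such as `DType`, a reader of the method body would have to check which of the two type families the parameter belongs to; `ParquetType` next to `ArrowType` answers that at a glance.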
cpp/src/parquet/column_writer.cc
Outdated
| @@ -1364,6 +1367,41 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter< | |||
| int64_t num_levels, const ::arrow::Array& array, | |||
| ArrowWriteContext* context, bool maybe_parent_nulls); | |||
|
|
|||
| template <typename ArrowType> | |||
| Status WriteArrowSerialize(const int16_t* def_levels, const int16_t* rep_levels, | |||
Consistently using the same argument order as WriteArrow.
58b4427 to 93be82c
93be82c to 0bf997e
cpp/src/parquet/column_writer.cc
Outdated
| this->descr()->schema_node()->is_required() || (array.null_count() == 0); | ||
|
|
||
| if (!maybe_parent_nulls && no_nulls) { | ||
| PARQUET_CATCH_NOT_OK(WriteBatch(num_levels, def_levels, rep_levels, values)); |
What if array.null_count() != 0 here?
This is the existing logic, I haven't changed it.
If there are actually null values in the array, rather than the field merely being nullable, then we need to pick the slower path.
What if array.null_count() != 0 here?
It should not happen that the field is required but has null values: https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_writer.cc#L1324
Ok, thanks. Ideally we would have a DCHECK but since this is just moving code around we'll live without it.
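The path selection discussed in this thread can be sketched as follows, with a plain `assert` standing in for the DCHECK mentioned above. This is an illustration under assumptions, not the Arrow code: `SketchArray`, `WritePath`, and `ChooseWritePath` are hypothetical names, and the sketch assumes that the caller has already rejected a required field containing nulls.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for an Arrow array: a missing value models a null.
struct SketchArray {
  std::vector<std::optional<int32_t>> values;

  int64_t null_count() const {
    int64_t n = 0;
    for (const auto& v : values) {
      if (!v.has_value()) ++n;
    }
    return n;
  }
};

// kDense models the fast path in the diff above; kSpaced models the slower
// path that has to account for null slots.
enum class WritePath { kDense, kSpaced };

inline WritePath ChooseWritePath(const SketchArray& array, bool is_required,
                                 bool maybe_parent_nulls) {
  // Stand-in for the suggested DCHECK: a required field is assumed to have
  // been validated earlier, so it must not carry nulls here.
  assert(!is_required || array.null_count() == 0);

  const bool no_nulls = is_required || (array.null_count() == 0);
  if (!maybe_parent_nulls && no_nulls) {
    return WritePath::kDense;
  }
  return WritePath::kSpaced;
}
```

The assertion documents the invariant that makes `is_required ||` safe: without it, a required field with nulls would silently take the dense path.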
pitrou left a comment
I posted some minor comments and suggestions, but this LGTM in principle.
…terImpl This removes the need of passing the column writer instance and removes a redundant type template parameter.
0bf997e to ad971a1
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 6b66c84. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 16 possible false positives for unstable benchmarks that are known to sometimes produce them.
Rationale for this change
I am planning to introduce WriteBatchInternal and WriteBatchSpacedInternal private methods in #45360, which would have required specifying WriteArrowSerialize, WriteArrowZeroCopy and WriteTimestamps as friend functions. Then I noticed that these functions could be consolidated into the column writer, making the implementation simpler.
What changes are included in this PR?
- Move WriteArrowSerialize, WriteArrowZeroCopy and WriteTimestamps to be methods on TypedColumnWriterImpl.
- Remove the column writer argument, reorder their parameters to align with the WriteArrow public method.
Are these changes tested?
Existing tests should cover these.
Are there any user-facing changes?
No, these are private functions and methods.
Resolves #45690
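The rationale in the description can be sketched with a reduced example. The classes below are hypothetical and far simpler than TypedColumnWriterImpl; only the names WriteArrow, WriteArrowZeroCopy and WriteBatchInternal echo the description, and their signatures here are invented for illustration.

```cpp
#include <cstdint>

// Before (sketch): the helper is a free function, so it needs the writer
// instance as an argument and a friend declaration to reach private methods.
class WriterBefore {
 public:
  int64_t rows_written() const { return rows_written_; }

 private:
  void WriteBatchInternal(int64_t num_levels) { rows_written_ += num_levels; }
  int64_t rows_written_ = 0;

  friend void WriteArrowZeroCopyBefore(WriterBefore* writer, int64_t num_levels);
};

inline void WriteArrowZeroCopyBefore(WriterBefore* writer, int64_t num_levels) {
  writer->WriteBatchInternal(num_levels);
}

// After (sketch): the helper is a private method, so both the writer
// argument and the friend declaration disappear.
class WriterAfter {
 public:
  void WriteArrow(int64_t num_levels) { WriteArrowZeroCopy(num_levels); }
  int64_t rows_written() const { return rows_written_; }

 private:
  void WriteArrowZeroCopy(int64_t num_levels) { WriteBatchInternal(num_levels); }
  void WriteBatchInternal(int64_t num_levels) { rows_written_ += num_levels; }
  int64_t rows_written_ = 0;
};
```

Both variants do the same work; the second keeps the private surface of the class closed, and the helper can share the parameter order of the public method that calls it.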