Skip to content

Conversation

@kszucs
Copy link
Member

@kszucs kszucs commented Mar 6, 2025

Rationale for this change

I am planning to introduce WriteBatchInternal and WriteBatchSpacedInternal private methods in #45360 which would have required specifying WriteArrowSerialize, WriteArrowZeroCopy and WriteTimestamps as friend functions. Then I noticed that these functions could be consolidated into the column writer making the implementation simpler.

What changes are included in this PR?

  • Move WriteArrowSerialize, WriteArrowZeroCopy and WriteTimestamps to be methods on TypedColumnWriterImpl.
  • Remove the column writer argument, reorder their parameters to align with WriteArrow public method.
  • Use more explicit type parameter names.

Are these changes tested?

Existing tests should cover these.

Are there any user-facing changes?

No, these are private functions and methods.

Resolves #45690

@kszucs kszucs requested a review from wgtmac as a code owner March 6, 2025 14:58
@kszucs kszucs changed the title [C++][Parquet] Consolidate Arrow write functions under TypedColumnWriterImpl [C++][Parquet] Consolidate Arrow write functions under TypedColumnWriterImpl Mar 6, 2025
@@ -1209,28 +1209,31 @@ Status ConvertDictionaryToDense(const ::arrow::Array& array, MemoryPool* pool,
return Status::OK();
}

template <typename DType>
class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<DType> {
template <typename ParquetType>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the implementation uses both parquet and arrow types, I thought it is better to be explicit. I can restore the previous name though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense and already used by some functions in this file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the implementation uses both parquet and arrow types, I thought it is better to be explicit.

+1

@@ -1364,6 +1367,41 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<
int64_t num_levels, const ::arrow::Array& array,
ArrowWriteContext* context, bool maybe_parent_nulls);

template <typename ArrowType>
Status WriteArrowSerialize(const int16_t* def_levels, const int16_t* rep_levels,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consistently using the same argument order as WriteArrow.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Mar 6, 2025
@kszucs kszucs changed the title [C++][Parquet] Consolidate Arrow write functions under TypedColumnWriterImpl GH-45690: [C++][Parquet] Consolidate Arrow write functions under TypedColumnWriterImpl Mar 6, 2025
@github-actions
Copy link

github-actions bot commented Mar 6, 2025

⚠️ GitHub issue #45690 has been automatically assigned in GitHub to PR creator.

@apache apache deleted a comment from github-actions bot Mar 6, 2025
@@ -1209,28 +1209,31 @@ Status ConvertDictionaryToDense(const ::arrow::Array& array, MemoryPool* pool,
return Status::OK();
}

template <typename DType>
class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<DType> {
template <typename ParquetType>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense and already used by some functions in this file.

@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Mar 7, 2025
@kszucs kszucs force-pushed the pq-writer-consolidation branch from 58b4427 to 93be82c Compare March 7, 2025 08:59
@pitrou pitrou self-requested a review March 7, 2025 09:46
@kszucs kszucs force-pushed the pq-writer-consolidation branch from 93be82c to 0bf997e Compare March 10, 2025 12:54
this->descr()->schema_node()->is_required() || (array.null_count() == 0);

if (!maybe_parent_nulls && no_nulls) {
PARQUET_CATCH_NOT_OK(WriteBatch(num_levels, def_levels, rep_levels, values));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if array.null_count() != 0 here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the existing logic, I haven't changed it.
I there are actually null values in the array not just being nullable, then we need to pick the slower path.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if array.null_count() != 0 here?

It should not happen when the field is required but has null values: https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_writer.cc#L1324

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks. Ideally we would have a DCHECK but since this is just moving code around we'll live without it.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I posted some minor comments and suggestions, but this LGTM on the principle.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Mar 12, 2025
…terImpl

This removes the need of passing the column writer instance and removes a redundant type template parameter.
@kszucs kszucs force-pushed the pq-writer-consolidation branch from 0bf997e to ad971a1 Compare March 12, 2025 16:41
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 12, 2025
@kszucs kszucs requested a review from pitrou March 12, 2025 18:29
@wgtmac wgtmac merged commit 6b66c84 into apache:main Mar 13, 2025
37 checks passed
@wgtmac wgtmac removed the awaiting change review Awaiting change review label Mar 13, 2025
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 6b66c84.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 16 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[C++][Parquet] Consolidate Arrow write functions under TypedColumnWriterImpl

4 participants