@kou kou commented Sep 4, 2024

Rationale for this change

Statistics are useful for fast processing.

Target types:

  • UInt8
  • Int8
  • UInt16
  • Int16
  • UInt32
  • UInt64
  • Date32
  • Time32
  • Time64
  • Duration

What changes are included in this PR?

Map ColumnChunkMetaData information to arrow::ArrayStatistics.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

…: non zero-copy int based types

@kou kou requested a review from wgtmac as a code owner September 4, 2024 08:36
github-actions bot commented Sep 4, 2024

⚠️ GitHub issue #43944 has been automatically assigned in GitHub to PR creator.

Comment on lines +361 to +362
array_statistics->is_min_exact = true;
array_statistics->is_max_exact = true;
Member

Can you add a corresponding comment here? This might be a bit tricky.

Member Author

Sure. We should document the discussion at #43595 (comment), right?

BTW, could you share the e-mail URL for #43595 (comment)?

I guess no, I'll send a mail to the mailing list to make sure

I couldn't find it at https://lists.apache.org/list.html?dev@parquet.apache.org .

Ah, I forgot to add a writer check here. I should have set true only when the writer is Apache Parquet C++. I'll fix it.

Member

@mapleFU mapleFU Sep 5, 2024

Sorry, let me set up a discussion. Generally, if the file is from Parquet C++, it will work. I'm a bit busy this morning preparing for my tour; I'll try to work it out around noon.

Member Author

No problem. Thanks!

Member

Member Author

Thanks!

Member Author

I've added a comment explaining that we can always use true for integer-based min/max.

Based on your e-mail, I didn't need `if (::arrow::internal::StartsWith(ctx->reader->metadata()->created_by(), "parquet-cpp-arrow"))` for this case.
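
For context, the writer check discussed in this thread amounts to a prefix test on the file's created_by string. The following is only a minimal, self-contained sketch: `CreatedByStartsWith` is a hypothetical stand-in for `::arrow::internal::StartsWith`, and the created_by values in the comment are assumed examples, not values taken from this PR.

```cpp
#include <string>

// Hypothetical stand-in for ::arrow::internal::StartsWith.
// Returns true when `created_by` begins with `writer_prefix`.
inline bool CreatedByStartsWith(const std::string& created_by,
                                const std::string& writer_prefix) {
  // compare() only looks at the first writer_prefix.size() characters,
  // and handles a created_by shorter than the prefix.
  return created_by.compare(0, writer_prefix.size(), writer_prefix) == 0;
}

// Assumed example values:
//   CreatedByStartsWith("parquet-cpp-arrow version 17.0.0", "parquet-cpp-arrow")
//     is true
//   CreatedByStartsWith("parquet-mr version 1.13.0", "parquet-cpp-arrow")
//     is false
```

As noted above, the merged change does not apply such a guard to integer-based min/max; the sketch only shows what the guard would test.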

Member

FYI, I found that string and FLBA statistics might be truncated. For the other types, public implementations do not truncate the statistics when they exist.
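
To illustrate why truncated string or FLBA bounds cannot be flagged as exact, here is a generic sketch of bound truncation. It is not the behaviour of any particular Parquet writer; `Bound`, `TruncateMin`, and `TruncateMax` are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type: a bound plus a flag saying whether it is exact.
struct Bound {
  std::string value;
  bool is_exact;
};

// A prefix of s never compares greater than s, so it is still a valid
// lower bound, but it is no longer the exact minimum.
inline Bound TruncateMin(const std::string& s, std::size_t max_len) {
  assert(max_len > 0);
  if (s.size() <= max_len) return {s, true};
  return {s.substr(0, max_len), false};
}

// A prefix of s may compare less than s, so the last byte is incremented
// to keep the result a valid upper bound. Trailing 0xFF bytes cannot be
// incremented and are dropped first.
inline Bound TruncateMax(const std::string& s, std::size_t max_len) {
  assert(max_len > 0);
  if (s.size() <= max_len) return {s, true};
  std::string prefix = s.substr(0, max_len);
  while (!prefix.empty() &&
         static_cast<unsigned char>(prefix.back()) == 0xFF) {
    prefix.pop_back();
  }
  if (prefix.empty()) return {s, true};  // No shorter bound; keep the full value.
  prefix.back() =
      static_cast<char>(static_cast<unsigned char>(prefix.back()) + 1);
  return {prefix, false};
}
```

Integer statistics have a fixed width and are not shortened this way, which is consistent with always setting is_min_exact and is_max_exact to true for them.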

auto array_data =
    ::arrow::ArrayData::Make(field->type(), length, std::move(buffers), null_count);
auto array_statistics = std::make_shared<::arrow::ArrayStatistics>();
array_statistics->null_count = null_count;
Member

FYI, the null_count for some types (nested ones) may be a bit weird.

Member Author

Thanks for the information.
Let's revisit it when we add support for arrow::ArrayStatistics of nested types.

@github-actions github-actions bot added the awaiting changes label and removed the awaiting committer review label Sep 5, 2024
@github-actions github-actions bot added the awaiting change review label and removed the awaiting changes label Sep 5, 2024
@github-actions github-actions bot added the awaiting changes and awaiting change review labels and removed the awaiting change review and awaiting changes labels Sep 5, 2024
array_statistics->null_count = null_count;
auto statistics = metadata->statistics().get();
if (statistics) {
  if (statistics->HasDistinctCount()) {
Member

Do we need a separate function for the stats conversion?

Member Author

I'll do it when I add more target types in the next pull request.
By then I'll know what the common pattern is.
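
As a rough idea of what a separate conversion function could look like, here is a self-contained sketch. `ChunkStats` and `ArrayStats` are hypothetical stand-ins for the Parquet statistics object and `arrow::ArrayStatistics`, reduced to 64-bit integer values; this is not the code in this PR.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the statistics stored in ColumnChunkMetaData.
struct ChunkStats {
  bool has_distinct_count = false;
  std::int64_t distinct_count = 0;
  bool has_min_max = false;
  std::int64_t min = 0;
  std::int64_t max = 0;
};

// Hypothetical stand-in for arrow::ArrayStatistics (integer values only).
struct ArrayStats {
  std::optional<std::int64_t> null_count;
  std::optional<std::int64_t> distinct_count;
  std::optional<std::int64_t> min;
  std::optional<std::int64_t> max;
  bool is_min_exact = false;
  bool is_max_exact = false;
};

// Copies whatever the chunk statistics provide into the array statistics.
// `stats` may be null when the column chunk carries no statistics.
inline ArrayStats ConvertStatistics(std::int64_t null_count,
                                    const ChunkStats* stats) {
  ArrayStats out;
  out.null_count = null_count;
  if (stats == nullptr) return out;
  if (stats->has_distinct_count) out.distinct_count = stats->distinct_count;
  if (stats->has_min_max) {
    out.min = stats->min;
    out.max = stats->max;
    // Integer-based min/max are treated as exact, as discussed in this thread.
    out.is_min_exact = true;
    out.is_max_exact = true;
  }
  return out;
}
```

Keeping the conversion in one function like this would leave a single place to extend when more target types are added.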

@github-actions github-actions bot added the awaiting changes label and removed the awaiting change review label Sep 5, 2024
Member

@wgtmac wgtmac left a comment

+1

@kou kou merged commit 262d6f6 into apache:main Sep 5, 2024
@kou kou deleted the cpp-parquet-statistics branch September 5, 2024 20:41
@kou kou removed the awaiting changes label Sep 5, 2024
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 262d6f6.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

khwilson pushed a commit to khwilson/arrow that referenced this pull request Sep 14, 2024
…: non zero-copy int based types (apache#43945)

* GitHub Issue: apache#43944

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
RETURN_NOT_OK(
    TransferColumnData(record_reader_.get(), field_, descr_, ctx_->pool, &out_));
RETURN_NOT_OK(TransferColumnData(record_reader_.get(),
                                 input_->column_chunk_metadata(), field_, descr_,
Contributor

The call to input_->column_chunk_metadata() fails here if the list of row_groups in input_ is empty, because in that case input_ is not yet properly initialized at this point.

It fails via row_group_metadata() -> RowGroup(-1).
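
The failure mode can be pictured with a small model. This sketch is not the reader's actual code: `ChunkInput` and `ChunkMetadata` are hypothetical types that only mimic an input whose current index starts at -1 until the first row group is loaded.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for per-chunk metadata.
struct ChunkMetadata {
  long long num_values;
};

// Hypothetical model of an input that walks over row groups.
class ChunkInput {
 public:
  explicit ChunkInput(std::vector<ChunkMetadata> chunks)
      : chunks_(std::move(chunks)) {}

  // Moves to the next chunk; returns false when there is none.
  bool Next() {
    if (static_cast<std::size_t>(current_ + 1) >= chunks_.size()) return false;
    ++current_;
    return true;
  }

  // Returns nullptr instead of indexing with -1 when nothing is loaded yet,
  // which is always the case for an empty list.
  const ChunkMetadata* current_metadata() const {
    if (current_ < 0) return nullptr;
    return &chunks_[static_cast<std::size_t>(current_)];
  }

 private:
  std::vector<ChunkMetadata> chunks_;
  int current_ = -1;
};
```

A caller would then skip the statistics mapping when current_metadata() returns nullptr, rather than dereferencing metadata for a row group that does not exist.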

Member Author

Good catch!
Could you open a new issue for it?

Contributor

@github-actions github-actions bot added the awaiting committer review label Jan 23, 2025