@wgtmac wgtmac commented Feb 9, 2023

Rationale for this change

ColumnWriter::WriteArrowDictionary tries to update the page statistics, but the update goes wrong when a single write is split into batches and more than one page is written.

What changes are included in this PR?

Make sure the statistics are updated for every batch that is written.
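
For readers unfamiliar with the write path, here is a minimal sketch of the idea. It is a toy model, not Arrow's ColumnWriter: ToyColumnWriter, PageStats, batch_size and page_limit are hypothetical names, and the assumption is that a page can be flushed in the middle of a single write, so its statistics must already be up to date when that happens.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical per-page statistics (min/max/count only, for brevity).
struct PageStats {
  int64_t min = std::numeric_limits<int64_t>::max();
  int64_t max = std::numeric_limits<int64_t>::min();
  int64_t num_values = 0;

  void Update(const int64_t* values, int64_t n) {
    for (int64_t i = 0; i < n; ++i) {
      min = std::min(min, values[i]);
      max = std::max(max, values[i]);
    }
    num_values += n;
  }
};

// Toy writer: only models the batching and page-flush shape of the problem.
class ToyColumnWriter {
 public:
  ToyColumnWriter(int64_t batch_size, int64_t page_limit)
      : batch_size_(batch_size), page_limit_(page_limit) {
    assert(batch_size > 0 && page_limit > 0);
  }

  // Splits one logical write into batches. Statistics are updated inside the
  // per-batch step, before the page-limit check, so a page flushed in the
  // middle of the write carries the stats of exactly its own values.
  void Write(const std::vector<int64_t>& values) {
    const int64_t total = static_cast<int64_t>(values.size());
    for (int64_t offset = 0; offset < total; offset += batch_size_) {
      const int64_t n = std::min(batch_size_, total - offset);
      current_.Update(values.data() + offset, n);
      if (current_.num_values >= page_limit_) FlushPage();
    }
  }

  // Flushes any remaining buffered values as a final page.
  void Close() {
    if (current_.num_values > 0) FlushPage();
  }

  const std::vector<PageStats>& pages() const { return pages_; }

 private:
  void FlushPage() {
    pages_.push_back(current_);
    current_ = PageStats{};  // start fresh statistics for the next page
  }

  int64_t batch_size_;
  int64_t page_limit_;
  PageStats current_;
  std::vector<PageStats> pages_;
};
```

With a batch size of 2 and a page limit of 4, writing the nine values 1, 2, 3, 4, 10, 20, 30, 40, 7 in one call and then closing the writer produces three pages in this model, with min/max of 1/4, 10/40 and 7/7. If the statistics were updated only once after the whole loop, the pages flushed mid-write would not reflect their own values.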

Are these changes tested?

Yes. Added a test case that fails without the fix.

Are there any user-facing changes?

No.

@wgtmac wgtmac requested a review from wjones127 as a code owner February 9, 2023 20:33
github-actions bot commented Feb 9, 2023

⚠️ GitHub issue #34106 has been automatically assigned in GitHub to PR creator.

wgtmac commented Feb 9, 2023

Converted to draft because I hit another issue: #14870. The C++ Parquet reader does not parse column statistics correctly here: https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_reader.cc#L214

// Extracts encoded statistics from V1 and V2 data page headers
template <typename H>
EncodedStatistics ExtractStatsFromHeader(const H& header) {
  EncodedStatistics page_statistics;
  if (!header.__isset.statistics) {
    return page_statistics;
  }
  const format::Statistics& stats = header.statistics;
  if (stats.__isset.max) {
    page_statistics.set_max(stats.max);
  }
  if (stats.__isset.min) {
    page_statistics.set_min(stats.min);
  }
  if (stats.__isset.null_count) {
    page_statistics.set_null_count(stats.null_count);
  }
  if (stats.__isset.distinct_count) {
    page_statistics.set_distinct_count(stats.distinct_count);
  }
  return page_statistics;
}

Once #34112 is merged, the test failure here should be resolved.
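
To make the quoted helper easier to follow in isolation, here is a small self-contained sketch of the same copy-if-present pattern. All types here (MockStatistics, MockDataPageHeaderV1, MockDataPageHeaderV2, MockEncodedStatistics) and the function name ExtractStatsSketch are hypothetical stand-ins for the Thrift-generated and Parquet types, and the `isset` member stands in for Thrift's `__isset`. It only shows how one template serves both header versions; it does not reproduce whatever goes wrong in #14870.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for the Thrift-generated statistics struct.
struct MockStatisticsIsSet {
  bool max = false;
  bool min = false;
  bool null_count = false;
  bool distinct_count = false;
};
struct MockStatistics {
  std::string max;
  std::string min;
  int64_t null_count = 0;
  int64_t distinct_count = 0;
  MockStatisticsIsSet isset;
};

// Hypothetical stand-ins for the V1 and V2 data page headers.
struct MockHeaderIsSet {
  bool statistics = false;
};
struct MockDataPageHeaderV1 {
  MockStatistics statistics;
  MockHeaderIsSet isset;
};
struct MockDataPageHeaderV2 {
  MockStatistics statistics;
  MockHeaderIsSet isset;
  int32_t num_rows = 0;  // extra field, only to make the two types differ
};

// Hypothetical stand-in for the reader-side statistics holder.
struct MockEncodedStatistics {
  std::string max;
  std::string min;
  int64_t null_count = 0;
  int64_t distinct_count = 0;
  bool has_max = false;
  bool has_min = false;
  bool has_null_count = false;
  bool has_distinct_count = false;
};

// Copies only the fields the header marks as present. Works for any header
// type H that exposes `statistics` and `isset.statistics`.
template <typename H>
MockEncodedStatistics ExtractStatsSketch(const H& header) {
  MockEncodedStatistics out;
  if (!header.isset.statistics) {
    return out;  // no statistics in this page header
  }
  const MockStatistics& stats = header.statistics;
  if (stats.isset.max) {
    out.max = stats.max;
    out.has_max = true;
  }
  if (stats.isset.min) {
    out.min = stats.min;
    out.has_min = true;
  }
  if (stats.isset.null_count) {
    out.null_count = stats.null_count;
    out.has_null_count = true;
  }
  if (stats.isset.distinct_count) {
    out.distinct_count = stats.distinct_count;
    out.has_distinct_count = true;
  }
  return out;
}
```

A field that the header does not flag as set is left at its default and its has_ flag stays false, so a caller can tell "absent" apart from "present with a zero or empty value".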

@wgtmac wgtmac changed the title from "GH-34106: Fix updating page stats for WriteArrowDictionary" to "GH-34106: [C++][Parquet] Fix updating page stats for WriteArrowDictionary" Feb 15, 2023
@wgtmac wgtmac force-pushed the write_dict_update_stats branch from 09de262 to af1318e Compare February 17, 2023 16:33
@wgtmac wgtmac marked this pull request as ready for review February 17, 2023 16:33
wgtmac commented Feb 17, 2023

@westonpace @wjones127 Please take a look. Thanks!

@wjones127 wjones127 left a comment

Thanks for fixing this!

wgtmac commented Feb 21, 2023

Gentle ping @wjones127

@wjones127 wjones127 merged commit 476eb2e into apache:main Feb 21, 2023
ursabot commented Feb 21, 2023

Benchmark runs are scheduled for baseline = 6850923 and contender = 476eb2e. 476eb2e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.46% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️1.02%] ursa-i9-9960x
[Finished ⬇️0.13% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 476eb2ec ec2-t3-xlarge-us-east-2
[Failed] 476eb2ec test-mac-arm
[Finished] 476eb2ec ursa-i9-9960x
[Finished] 476eb2ec ursa-thinkcentre-m75q
[Finished] 6850923c ec2-t3-xlarge-us-east-2
[Failed] 6850923c test-mac-arm
[Finished] 6850923c ursa-i9-9960x
[Finished] 6850923c ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Development

Successfully merging this pull request may close these issues.

[C++][Parquet] Fix updating page statistics for WriteArrowDictionary
