Skip to content

Conversation

@wgtmac
Copy link
Member

@wgtmac wgtmac commented Dec 15, 2022

Basically, the patch provides following implementation:

  • Define class RowGroupPageIndexReader to read page index from a parquet row group. It internally leverages implementation from Apache Impala link to merge I/O chunks of page index in the same row group.
  • Define class PageIndexReader to create RowGroupPageIndexReader for each row group.
  • ParquetFileReader internally creates and caches a single PageIndexReader object and exposes it to the end user.

Limitation:

  • Reading page index from encrypted parquet file is not yet supported. It takes some effort to understand the specs and will be done in a separate patch after the completion of write logic of page index.

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has no components in JIRA, make sure you assign one.

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@emkornfield
Copy link
Contributor

@wgtmac took a pass through mostly looking at interfaces, let me know what you think of the suggestions. Sorry for the delay, catching up after the holidays.

@wgtmac wgtmac requested a review from emkornfield January 8, 2023 13:17
@wgtmac wgtmac changed the title ARROW-18434: [C++][Parquet] Parquet page index read support GH-33596: [C++][Parquet] Parquet page index read support Jan 12, 2023
@github-actions
Copy link

@github-actions
Copy link

⚠️ GitHub issue #33596 has been automatically assigned in GitHub to PR creator.

@wgtmac wgtmac force-pushed the ARROW-18434 branch 2 times, most recently from 06be080 to d429260 Compare January 13, 2023 11:23
@wgtmac
Copy link
Member Author

wgtmac commented Jan 13, 2023

I have added a reader test to cover the new interface. Now it is complete and ready to review. Any feedback is appreciated. @pitrou @emkornfield @wjones127

@wgtmac
Copy link
Member Author

wgtmac commented Jan 20, 2023

@emkornfield Any chance to take another pass? Thanks in advance!

@emkornfield
Copy link
Contributor

Sorry for delay will be looking at this by EOD friday

@wgtmac
Copy link
Member Author

wgtmac commented Feb 2, 2023

All comments are addressed. Thank you for the review @pitrou

@wgtmac wgtmac requested a review from pitrou February 2, 2023 11:39
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @wgtmac , thanks a lot!

@pitrou
Copy link
Member

pitrou commented Feb 2, 2023

Hmm, I think the CI failure in https://github.com/apache/arrow/actions/runs/4075827255/jobs/7022748134#step:9:742 is unfortunately legit. Here is what I think happens:

  • Some Windows headers define macros such as OPTIONAL
  • CMake Unity builds bundle C++ source files together, adding a new source file probably changed the way the bundling happens and uncovered this issue

I'll try to push a possible fix...

@pitrou
Copy link
Member

pitrou commented Feb 2, 2023

Ok, I think the fix worked. I also rebased on latest master.

@pitrou pitrou merged commit b0e1037 into apache:master Feb 2, 2023
@ursabot
Copy link

ursabot commented Feb 3, 2023

Benchmark runs are scheduled for baseline = a703a07 and contender = b0e1037. b0e1037 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.15% ⬆️0.0%] test-mac-arm
[Finished ⬇️2.55% ⬆️0.26%] ursa-i9-9960x
[Finished ⬇️0.57% ⬆️0.03%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] b0e10379 ec2-t3-xlarge-us-east-2
[Finished] b0e10379 test-mac-arm
[Finished] b0e10379 ursa-i9-9960x
[Finished] b0e10379 ursa-thinkcentre-m75q
[Finished] a703a075 ec2-t3-xlarge-us-east-2
[Failed] a703a075 test-mac-arm
[Finished] a703a075 ursa-i9-9960x
[Finished] a703a075 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Feb 3, 2023

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[C++][Parquet] Parquet page index read support

6 participants