GH-33596: [C++][Parquet] Parquet page index read support #14964
@wgtmac I took a pass through, mostly looking at interfaces. Let me know what you think of the suggestions. Sorry for the delay, I'm catching up after the holidays.
06be080 to d429260
I have added a reader test to cover the new interface. Now it is complete and ready to review. Any feedback is appreciated. @pitrou @emkornfield @wjones127
@emkornfield Any chance to take another pass? Thanks in advance!
Sorry for the delay, I will be looking at this by EOD Friday.
All comments are addressed. Thank you for the review @pitrou
pitrou left a comment
Nice work @wgtmac, thanks a lot!
Hmm, I think the CI failure in https://github.com/apache/arrow/actions/runs/4075827255/jobs/7022748134#step:9:742 is unfortunately legit. Here is what I think happens:
I'll try to push a possible fix...
Ok, I think the fix worked. I also rebased on latest master.
Benchmark runs are scheduled for baseline = a703a07 and contender = b0e1037. b0e1037 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have high level of regressions.
Basically, the patch provides the following implementation:
- `class RowGroupPageIndexReader` to read the page index from a Parquet row group. It internally leverages the implementation from Apache Impala (link) to merge I/O chunks of the page index in the same row group.
- `class PageIndexReader` to create a `RowGroupPageIndexReader` for each row group.
- `ParquetFileReader` internally creates and caches a single `PageIndexReader` object and exposes it to the end user.

Limitation: