GH-34460: [C++][Parquet] Split arrow::FileReader::ReadRowGroups() for flexible async IO #34461

Hor911 · 2023-03-05T23:37:19Z

Rationale for this change

The current implementation of parquet::arrow::FileReader::ReadRowGroups() has a synchronous interface, which complicates its use in environments where additional threads are undesirable. Splitting this method into two parts fixes that. Details and usage examples are in the issue description.

What changes are included in this PR?

Two new methods in the parquet::arrow::FileReader class.

Are these changes tested?

The changes have been actively tested in our private repository.

Are there any user-facing changes?

No changes to existing functionality. The PR is simple, and no regressions are expected.

Closes: [C++] Split arrow::FileReader::ReadRowGroups() to 2 methods for flexible async IO #34460

github-actions · 2023-03-05T23:37:38Z

Closes: [C++] Split arrow::FileReader::ReadRowGroups() to 2 methods for flexible async IO #34460

github-actions · 2023-03-05T23:37:40Z

⚠️ GitHub issue #34460 has been automatically assigned in GitHub to PR creator.

wjones127 · 2023-03-06T17:09:21Z

cpp/src/parquet/arrow/reader.h

+  virtual ::arrow::Status WillNeedRowGroups(const std::vector<int>& row_groups,
+                                            const std::vector<int>& column_indices) = 0;


How does the caller know when the row groups are loaded? Should this return a Future instead?

Hor911 · 2023-03-06

That can't be expressed in this API. The method translates into a call to arrow::io::RandomAccessFile::WillNeed().

A no-op is the default, and valid, implementation of WillNeed. It means that this RandomAccessFile (RAF) implementation provides no preload/prefetch, so all the work is done when ReadAt or ReadAsync is called.
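To make the no-op case concrete, here is a minimal sketch using only the standard library. `SimpleFile` and this `ReadRange` are hypothetical stand-ins, not the Arrow types, and the `bool` return value stands in for a status. It only illustrates that a hint which does nothing still leaves reads correct, because the read itself does the work.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a byte range; not the Arrow type.
struct ReadRange {
  int64_t offset;
  int64_t length;
};

// Hypothetical stand-in for a random-access file; not the Arrow interface.
class SimpleFile {
 public:
  explicit SimpleFile(std::string data) : data_(std::move(data)) {}
  virtual ~SimpleFile() = default;

  // Default hint handler: does nothing and reports success.
  // Subclasses may override it to start a prefetch.
  virtual bool WillNeed(const std::vector<ReadRange>& /*ranges*/) { return true; }

  // With a no-op hint, all the work happens here at read time.
  std::string ReadAt(int64_t offset, int64_t length) const {
    return data_.substr(static_cast<size_t>(offset), static_cast<size_t>(length));
  }

 private:
  std::string data_;
};
```

A caller can always issue the hint first and read afterwards; with the default handler the hint is simply ignored.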

The current Arrow API expects tight coupling between FileReader, ParquetFileReader, and the intermediate Cache. True async decoupling is not possible without significant API changes (this was discussed somewhere).

For my technique to work, one has to provide a special implementation of arrow::io::RandomAccessFile that receives WillNeed, downloads the data, and signals completion in some "hidden" way. It's not perfect, but it let me achieve what I needed without API changes or any other side effects.
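A rough sketch of that idea, again with the standard library only. Everything here is an assumption made for illustration: `PrefetchingFile`, the `fetch` callback standing in for a remote download, and the `on_ready` callback standing in for the "hidden" completion signal. It is not the code from this PR, and the download runs synchronously to keep the example short, where a real implementation would presumably start it in the background.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a byte range; not the Arrow type.
struct ReadRange {
  int64_t offset;
  int64_t length;
};

// Hypothetical file that buffers hinted ranges and notifies out of band.
class PrefetchingFile {
 public:
  using Fetch = std::function<std::string(const ReadRange&)>;
  using OnReady = std::function<void(const ReadRange&)>;

  PrefetchingFile(Fetch fetch, OnReady on_ready)
      : fetch_(std::move(fetch)), on_ready_(std::move(on_ready)) {}

  // Hint: download each range once, buffer it, then fire the signal.
  void WillNeed(const std::vector<ReadRange>& ranges) {
    for (const ReadRange& r : ranges) {
      const auto key = std::make_pair(r.offset, r.length);
      if (buffers_.count(key) != 0) continue;  // repeated hint: nothing to do
      buffers_[key] = fetch_(r);
      on_ready_(r);
    }
  }

  // Serve from the buffer when the range was hinted; otherwise fetch on demand.
  std::string ReadAt(int64_t offset, int64_t length) {
    const auto it = buffers_.find(std::make_pair(offset, length));
    if (it != buffers_.end()) return it->second;
    return fetch_(ReadRange{offset, length});
  }

 private:
  Fetch fetch_;
  OnReady on_ready_;
  std::map<std::pair<int64_t, int64_t>, std::string> buffers_;
};
```

In this sketch the caller learns about completion only through `on_ready`, which is exactly the part the reader-level API cannot express.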

I think I'll be able to give you a link to a real use case tomorrow.

westonpace · 2023-03-06

#14723 adds a filesystem method for "read many". I would like to see this method support plugging and splitting in the same way that ReadRangeCache does today (then ReadRangeCache will only be needed if you need true "caching"). Then I think we can use that instead of the ReadRangeCache.

This will allow local filesystems to rely on the OS for plugging & splitting and will allow remote filesystems like S3 to adapt the algorithm to their needs. It's also async and returns a future reliably so you can then return a future from this method (I agree that would be desired).
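For readers unfamiliar with the terms, a simplified sketch of "plugging" (merging ranges across small holes) and "splitting" (cutting oversized ranges) might look like the following. `CoalesceRanges`, `hole_limit`, and `max_len` are hypothetical names chosen for this example, and the algorithm is deliberately naive; it is not Arrow's implementation. It assumes `max_len > 0`.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a byte range; not the Arrow type.
struct ReadRange {
  int64_t offset;
  int64_t length;
};

// Merge ranges whose gap is at most hole_limit ("plugging"), then cut any
// merged range longer than max_len into pieces ("splitting").
// Precondition: max_len > 0.
std::vector<ReadRange> CoalesceRanges(std::vector<ReadRange> ranges,
                                      int64_t hole_limit, int64_t max_len) {
  std::sort(ranges.begin(), ranges.end(),
            [](const ReadRange& a, const ReadRange& b) { return a.offset < b.offset; });

  std::vector<ReadRange> merged;
  for (const ReadRange& r : ranges) {
    if (r.length <= 0) continue;  // skip empty ranges
    if (!merged.empty()) {
      ReadRange& last = merged.back();
      const int64_t last_end = last.offset + last.length;
      if (r.offset - last_end <= hole_limit) {
        // Close enough (or overlapping): extend the previous range.
        last.length = std::max(last_end, r.offset + r.length) - last.offset;
        continue;
      }
    }
    merged.push_back(r);
  }

  std::vector<ReadRange> out;
  for (const ReadRange& m : merged) {
    for (int64_t pos = 0; pos < m.length; pos += max_len) {
      out.push_back({m.offset + pos, std::min(max_len, m.length - pos)});
    }
  }
  return out;
}
```

A local filesystem could skip this step and leave it to the OS, while a remote filesystem could pick limits that match its request costs, which is the flexibility described above.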

wgtmac · 2023-03-07

+1 for @westonpace's suggestion.

In addition, what happens if WillNeedRowGroups is called more than once, with or without the same inputs? In my experience, maintaining that state is rather tricky. If the new function only issues I/O hints to the RandomAccessFile, it is probably much easier to reason about the behavior directly from RandomAccessFile.

amol- · 2023-03-30T17:26:22Z

Closing because it has been untouched for a while. If it's still relevant, feel free to reopen and move it forward 👍

Split arrow::FileReader::ReadRowGroups() for flexible async IO

308af41

Hor911 requested a review from wjones127 as a code owner March 5, 2023 23:37

github-actions bot added the Component: C++, Component: Parquet, and awaiting review labels Mar 5, 2023

kou changed the title ~~GH-34460: [C++] Split arrow::FileReader::ReadRowGroups() for flexible async IO~~ GH-34460: [C++][Parquet] Split arrow::FileReader::ReadRowGroups() for flexible async IO Mar 6, 2023

Clang formatting

a82d751

robot-piglet pushed a commit to ydb-platform/ydb that referenced this pull request Mar 6, 2023

[C++][Parquet] Split arrow::FileReader::ReadRowGroups() for flexible …

ff7e0c3

…async IO #34461 apache/arrow#34461

wjones127 reviewed Mar 6, 2023

github-actions bot added the awaiting changes label and removed the awaiting review label Mar 6, 2023

amol- closed this Mar 30, 2023
