@Hor911 commented Mar 5, 2023

Rationale for this change

The current implementation of arrow::FileReader::ReadRowGroups() has a synchronous interface, which complicates its use in environments where additional threads are undesirable. Splitting this method into two parts fixes that. Details and usage examples are in the issue description.

What changes are included in this PR?

Two new methods in the arrow::FileReader class.

Are these changes tested?

The changes have been actively tested in our private repository.

Are there any user-facing changes?

No changes to the current functionality. The PR is simple, and no regressions are expected.

@Hor911 requested a review from wjones127 as a code owner on March 5, 2023 23:37
github-actions bot commented Mar 5, 2023

⚠️ GitHub issue #34460 has been automatically assigned in GitHub to PR creator.

@kou changed the title from "GH-34460: [C++] Split arrow::FileReader::ReadRowGroups() for flexible async IO" to "GH-34460: [C++][Parquet] Split arrow::FileReader::ReadRowGroups() for flexible async IO" on Mar 6, 2023
robot-piglet pushed a commit to ydb-platform/ydb that referenced this pull request Mar 6, 2023
Comment on lines +252 to +253
virtual ::arrow::Status WillNeedRowGroups(const std::vector<int>& row_groups,
                                          const std::vector<int>& column_indices) = 0;
Member

How does the caller know when the row groups are loaded? Should this return a Future instead?

Author

That can't be expressed in this API. This method translates into a call to arrow::io::RandomAccessFile::WillNeed().

A no-op is the default implementation of WillNeed, and a valid one. It means that this RandomAccessFile (RAF) implementation provides no preload/prefetch, so all the work is done when ReadAt or ReadAsync is called.

The current Arrow API expects tight coupling between FileReader, ParquetFileReader and the intermediate cache. It is not possible to provide true async decoupling without significant API changes (this was discussed somewhere).

For my technique to work, one has to provide a special implementation of arrow::io::RandomAccessFile that receives WillNeed, downloads the data and signals completion in some "hidden" way. It's not perfect, but it let me get what I needed without API changes or any other side effects.
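To make the two-phase idea concrete, here is a minimal sketch. It does not use Arrow: `PrefetchingFile`, `ByteRange` and both callbacks are hypothetical stand-ins that only mimic the shape of `WillNeed` and `ReadAt`, and the "download" runs inline instead of asynchronously.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a byte range (offset + length in bytes).
struct ByteRange {
  int64_t offset;
  int64_t length;
};

// Hypothetical file type, NOT Arrow's API: it only mimics the shape of
// RandomAccessFile::WillNeed / ReadAt to illustrate the two phases.
class PrefetchingFile {
 public:
  // `fetch` simulates downloading a byte range; `on_ready` is the "hidden"
  // side channel that tells the caller a range has arrived.
  PrefetchingFile(std::function<std::string(ByteRange)> fetch,
                  std::function<void(ByteRange)> on_ready)
      : fetch_(std::move(fetch)), on_ready_(std::move(on_ready)) {
    assert(fetch_ && on_ready_);  // both callbacks are required
  }

  // Phase 1: receive the hint, fetch each range, then signal readiness.
  // A real implementation would start the download and return immediately.
  void WillNeed(const std::vector<ByteRange>& ranges) {
    for (const ByteRange& r : ranges) {
      cache_[r.offset] = fetch_(r);
      on_ready_(r);
    }
  }

  // Phase 2: serve from prefetched data when available; otherwise fetch
  // directly, which is what happens when WillNeed is a no-op.
  std::string ReadAt(int64_t offset, int64_t length) {
    auto it = cache_.find(offset);
    if (it != cache_.end() &&
        static_cast<int64_t>(it->second.size()) >= length) {
      return it->second.substr(0, static_cast<std::size_t>(length));
    }
    return fetch_(ByteRange{offset, length});
  }

 private:
  std::function<std::string(ByteRange)> fetch_;
  std::function<void(ByteRange)> on_ready_;
  std::map<int64_t, std::string> cache_;  // keyed by range offset
};
```

In this sketch the caller issues `WillNeed` first, waits for the `on_ready` notifications through whatever event loop it already has, and only then calls `ReadAt`, so the read itself never blocks on I/O.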

I think I'll be able to provide a link to a real use case tomorrow.

Member

#14723 adds a filesystem method for "read many". I would like to see this method support plugging and splitting in the same way that ReadRangeCache does today (then, ReadRangeCache will only be needed if you need true "caching"). Then I think we can use that instead of the ReadRangeCache.

This will allow local filesystems to rely on the OS for plugging & splitting and will allow remote filesystems like S3 to adapt the algorithm to their needs. It's also async and returns a future reliably so you can then return a future from this method (I agree that would be desired).
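As a rough illustration of what such range handling could look like, the sketch below merges requested byte ranges. It assumes that "plugging" means merging ranges separated by small holes and that "splitting" means keeping each merged read under a size cap. `ByteRange`, `CoalesceRanges` and both limit parameters are hypothetical names for this example, not Arrow's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a byte range (offset + length in bytes).
struct ByteRange {
  int64_t offset;
  int64_t length;
};

// Merge ranges whose gap is at most `hole_size_limit` bytes, but never let a
// merged range grow beyond `range_size_limit` bytes. A single input range
// that is already larger than the cap is passed through unchanged.
std::vector<ByteRange> CoalesceRanges(std::vector<ByteRange> ranges,
                                      int64_t hole_size_limit,
                                      int64_t range_size_limit) {
  assert(hole_size_limit >= 0 && range_size_limit > 0);
  std::sort(ranges.begin(), ranges.end(),
            [](const ByteRange& a, const ByteRange& b) {
              return a.offset < b.offset;
            });
  std::vector<ByteRange> out;
  for (const ByteRange& r : ranges) {
    if (r.length <= 0) continue;  // skip empty ranges
    if (!out.empty()) {
      ByteRange& last = out.back();
      const int64_t last_end = last.offset + last.length;
      const int64_t new_end = std::max(last_end, r.offset + r.length);
      const int64_t gap = r.offset - last_end;  // negative when overlapping
      if (gap <= hole_size_limit &&
          new_end - last.offset <= range_size_limit) {
        last.length = new_end - last.offset;  // plug the hole
        continue;
      }
    }
    out.push_back(r);  // start a new read
  }
  return out;
}
```

A local filesystem could skip this step and leave merging to the OS, while a remote filesystem could pick limits that match its request latency and cost model.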

Member

+1 for @westonpace's suggestion.

In addition, what happens if WillNeedRowGroups is called more than once, with or without the same inputs? In my experience, maintaining that state is rather tricky. If the new function only issues I/O hints to the RandomAccessFile, it is probably much easier to reason about the behavior directly from the RandomAccessFile.
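To show the kind of state a repeated call would force the reader to keep, here is a small sketch. `HintTracker` and `NewChunks` are hypothetical names, not part of Arrow or this PR, and the sketch assumes that a repeated hint for the same (row group, column) chunk should be dropped.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Hypothetical helper: remembers which (row group, column) chunks have
// already been hinted, so repeated calls do not issue duplicate I/O hints.
class HintTracker {
 public:
  using Chunk = std::pair<int, int>;  // (row group index, column index)

  // Returns only the chunks that no earlier call has requested.
  std::vector<Chunk> NewChunks(const std::vector<int>& row_groups,
                               const std::vector<int>& column_indices) {
    std::vector<Chunk> fresh;
    for (int rg : row_groups) {
      for (int col : column_indices) {
        assert(rg >= 0 && col >= 0);  // indices are non-negative
        if (seen_.emplace(rg, col).second) {
          fresh.emplace_back(rg, col);
        }
      }
    }
    return fresh;
  }

 private:
  std::set<Chunk> seen_;  // every chunk hinted so far
};
```

Even this small tracker raises questions the sketch leaves open, such as when entries may be forgotten and what happens if a hinted chunk is never read, which is the difficulty the comment above points at.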

@github-actions bot added the "awaiting changes" label and removed the "awaiting review" label on Mar 6, 2023
Member

amol- commented Mar 30, 2023

Closing because it has been untouched for a while. If it's still relevant, feel free to reopen and move it forward 👍

@amol- closed this on Mar 30, 2023