-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-34460: [C++][Parquet] Split arrow::FileReader::ReadRowGroups() for flexible async IO #34461
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
| virtual ::arrow::Status WillNeedRowGroups(const std::vector<int>& row_groups, | ||
| const std::vector<int>& column_indices) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does the caller know when the row groups are loaded? Should this return a Future instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can't be expressed in this API. This method is translated into call of arrow::io::RandomAccessFile::WillNeed()
No-op is default and valid implementation of WillNeed. It means that no preload/prefetch is provided in this RAF implementation. All work will be done when ReadAt or ReadAsync is called.
Current Arrow API expect tight coupling between FileReader, ParquetFileReader and intermediate Cache. It is not possible to provide true async decoupling w/o significant API changes (it was discussed somewhere).
For my technique to work, one should provide special implementation of arrow::io::RandomAccessFile which will receive WillNeed, download the data and signals it in some "hidden" way. Not perfect, but possible to reach what I needed w/o API changes and any other side effects.
I think I'll be able to provide you link ro real use case tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#14723 adds a filesystem method for "read many". I would like to see this method support plugging and splitting in the same way that ReadRangeCache does today (then, ReadRangeCache will only be needed if you need true "caching"). Then I think we can use that instead of the ReadRangeCache.
This will allow local filesystems to rely on the OS for plugging & splitting and will allow remote filesystems like S3 to adapt the algorithm to their needs. It's also async and returns a future reliably so you can then return a future from this method (I agree that would be desired).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for @westonpace's suggestion.
In addition, what if WillNeedRowGroups (w/ or w/o same inputs) has been called more than once? Maintaining the state is rather tricky according to my experience. If the new function only issues I/O hints to the RandomAccessFile, probably it is much easier to reason about the behavior directly from RandomAccessFile.
|
Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍 |
Rationale for this change
Current implementation of arrow::FileReader::ReadRowGroups() has sync interface. It complicates use in environments where additional threads are undesirable. Splitting this method into 2 parts will fix it. Details and usage examples are inside issue description.
What changes are included in this PR?
Two new methods in arrow::FileReader class
Are these changes tested?
Changes were tested actively in our private repository.
Are there any user-facing changes?
No changes to the current functionality. PR is simple enough and expects no regression.