-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-12683: [C++] Enable fine-grained I/O (coalescing) in IPC reader #11486
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename pull request title in the following format? or See also: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I introduce a IoRecordedRandomAccessFile class which will record the read IO operations performed, and it does nothing but save these read operations as <offset, length> pair in a vector, and it is replayed later to do the real IO.
cpp/src/arrow/ipc/message.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The recorded read IO operations are replayed here to really reading data from the file into body.
cpp/src/arrow/ipc/message.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on if included_fields are used in IpcReadOptions, here either fields_loader will be used to load each field's buffers or the entire body will be loaded.
cpp/src/arrow/ipc/reader.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ArrayLoader is re-used to load fields subset. This is similar logic as the piece for decompressing/constructing each field's array according to included fields, but here it only uses the no-op random access file to load buffer and does nothing else for processing the loaded buffer (the loaded buffer is always null for the no-op random access file)
cpp/src/arrow/ipc/reader.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fields_loader is passed as lambda to the bottom layer (message.cc) so that we don't have to duplicate the code of ArrayLoader both in message.cc and reader.cc
cpp/src/arrow/ipc/reader.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line of code seem redundant and the context value is shadowed by another local variable below, so I remove this line.
|
@niyue Thanks for the PR. it looks like the CI is likely highlighting real issues with the PR, would you mind fixing those? |
|
|
Sure. Let me try it out. |
e6f44b2 to
2bcaeb7
Compare
|
@emkornfield I pushed a new commit trying to fix the issue reported by CI, but it seems the new running CI job failed because CI failed to download "MinIO.exe" (probably a temporary network issue in CI), how can I trigger the CI again? |
2bcaeb7 to
f21831a
Compare
|
@emkornfield I think I've fixed the issues reported by CI, could you please help to confirm? Thanks. |
|
Yes, looks like logic issues are fixed. Still one small style/lint issue: |
f21831a to
83a2969
Compare
|
@emkornfield thanks. I pushed a new commit to fix the lint issue. UPDATE: There is still one lint issue, and I've fixed it in commit 23b8a34 |
23b8a34 to
bfe443e
Compare
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, let me apologize for taking so long to get to this, I sort of missed it.
Second, I'm happy this is being looked into, this is definitely something I was hoping we could address as part of 7.0.0.
Generally, I am in approval of this PR. Ideally, I would like a unit test showing that a RecordBatchFileReader truly does not read the entire file (possibly the MockFileSystem could be enhanced to record a total # of bytes read).
IoRecordedRandomAccessFile is a clever way to work around the fact that message.cc does not know about arrays and reader.cc does not know about file I/O. However, it is an odd implementation of io::RandomAccessFile since it doesn't return data. It relies on the fact that the reader is going to make read calls but never look at the data (which, admittedly, should be a pretty safe assumption going forward). I will add that this confusion between reader.cc and message.cc made it difficult to implement the asynchronous version of the record batch file reader (RecordBatchFileReader::GetRecordBatchGenerator, more on that later).
That being said, I am worried about the complexity that is growing here. Reading IPC files should be fairly straightforward. I'm worried that the abstractions in message.cc and reader.cc is causing more work than necessary. Maybe the ArrayLoader can move into message.cc. I don't think this is something we need to tackle in this PR, but maybe we should look at it in the 7.0.0 timeframe still (perhaps as part of ARROW-14429).
If we do go forward with the refactoring then maybe we won't need the complexity of IoRecordedRandomAccessFile.
One last complication. This approach currently does not work for the asynchronous version of the reader. As I mentioned above the abstractions caused some difficulty and there is some duplication. The asynchronous path ends up at IpcFileRecordBatchGenerator::ReadRecordBatch and the messages are all queued up to be read in IpcFileRecordBatchGenerator::operator(). This can be handled in a follow-up. If you want to do that then please create a JIRA ticket for that work so we don't lose track.
|
@lidavidm Thoughts on the above? |
|
Broadly I'm in agreement. I like the approach here but it does seem we will want to 'fuse' some of the layers to get the best implementation. The duplication with the asynchronous path is one such issue. I agree we will want a unit test to ensure the bytes read is as expected. Additionally, another candidate for a follow-up item is to include the I/O coalescer so that we don't suffer on remote filesystems. (This could also be folded into ARROW-14429; another thing we could fold into there is to make ReadRangeCache not hold on to memory for the use case of linearly scanning a file.) Also, I think IoRecordedRandomAccessFile should be moved into reader.cc - it has fairly specific use and I don't see a reason to expose it more broadly, especially since we aren't subjecting it to unit tests separately. |
Sure. Let me see how I can add more tests for it.
I considered this approach as well, and I found this will introduce more changes to the existing reader/message APIs and I am not quite sure if this is desirable. It seems to me the current implementation places all concepts about arrow's structures in reader.cc while all flatbuffers structures are kept in the lower layer message.cc file.
I didn't realize this previously since in my project I only use the sync version of the reader. I will look into it later. Since ARROW-12683 is not specific to sync version of the reader, if this PR is accepted, I think probably we can close ARROW-12683 and I will create a JIRA issue to track the async version of the reader enhancement as follow-up. What do you think? |
Sure. I will look into it how more unit tests can be added.
In my test under Linux, I found Linux will do read ahead IO. In my limited testing, depending on read ahead configuration in Linux, the IO may be 2x than the minimum necessary if the access pattern is random access and the persisted record batch is small. I don't look into how
No problem. I will move it into reader.cc. |
Ok. That is fine. Thank you for considering.
Sounds great.
I did some testing with
It does not currently handle this. We get pretty poor performance with the IPC reader on S3 because there is no readahead / batching (and there is a high latency per request). Handling this at the filesystem level is an interesting thought. The challenge will be that the filesystem is parallel so we sometimes want to allow multiple reads (instead of queuing and plugging/merging) but the filesystem doesn't know the access pattern. Maybe we can still come up with a good strategy. We have ARROW-14429 for this already so no need to solve this problem right now. |
bfe443e to
6e8c665
Compare
|
Sorry, so just to be clear: I think the issues are because ARROW_FILESYSTEM needs to be ON. You shouldn't need to mess with ARROW_EXPORT or anything (I commented that before seeing the rest of what had been added). However, instead of all that, I think wrapping RandomAccessFile is easier than using the whole filesystem machinery. |
8b8f157 to
42094f9
Compare
cpp/src/arrow/ipc/read_write_test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lidavidm I copy the TrackedRandomAccessFile into this PR, and tracking the read ranges using a vector, since I think the num_reads is just the length of this vector, I remove the read_ member variable in https://github.com/apache/arrow/pull/11535/files#diff-900c46995b5706697d6e4b010f610f1a1cf27d4d865afe48de0a800830ac676bL1708
|
@westonpace @lidavidm Instead of using mockfs, I simplified the unit testing by using the BTW, is there any documentation describing how I can run the |
Here is the documentation: Code Style, Linting, and CI |
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the tests. This is looking good to me, minus one nit about advancing the position.
fee5df5 to
1e17dd7
Compare
|
I goofed slightly. @lidavidm contacted me externally and pointed out that only |
c22d68b to
49cc30c
Compare
It is me that really should do more research on this topic. Thanks so much for the PR, and I've merged it and squashed the commits into single one, please check it out. |
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this. Overall this looks good, I left some feedback on style things.
49cc30c to
515381f
Compare
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing things!
I left a couple more comments just based on looking at CI.
515381f to
7ab7ed5
Compare
|
Looks like one last CI formatting thing: |
…ndom access file to record the read ranges and replay only the necessary read operation.
7ab7ed5 to
6e7bfbc
Compare
Fixed. For some reason, this format issue was not reported by the lint program in docker image, I ran it like |
|
Hmm, I'll have to take a look and see what's up there. Some other ways to run lint are There is also |
|
@niyue do you have a JIRA account? You can register at https://issues.apache.org/jira/secure/Signup!default.jspa. Then let us know your username and we can assign you the ticket and merge. |
|
Benchmark runs are scheduled for baseline = 16af17c and contender = 09b79a1. 09b79a1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
|
Congrats on your first contribution! 🎉 |
|
@lidavidm |
This PR tries to fix https://issues.apache.org/jira/browse/ARROW-12683 ([C++] Enable fine-grained I/O (coalescing) in IPC reader)
This is my first PR for arrow, please forgive my ignorance and let me know the issues for code format/convention/etc.
And probably I chose a wrong issue as the first problem I want to contribute since after investigating this issue for a while, I realize it is more difficult than I expected :(
Currently I chose an approach that can re-use the current code as much as possible in
ArrayLoader, to do that, I use a no-op random access file to record the IO and replay only the necessary read operation later. But I am not certain if this is the best approach for solving this issue, and if this kind of approach doesn't fit, feel free to reject this PR, and please let me know how this should be done and I can give it another try.Besides passing the unit tests, I verified the IO behavior under Linux manually by watching the file pages loaded in page cache, and it works largely as I expected, and the IO saving varies depending on the specific field to be accessed.