@niyue
Contributor

@niyue niyue commented Oct 20, 2021

This PR tries to fix https://issues.apache.org/jira/browse/ARROW-12683 ([C++] Enable fine-grained I/O (coalescing) in IPC reader)

This is my first PR for Arrow, so please forgive my ignorance and let me know about any issues with code format, conventions, etc.
I probably chose the wrong issue for a first contribution: after investigating it for a while, I realized it is more difficult than I expected :(

For now I chose an approach that re-uses as much of the existing ArrayLoader code as possible. To do that, I use a no-op random access file to record the IO and later replay only the necessary read operations. I am not certain this is the best approach for solving this issue. If it doesn't fit, feel free to reject this PR, and please let me know how this should be done so I can give it another try.

Besides passing the unit tests, I manually verified the IO behavior on Linux by watching which file pages were loaded into the page cache. It works largely as I expected; the IO savings vary depending on the specific fields accessed.

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

Contributor Author

@niyue niyue Oct 20, 2021

I introduce an IoRecordedRandomAccessFile class that records the read IO operations performed. It does nothing but save these read operations as <offset, length> pairs in a vector, which is replayed later to do the real IO.
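To make the record-and-replay idea concrete, here is a minimal self-contained sketch. It is not the code from this PR: `ReadRange`, `RecordingFile`, and `Replay` are hypothetical names, and a plain byte vector stands in for the file.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical byte range type for illustration; not Arrow's own type.
struct ReadRange {
  int64_t offset;
  int64_t length;
};

// Hypothetical recorder: performs no I/O and only remembers which ranges
// a caller asked for, so that they can be replayed later.
class RecordingFile {
 public:
  explicit RecordingFile(int64_t file_size) : file_size_(file_size) {}

  // Record the request, clamped to the file size. No bytes are returned.
  void ReadAt(int64_t offset, int64_t length) {
    if (offset < 0 || offset >= file_size_ || length <= 0) return;
    if (length > file_size_ - offset) length = file_size_ - offset;
    ranges_.push_back({offset, length});
  }

  const std::vector<ReadRange>& ranges() const { return ranges_; }

 private:
  int64_t file_size_;
  std::vector<ReadRange> ranges_;
};

// Replay step: copy only the recorded ranges from `source` into a
// zero-initialized body of the same size. Ranges are assumed to lie
// inside `source`, which the clamping in RecordingFile guarantees.
std::vector<uint8_t> Replay(const std::vector<uint8_t>& source,
                            const std::vector<ReadRange>& ranges) {
  std::vector<uint8_t> body(source.size(), 0);
  for (const ReadRange& r : ranges) {
    std::copy(source.begin() + r.offset,
              source.begin() + r.offset + r.length,
              body.begin() + r.offset);
  }
  return body;
}
```

In this sketch the untouched parts of the body stay zeroed, which mirrors the idea that only the buffers of the selected fields need to be read.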

Contributor Author

The recorded read IO operations are replayed here to actually read data from the file into the body.

Contributor Author

Depending on whether included_fields is set in IpcReadOptions, either fields_loader is used here to load each field's buffers or the entire body is loaded.

Contributor Author

ArrayLoader is re-used to load a subset of the fields. The logic is similar to the code that decompresses/constructs each field's array according to the included fields, but here it only uses the no-op random access file to load buffers and does no further processing of the loaded buffer (which is always null for the no-op random access file).

Contributor Author

The fields_loader is passed as a lambda to the bottom layer (message.cc) so that we don't have to duplicate the ArrayLoader code in both message.cc and reader.cc.

Contributor Author

This line of code seems redundant: the context value is shadowed by another local variable below, so I removed this line.

@emkornfield
Contributor

@niyue Thanks for the PR. It looks like the CI is likely highlighting real issues with the PR. Would you mind fixing those?

@niyue niyue changed the title from "Support reading arrow IPC file with fine grained IO" to "ARROW-12683 [C++] Enable fine-grained I/O (coalescing) in IPC reader" Oct 20, 2021
@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@niyue
Contributor Author

niyue commented Oct 20, 2021

> @niyue Thanks for the PR. It looks like the CI is likely highlighting real issues with the PR. Would you mind fixing those?

Sure. Let me try it out.

@niyue niyue force-pushed the feature/fine_grained_io branch from e6f44b2 to 2bcaeb7 Compare October 21, 2021 00:31
@niyue
Contributor Author

niyue commented Oct 21, 2021

@emkornfield I pushed a new commit to try to fix the issues reported by CI, but the new CI run failed because it could not download "MinIO.exe" (probably a temporary network issue in CI). How can I trigger the CI again?

@niyue niyue force-pushed the feature/fine_grained_io branch from 2bcaeb7 to f21831a Compare October 21, 2021 07:34
@niyue
Contributor Author

niyue commented Oct 21, 2021

@emkornfield I think I've fixed the issues reported by CI, could you please help to confirm? Thanks.

@emkornfield
Contributor

Yes, looks like logic issues are fixed. Still one small style/lint issue:

/arrow/cpp/src/arrow/ipc/io_recorded_random_access_file.h:40:  You don't need a ; after a }  [readability/braces] [4]
/arrow/cpp/src/arrow/ipc/io_recorded_random_access_file.h:82:  Could not find a newline character at the end of the file.  [whitespace/ending_newline] [5]
/arrow/cpp/src/arrow/ipc/io_recorded_random_access_file.cc:18:  Include the directory when naming .h files  [build/include_subdir] [4]

@niyue niyue force-pushed the feature/fine_grained_io branch from f21831a to 83a2969 Compare October 22, 2021 05:01
@niyue
Contributor Author

niyue commented Oct 22, 2021

@emkornfield thanks. I pushed a new commit to fix the lint issue.

UPDATE: There is still one lint issue, and I've fixed it in commit 23b8a34

@niyue niyue force-pushed the feature/fine_grained_io branch 2 times, most recently from 23b8a34 to bfe443e Compare October 23, 2021 00:39
Member

@westonpace westonpace left a comment

First, let me apologize for taking so long to get to this, I sort of missed it.
Second, I'm happy this is being looked into, this is definitely something I was hoping we could address as part of 7.0.0.

Generally, I am in approval of this PR. Ideally, I would like a unit test showing that a RecordBatchFileReader truly does not read the entire file (possibly the MockFileSystem could be enhanced to record a total # of bytes read).

IoRecordedRandomAccessFile is a clever way to work around the fact that message.cc does not know about arrays and reader.cc does not know about file I/O. However, it is an odd implementation of io::RandomAccessFile since it doesn't return data. It relies on the fact that the reader is going to make read calls but never look at the data (which, admittedly, should be a pretty safe assumption going forward). I will add that this confusion between reader.cc and message.cc made it difficult to implement the asynchronous version of the record batch file reader (RecordBatchFileReader::GetRecordBatchGenerator, more on that later).

That being said, I am worried about the complexity that is growing here. Reading IPC files should be fairly straightforward. I'm worried that the abstractions in message.cc and reader.cc are causing more work than necessary. Maybe the ArrayLoader can move into message.cc. I don't think this is something we need to tackle in this PR, but maybe we should look at it in the 7.0.0 timeframe still (perhaps as part of ARROW-14429).

If we do go forward with the refactoring then maybe we won't need the complexity of IoRecordedRandomAccessFile.

One last complication. This approach currently does not work for the asynchronous version of the reader. As I mentioned above the abstractions caused some difficulty and there is some duplication. The asynchronous path ends up at IpcFileRecordBatchGenerator::ReadRecordBatch and the messages are all queued up to be read in IpcFileRecordBatchGenerator::operator(). This can be handled in a follow-up. If you want to do that then please create a JIRA ticket for that work so we don't lose track.

@westonpace
Member

@lidavidm Thoughts on the above?

@lidavidm
Member

Broadly I'm in agreement. I like the approach here but it does seem we will want to 'fuse' some of the layers to get the best implementation. The duplication with the asynchronous path is one such issue.

I agree we will want a unit test to ensure the bytes read is as expected. Additionally, another candidate for a follow-up item is to include the I/O coalescer so that we don't suffer on remote filesystems. (This could also be folded into ARROW-14429; another thing we could fold into there is to make ReadRangeCache not hold on to memory for the use case of linearly scanning a file.)

Also, I think IoRecordedRandomAccessFile should be moved into reader.cc. It has a fairly specific use and I don't see a reason to expose it more broadly, especially since we aren't subjecting it to unit tests separately.

@niyue
Contributor Author

niyue commented Oct 29, 2021

@westonpace

> I would like a unit test showing that a RecordBatchFileReader truly does not read the entire file

Sure. Let me see how I can add more tests for it.

> Maybe the ArrayLoader can move into message.cc

I considered this approach as well, but I found it would introduce more changes to the existing reader/message APIs, and I am not sure that is desirable. It seems to me the current implementation keeps all concepts related to Arrow's structures in reader.cc, while all flatbuffers structures are kept in the lower-layer message.cc file. ArrayLoader involves quite a lot of Arrow structures, and I am not familiar with some of them, so for now I have tried to follow the current organization to make it work.

> This approach currently does not work for the asynchronous version of the reader... This can be handled in a follow-up.
> If you want to do that then please create a JIRA ticket for that work so we don't lose track.

I didn't realize this previously, since in my project I only use the sync version of the reader. I will look into it later. Since ARROW-12683 is not specific to the sync version of the reader, if this PR is accepted, I think we can probably close ARROW-12683 and I will create a JIRA issue to track the async version of the reader enhancement as a follow-up. What do you think?

@niyue
Contributor Author

niyue commented Oct 29, 2021

@lidavidm

> we will want a unit test to ensure the bytes read is as expected

Sure. I will look into how more unit tests can be added.

> Additionally, another candidate for a follow-up item is to include the I/O coalescer so that we don't suffer on remote filesystems.

In my testing on Linux, I found that Linux performs read-ahead IO. In my limited testing, depending on the read-ahead configuration, the IO may be 2x the minimum necessary if the access pattern is random and the persisted record batches are small. I haven't looked into how S3FileSystem handles this, but even on a local file system, posix_fadvise is desirable for advising the operating system of the access pattern. Currently, file.cc has some support for POSIX_FADV_WILLNEED; it would be great if other patterns could be supported there, but this is likely another independent area to improve.
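For reference, advising a different access pattern could look roughly like the sketch below. This is only an illustration of the POSIX call on an already open descriptor, not a proposal for how file.cc should do it; `AdviseRandomAccess` is a hypothetical helper name.

```cpp
#include <fcntl.h>

// Hypothetical helper: tell the kernel that the whole file behind `fd`
// will be accessed in random order, which typically reduces read-ahead.
// posix_fadvise returns 0 on success and an error number on failure
// (it does not set errno). A length of 0 means "until the end of file".
int AdviseRandomAccess(int fd) {
  return posix_fadvise(fd, /*offset=*/0, /*len=*/0, POSIX_FADV_RANDOM);
}
```

The advice is only a hint, so the kernel is free to ignore it, and any gain would need to be measured for the workload in question.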

> I think IoRecordedRandomAccessFile should be moved into reader.cc

No problem. I will move it into reader.cc.

@westonpace
Member

> ArrayLoader involves quite a lot of Arrow structures, and I am not familiar with some of them, so for now I have tried to follow the current organization to make it work.

Ok. That is fine. Thank you for considering.

> I think we can probably close ARROW-12683 and I will create a JIRA issue to track the async version of the reader enhancement as a follow-up. What do you think?

Sounds great.

> In my testing on Linux, I found that Linux performs read-ahead IO...

I did some testing with POSIX_FADV_WILLNEED and didn't ever see much benefit over Linux's builtin readahead.

> I haven't looked into how S3FileSystem handles this

It does not currently handle this. We get pretty poor performance with the IPC reader on S3 because there is no readahead / batching (and there is a high latency per request). Handling this at the filesystem level is an interesting thought. The challenge will be that the filesystem is parallel so we sometimes want to allow multiple reads (instead of queuing and plugging/merging) but the filesystem doesn't know the access pattern. Maybe we can still come up with a good strategy. We have ARROW-14429 for this already so no need to solve this problem right now.
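To illustrate the merging part of that idea, here is a simplified sketch of coalescing read ranges: nearby ranges are combined so that a high-latency filesystem sees fewer requests. It is not Arrow's coalescing implementation; `ReadRange`, `CoalesceRanges`, and the `hole_size_limit` parameter are names assumed for this example.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical byte range type for illustration.
struct ReadRange {
  int64_t offset;
  int64_t length;
};

// Sort the ranges by offset and merge any range that overlaps the previous
// one or starts at most `hole_size_limit` bytes after it ends. The bytes in
// the hole are read and discarded in exchange for one fewer request.
std::vector<ReadRange> CoalesceRanges(std::vector<ReadRange> ranges,
                                      int64_t hole_size_limit) {
  std::sort(ranges.begin(), ranges.end(),
            [](const ReadRange& a, const ReadRange& b) {
              return a.offset < b.offset;
            });
  std::vector<ReadRange> merged;
  for (const ReadRange& r : ranges) {
    if (r.length <= 0) continue;  // skip empty requests
    if (!merged.empty()) {
      ReadRange& last = merged.back();
      const int64_t last_end = last.offset + last.length;
      if (r.offset - last_end <= hole_size_limit) {
        const int64_t new_end = std::max(last_end, r.offset + r.length);
        last.length = new_end - last.offset;
        continue;
      }
    }
    merged.push_back(r);
  }
  return merged;
}
```

A larger `hole_size_limit` trades extra bytes read for fewer round trips, which is usually the right trade on object stores and the wrong one on fast local disks.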

@niyue niyue force-pushed the feature/fine_grained_io branch from bfe443e to 6e8c665 Compare November 1, 2021 13:13
@lidavidm
Member

lidavidm commented Nov 1, 2021

Sorry, so just to be clear:

I think the issues are because ARROW_FILESYSTEM needs to be ON. You shouldn't need to mess with ARROW_EXPORT or anything (I commented that before seeing the rest of what had been added). However, instead of all that, I think wrapping RandomAccessFile is easier than using the whole filesystem machinery.

@niyue niyue force-pushed the feature/fine_grained_io branch 2 times, most recently from 8b8f157 to 42094f9 Compare November 1, 2021 23:49
Contributor Author

@lidavidm I copied TrackedRandomAccessFile into this PR and track the read ranges using a vector. Since num_reads is just the length of this vector, I removed the read_ member variable in https://github.com/apache/arrow/pull/11535/files#diff-900c46995b5706697d6e4b010f610f1a1cf27d4d865afe48de0a800830ac676bL1708
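The tracking idea can be sketched as follows. This is a standalone illustration over an in-memory buffer, not the TrackedRandomAccessFile from the PR; `TrackedBuffer` and its members are hypothetical names.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical byte range type for illustration.
struct ReadRange {
  int64_t offset;
  int64_t length;
};

// Hypothetical wrapper around an in-memory buffer that records every read
// range, so a test can assert how many reads and bytes were requested.
class TrackedBuffer {
 public:
  explicit TrackedBuffer(std::vector<uint8_t> data) : data_(std::move(data)) {}

  // Copy up to `nbytes` starting at `offset` into `out` and record the
  // range that was actually served. Returns the number of bytes copied.
  int64_t ReadAt(int64_t offset, int64_t nbytes, uint8_t* out) {
    const int64_t size = static_cast<int64_t>(data_.size());
    if (offset < 0 || offset >= size || nbytes <= 0) return 0;
    const int64_t n = std::min(nbytes, size - offset);
    std::memcpy(out, data_.data() + offset, static_cast<size_t>(n));
    reads_.push_back({offset, n});
    return n;
  }

  // The number of reads is simply the number of recorded ranges.
  int64_t num_reads() const { return static_cast<int64_t>(reads_.size()); }

  // Total bytes served across all recorded reads.
  int64_t bytes_read() const {
    int64_t total = 0;
    for (const ReadRange& r : reads_) total += r.length;
    return total;
  }

  const std::vector<ReadRange>& read_ranges() const { return reads_; }

 private:
  std::vector<uint8_t> data_;
  std::vector<ReadRange> reads_;
};
```

A test can then compare `bytes_read()` with the file size to check that a reader restricted to a subset of fields did not touch the whole file.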

@niyue
Copy link
Contributor Author

niyue commented Nov 1, 2021

@westonpace @lidavidm Instead of using mockfs, I simplified the unit testing by using the TrackedRandomAccessFile suggested by David. There is now no change to the CMake file and no change to the io::BufferReader API. Could you please help review?

BTW, is there any documentation describing how I can run clang-format the way the CI job does? My changes sometimes break the CI lint job, but I have no idea what the recommended approach is for running clang-format locally for this project.

@kou
Member

kou commented Nov 2, 2021

> BTW, is there any documentation describing how I can run clang-format the way the CI job does? My changes sometimes break the CI lint job, but I have no idea what the recommended approach is for running clang-format locally for this project.

Here is the documentation:

Code Style, Linting, and CI
https://arrow.apache.org/docs/developers/cpp/development.html#code-style-linting-and-ci

Member

@westonpace westonpace left a comment

Thanks for adding the tests. This is looking good to me, minus one nit about advancing the position.

@niyue niyue force-pushed the feature/fine_grained_io branch 2 times, most recently from fee5df5 to 1e17dd7 Compare November 2, 2021 03:12
@westonpace
Member

I goofed slightly. @lidavidm contacted me externally and pointed out that only Read should advance the position_ (and not ReadAt). I've submitted a PR to your branch that should address this. Feel free to use that or make the change some other way. Sorry for the mixup.
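The distinction can be shown with a small standalone sketch: a positional ReadAt leaves the stream position alone, while a streaming Read advances it. `InMemoryFile` is a hypothetical class for illustration and is not the code in this PR.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical in-memory file with a stream position.
class InMemoryFile {
 public:
  explicit InMemoryFile(std::string data) : data_(std::move(data)) {}

  // Positional read: returns up to `nbytes` starting at `offset` and
  // never modifies position_.
  std::string ReadAt(int64_t offset, int64_t nbytes) const {
    const int64_t size = static_cast<int64_t>(data_.size());
    if (offset < 0 || offset >= size || nbytes <= 0) return "";
    const int64_t n = std::min(nbytes, size - offset);
    return data_.substr(static_cast<size_t>(offset), static_cast<size_t>(n));
  }

  // Streaming read: reads from position_ and advances it by the number
  // of bytes actually returned.
  std::string Read(int64_t nbytes) {
    std::string out = ReadAt(position_, nbytes);
    position_ += static_cast<int64_t>(out.size());
    return out;
  }

  int64_t Tell() const { return position_; }

 private:
  std::string data_;
  int64_t position_ = 0;
};
```

Keeping ReadAt free of side effects on the position is also what allows positional reads to be issued from several callers without them disturbing each other's stream state.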

@niyue niyue force-pushed the feature/fine_grained_io branch from c22d68b to 49cc30c Compare November 2, 2021 12:49
@niyue
Contributor Author

niyue commented Nov 2, 2021

> I goofed slightly. @lidavidm contacted me externally and pointed out that only Read should advance the position_ (and not ReadAt). I've submitted a PR to your branch that should address this. Feel free to use that or make the change some other way. Sorry for the mixup.

I am the one who really should have done more research on this topic. Thanks so much for the PR. I've merged it and squashed the commits into a single one; please check it out.

Member

@lidavidm lidavidm left a comment

Thanks for doing this. Overall this looks good, I left some feedback on style things.

@niyue niyue force-pushed the feature/fine_grained_io branch from 49cc30c to 515381f Compare November 2, 2021 14:10
Member

@lidavidm lidavidm left a comment

Thanks for fixing things!

I left a couple more comments just based on looking at CI.

@niyue niyue force-pushed the feature/fine_grained_io branch from 515381f to 7ab7ed5 Compare November 2, 2021 23:02
@westonpace
Member

Looks like one last CI formatting thing:

/arrow/cpp/src/arrow/ipc/reader_internal.h:84:  Could not find a newline character at the end of the file.  [whitespace/ending_newline] [5]

…ndom access file to record the read ranges and replay only the necessary read operation.
@niyue niyue force-pushed the feature/fine_grained_io branch from 7ab7ed5 to 6e7bfbc Compare November 3, 2021 03:00
@niyue
Contributor Author

niyue commented Nov 3, 2021

> Looks like one last CI formatting thing:
>
> /arrow/cpp/src/arrow/ipc/reader_internal.h:84:  Could not find a newline character at the end of the file.  [whitespace/ending_newline] [5]

Fixed. For some reason, this formatting issue was not reported by the lint program in the Docker image. I ran it with docker-compose run ubuntu-lint.

@westonpace
Member

Hmm, I'll have to take a look and see what's up there. Some other ways to run lint are ninja lint and archery lint --cpplint. The former requires you to use the ninja generator when you create your build directory. The latter requires you to install archery (which is a pretty helpful tool for a variety of Arrow development tasks).

There is also ninja format and archery lint --cpplint --fix which will apply some formatting changes automatically.

@lidavidm lidavidm changed the title from "ARROW-12683 [C++] Enable fine-grained I/O (coalescing) in IPC reader" to "ARROW-12683: [C++] Enable fine-grained I/O (coalescing) in IPC reader" Nov 3, 2021
@lidavidm
Member

lidavidm commented Nov 3, 2021

@niyue do you have a JIRA account? You can register at https://issues.apache.org/jira/secure/Signup!default.jspa. Then let us know your username and we can assign you the ticket and merge.

@niyue
Contributor Author

niyue commented Nov 3, 2021

@lidavidm my JIRA account is niyue, thanks.

@lidavidm lidavidm closed this in 09b79a1 Nov 3, 2021
@ursabot

ursabot commented Nov 3, 2021

Benchmark runs are scheduled for baseline = 16af17c and contender = 09b79a1. 09b79a1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.51% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.22% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

@lidavidm
Member

lidavidm commented Nov 3, 2021

Congrats on your first contribution! 🎉

@niyue
Contributor Author

niyue commented Nov 3, 2021

@lidavidm
As discussed above, I created ARROW-14577 for tracking the IPC reader async read API on this topic, and I linked it with ARROW-12683.
