ARROW-14429: [C++] Speed up IPC file reader on high-latency filesystems #11535
I tested this with minio and toxiproxy. Median times are given below. Three methods are compared: iterating through all record batches, iterating through all batches using the generator (which also uses coalescing), and using Datasets (async scanner) to read the data as a table.
@lidavidm can we have conbench benchmarks for this case to avoid regressions?
Good point - I'll add them when I get a chance. (Probably I'll artificially add delay in-process to keep the benchmark simple.)
I've added a unit test that counts the number of read operations, instead of a benchmark, since that's a more reliable metric to track for this instance.
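To illustrate why a read count is a stable metric, here is a minimal sketch of the idea. It is not the test from this PR: `CountingFile`, `ReadChunksSeparately`, and `ReadChunksCoalesced` are hypothetical names invented for this example, and the "file" is just an in-memory string. The count depends only on how the reader issues requests, not on network timing.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical stand-in for a random-access file; not part of Arrow's API.
class CountingFile {
 public:
  explicit CountingFile(std::string data) : data_(std::move(data)) {}

  // Copies up to `nbytes` starting at `offset` into `out` and returns the
  // number of bytes copied. Every call counts as one read operation.
  int64_t ReadAt(int64_t offset, int64_t nbytes, char* out) {
    assert(offset >= 0 && nbytes >= 0);
    ++num_reads_;
    const int64_t size = static_cast<int64_t>(data_.size());
    if (offset >= size) return 0;
    const int64_t n = std::min(nbytes, size - offset);
    std::memcpy(out, data_.data() + offset, static_cast<size_t>(n));
    return n;
  }

  int64_t num_reads() const { return num_reads_; }

 private:
  std::string data_;
  int64_t num_reads_ = 0;
};

// Reads `count` fixed-size chunks one at a time: one read operation per chunk.
std::string ReadChunksSeparately(CountingFile* file, int64_t chunk_size,
                                 int64_t count) {
  std::string out(static_cast<size_t>(chunk_size * count), '\0');
  for (int64_t i = 0; i < count; ++i) {
    file->ReadAt(i * chunk_size, chunk_size,
                 &out[static_cast<size_t>(i * chunk_size)]);
  }
  return out;
}

// Reads the same byte range with a single, coalesced read operation.
std::string ReadChunksCoalesced(CountingFile* file, int64_t chunk_size,
                                int64_t count) {
  std::string out(static_cast<size_t>(chunk_size * count), '\0');
  file->ReadAt(0, chunk_size * count, &out[0]);
  return out;
}
```

A test built on this would assert on `num_reads()` after each strategy; on a high-latency filesystem each read operation costs roughly one round trip, so fewer reads is the property worth guarding.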
westonpace left a comment
Nice optimizations.
cpp/src/arrow/ipc/message.cc (outdated)
Nit: metadata is a slightly inaccurate name now.
cpp/src/arrow/ipc/reader.cc (outdated)
If this condition is false, would it be faster to read just the remaining portion instead of rereading part of the file?
I've updated this to just read the missing part of the footer.
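For readers following along, the approach can be sketched roughly as below. This is not the code in reader.cc: `FakeFile` and `ReadFooter` are hypothetical names, and the footer is treated as opaque bytes. The only format detail relied on is that an Arrow IPC file ends with the footer, then an int32 little-endian footer length, then the magic string `ARROW1`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical in-memory "file" that records how much was requested.
struct FakeFile {
  std::string data;
  int64_t num_reads = 0;
  int64_t bytes_read = 0;

  std::string ReadAt(int64_t offset, int64_t nbytes) {
    assert(offset >= 0 && nbytes >= 0 &&
           offset + nbytes <= static_cast<int64_t>(data.size()));
    ++num_reads;
    bytes_read += nbytes;
    return data.substr(static_cast<size_t>(offset),
                       static_cast<size_t>(nbytes));
  }
};

constexpr char kMagic[] = "ARROW1";
constexpr int64_t kMagicSize = 6;
constexpr int64_t kTrailerSize = 4 + kMagicSize;  // int32 length + magic

// Returns the raw footer bytes, or an empty string if the trailer is invalid.
// `initial_read` is how many bytes of the file's tail to fetch speculatively.
std::string ReadFooter(FakeFile* file, int64_t initial_read) {
  const int64_t file_size = static_cast<int64_t>(file->data.size());
  if (file_size < kTrailerSize) return "";
  initial_read = std::min(std::max(initial_read, kTrailerSize), file_size);

  // Speculative read: hope the whole footer fits in the tail we fetch.
  std::string tail = file->ReadAt(file_size - initial_read, initial_read);

  const size_t magic_pos = tail.size() - static_cast<size_t>(kMagicSize);
  if (tail.compare(magic_pos, static_cast<size_t>(kMagicSize), kMagic) != 0) {
    return "";
  }

  // Decode the little-endian int32 footer length that precedes the magic.
  const unsigned char* p = reinterpret_cast<const unsigned char*>(
      tail.data() + tail.size() - static_cast<size_t>(kTrailerSize));
  const uint32_t raw = static_cast<uint32_t>(p[0]) |
                       (static_cast<uint32_t>(p[1]) << 8) |
                       (static_cast<uint32_t>(p[2]) << 16) |
                       (static_cast<uint32_t>(p[3]) << 24);
  const int64_t footer_len = static_cast<int32_t>(raw);
  const int64_t needed = footer_len + kTrailerSize;
  if (footer_len <= 0 || needed > file_size) return "";

  if (needed > initial_read) {
    // Fetch only the bytes in front of what is already in memory,
    // rather than rereading the tail.
    std::string missing =
        file->ReadAt(file_size - needed, needed - initial_read);
    tail = missing + tail;
  }
  return tail.substr(tail.size() - static_cast<size_t>(needed),
                     static_cast<size_t>(footer_len));
}
```

When the speculative read is too small, the total bytes fetched equal the footer plus trailer exactly, with no byte read twice; when it is large enough, the footer costs a single read.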
cpp/src/arrow/ipc/message.cc (outdated)
Just to be sure, this does take padding into account?
Hmm. So normally the body_length comes from CheckMetadataAndGetBodyLength which just gets the bodyLength from the Message flatbuffer. The body_length here comes from the FileBlock in the footer, which according to File.fbs should be aligned already. And in writer.cc it looks like we add the padding to the metadata length, so body_length should be OK:
arrow/cpp/src/arrow/ipc/writer.cc
Lines 1234 to 1236 in 16af17c
    // Metadata length must include padding, it's computed by WriteIpcPayload()
    FileBlock block = {position_, 0, payload.body_length};
    RETURN_NOT_OK(WriteIpcPayload(payload, options_, sink_, &block.metadata_length));
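As a rough illustration of the reasoning above (not code from the PR), the sketch below shows why padded lengths in the footer block are enough to fetch a whole message in one read. The `FileBlock` here is a hypothetical local struct with the same three fields as the snippet, and `PaddedLength`, `IsAligned`, and `WholeMessageRange` are helper names made up for this example; the 8-byte alignment is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical local mirror of a footer block record:
// where the message starts, and the lengths of its metadata and body.
struct FileBlock {
  int64_t offset;
  int32_t metadata_length;
  int64_t body_length;
};

// Assumed alignment for this sketch.
constexpr int64_t kAlignment = 8;

// Rounds `nbytes` up to the next multiple of `alignment`.
inline int64_t PaddedLength(int64_t nbytes, int64_t alignment = kAlignment) {
  assert(nbytes >= 0 && alignment > 0);
  return ((nbytes + alignment - 1) / alignment) * alignment;
}

// True if the offset and both lengths already include their padding.
inline bool IsAligned(const FileBlock& block) {
  return block.offset % kAlignment == 0 &&
         block.metadata_length % kAlignment == 0 &&
         block.body_length % kAlignment == 0;
}

struct ByteRange {
  int64_t offset;
  int64_t length;
};

// If both lengths include padding, the body starts right after the metadata,
// so one contiguous range covers the entire message.
inline ByteRange WholeMessageRange(const FileBlock& block) {
  assert(IsAligned(block));
  return {block.offset, block.metadata_length + block.body_length};
}
```

If the recorded lengths did not include padding, the body would begin at `offset + PaddedLength(metadata_length)` rather than `offset + metadata_length`, and a range computed from the raw lengths would come up short.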
Co-authored-by: Weston Pace <weston.pace@gmail.com>
@westonpace I've rebased this, but the interaction with ARROW-12683 leaves something to be desired - your PR might supersede this one.
I'll incorporate these fixes into my PR then.
Closing in favor of #11616.
This implements two minor optimizations for the IPC file reader: