
Conversation

@lidavidm
Member

This implements two minor optimizations for the IPC file reader:

  • When reading an IPC message, try to read the entire body and header in one call. (This was already implemented for the async reader, but not the regular synchronous reader.)
  • Like the Parquet reader, try to avoid a second I/O operation when reading the footer by speculatively reading extra data. (Both ideas are sketched below.)
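
A minimal sketch of both ideas, assuming an arrow::io::RandomAccessFile-style source; this is not the PR's actual code, and the speculative read size, helper names, and omission of validation are illustrative only:

#include <algorithm>
#include <cstring>
#include <memory>

#include <arrow/buffer.h>
#include <arrow/io/interfaces.h>
#include <arrow/result.h>

// Illustrative constants: how much to read speculatively, and the size of the
// file trailer (int32 footer length + "ARROW1" magic). Error handling and
// bounds checks are omitted for brevity.
constexpr int64_t kSpeculativeFooterRead = 64 * 1024;
constexpr int64_t kTrailerSize = 4 + 6;

// (1) With the metadata and body lengths known from the footer's FileBlock,
// fetch a record batch message's header and body in a single read.
arrow::Result<std::shared_ptr<arrow::Buffer>> ReadMessageInOneCall(
    arrow::io::RandomAccessFile* file, int64_t offset, int32_t metadata_length,
    int64_t body_length) {
  return file->ReadAt(offset, metadata_length + body_length);
}

// (2) Speculatively over-read the end of the file so that, in the common case,
// the footer is already in memory and no second read is needed.
arrow::Result<std::shared_ptr<arrow::Buffer>> ReadFooter(
    arrow::io::RandomAccessFile* file) {
  ARROW_ASSIGN_OR_RAISE(int64_t file_size, file->GetSize());
  int64_t tail_size = std::min(kSpeculativeFooterRead, file_size);
  ARROW_ASSIGN_OR_RAISE(auto tail, file->ReadAt(file_size - tail_size, tail_size));

  // The 4 little-endian bytes before the magic give the footer's length.
  int32_t footer_length = 0;
  std::memcpy(&footer_length, tail->data() + tail->size() - kTrailerSize,
              sizeof(footer_length));

  if (footer_length + kTrailerSize <= tail->size()) {
    // The footer fits inside the speculative read: no extra I/O call.
    return arrow::SliceBuffer(tail, tail->size() - kTrailerSize - footer_length,
                              footer_length);
  }
  // Fallback for unusually large footers; see the review discussion below for
  // reading only the missing portion instead of rereading the whole footer.
  return file->ReadAt(file_size - kTrailerSize - footer_length, footer_length);
}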

@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@lidavidm
Member Author

I tested this with minio behind toxiproxy, adding a latency toxic with toxiproxy-cli-linux-amd64 toxic add -n latency -t latency --attribute latency=100 s3 (i.e. roughly 100 ms of added latency per request). Now, this is rather unrealistic - it is a lot more latency than you should expect from S3, unless you're doing a cross-region read - but it highlights the cost of I/O in this case.

Median times are given below. Three methods are compared: iterating through all record batches, iterating through all batches using the generator (which also uses coalescing), and using Datasets (async scanner) to read the data as a table.

Baseline:
Iterator: 5.54072s
Generator: 0.560195s
Datasets: 1.39329s

With the IPC message optimization:
Iterator: 2.95526s
Generator: 0.561748s
Datasets: 1.39662s

With the IPC message optimization and the footer optimization:
Iterator: 2.84875s
Generator: 0.456949s
Datasets: 1.08955s

@kszucs
Member

kszucs commented Oct 27, 2021

@lidavidm can we have conbench benchmarks for this case to avoid regressions?

@lidavidm
Member Author

Good point - I'll add them when I get a chance. (Probably I'll artificially add delay in-process to keep the benchmark simple.)

@lidavidm
Member Author

I've added a unit test that counts the number of read operations instead of a benchmark, since that's a more reliable metric to track in this case.
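
For reference, the general shape of such a test as a standalone sketch; the CountingFile wrapper and the asserted count below are hypothetical illustrations, not the helper or values actually used in the change:

#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical counting wrapper: every physical read goes through ReadAt(), so a
// test can drive the IPC reader through it and assert on the exact number of
// I/O operations performed.
struct CountingFile {
  std::function<void(int64_t offset, int64_t length)> read_impl;
  int64_t read_calls = 0;

  void ReadAt(int64_t offset, int64_t length) {
    ++read_calls;
    read_impl(offset, length);
  }
};

int main() {
  CountingFile file{[](int64_t, int64_t) { /* pretend to hit storage */ }};
  file.ReadAt(0, 64 * 1024);     // e.g. a single speculative footer read at open
  assert(file.read_calls == 1);  // regression check: the count must not creep up
}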

@westonpace
Member

Nice optimizations.

Member

Nit: metadata is a slightly inaccurate name now.

Comment on lines +1259 to +1305
Member

If this condition is false, would it be faster to read just the remaining portion instead of rereading part of the file?

Member Author

I've updated this to just read the missing part of the footer.
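
Roughly, continuing the illustrative footer sketch from the description above (tail, footer_length, file_size, and kTrailerSize are the names from that sketch, not from the PR; using arrow::ConcatenateBuffers to stitch the pieces together is likewise just one way to express the idea):

// If the speculative tail read did not cover the whole footer, read only the
// bytes that are missing and stitch the footer back together.
if (footer_length + kTrailerSize > tail->size()) {
  const int64_t missing = footer_length + kTrailerSize - tail->size();
  ARROW_ASSIGN_OR_RAISE(
      auto prefix, file->ReadAt(file_size - tail->size() - missing, missing));
  ARROW_ASSIGN_OR_RAISE(auto whole, arrow::ConcatenateBuffers(
                                        {prefix, tail}, arrow::default_memory_pool()));
  return arrow::SliceBuffer(whole, 0, footer_length);
}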

Member

Just to be sure, this does take padding into account?

Member Author

Hmm. So normally the body_length comes from CheckMetadataAndGetBodyLength, which just gets the bodyLength from the Message flatbuffer. The body_length here comes from the FileBlock in the footer, which according to File.fbs should be aligned already. And in writer.cc it looks like we add the padding to the metadata length, so body_length should be OK:

// Metadata length must include padding, it's computed by WriteIpcPayload()
FileBlock block = {position_, 0, payload.body_length};
RETURN_NOT_OK(WriteIpcPayload(payload, options_, sink_, &block.metadata_length));

@lidavidm
Member Author

lidavidm commented Nov 5, 2021

@westonpace I've rebased this, but the interaction with ARROW-12683 leaves something to be desired - your PR might supersede this one.

@westonpace
Member

I'll incorporate these fixes into my PR then.

@lidavidm
Member Author

lidavidm commented Dec 8, 2021

Closing in favor of #11616.
