@rdblue (Contributor) commented Nov 27, 2022

This improves Avro scan performance when using PyArrowFileIO by about 10x by ensuring that scans are buffered and updating EOF handling. I was testing scan planning and found that S3 planning took about 130 seconds. Buffering the input stream and avoiding the len(input_file) call gets the planning time for the same query to 13 seconds.

The main problem was that the stream provided by PyArrow was not buffered, so nearly every read operation was causing another request to S3.
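To illustrate why buffering matters here, the following is a minimal standard-library sketch, not the PyArrow code path this PR changes. `CountingRaw` is a hypothetical stand-in for an unbuffered remote stream in which each underlying read corresponds to one request; the payload size, read size, and buffer size are arbitrary assumptions.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical unbuffered stream that counts underlying reads ("requests")."""

    def __init__(self, data: bytes) -> None:
        super().__init__()
        self._data = data
        self._pos = 0
        self.calls = 0

    def readable(self) -> bool:
        return True

    def readinto(self, b) -> int:
        # Each call here stands in for one round trip to remote storage.
        self.calls += 1
        chunk = self._data[self._pos:self._pos + len(b)]
        b[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


def small_reads(stream, n_reads: int, size: int) -> None:
    """Issue many small reads, as a decoder does for ints and short strings."""
    for _ in range(n_reads):
        stream.read(size)


payload = bytes(64 * 1024)

# Unbuffered: every small read reaches the underlying stream.
raw_unbuffered = CountingRaw(payload)
small_reads(raw_unbuffered, n_reads=1000, size=8)
unbuffered_calls = raw_unbuffered.calls

# Buffered: small reads are served from an in-memory buffer.
raw_buffered = CountingRaw(payload)
small_reads(io.BufferedReader(raw_buffered, buffer_size=32 * 1024), n_reads=1000, size=8)
buffered_calls = raw_buffered.calls

print(unbuffered_calls, buffered_calls)
```

The buffered variant touches the underlying stream far less often, which is the same effect the PR relies on when each underlying read is an S3 request.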

A second issue was that the call to get the file length was expensive, so this also updates the decoder to throw EOFException rather than checking file length. This improved the performance by about 1 second in my test.
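A rough sketch of the idea of reading until EOF instead of asking for the file length up front. The names `read_int32` and `iter_values` are hypothetical and the fixed-width record format is an assumption made for brevity; this is not the decoder from this PR. It uses Python's built-in `EOFError` and assumes a buffered stream that returns short reads only at end of file.

```python
import io
from typing import BinaryIO, Iterator


def read_int32(stream: BinaryIO) -> int:
    """Read a 4-byte little-endian integer, raising EOFError when the data runs out."""
    data = stream.read(4)
    if len(data) < 4:
        raise EOFError(f"EOF: read {len(data)} bytes, expected 4")
    return int.from_bytes(data, "little")


def iter_values(stream: BinaryIO) -> Iterator[int]:
    """Yield values until the stream is exhausted, without calling len(), seek() or tell()."""
    while True:
        try:
            value = read_int32(stream)
        except EOFError:
            return
        yield value


encoded = b"".join(i.to_bytes(4, "little") for i in (1, 2, 3))
values = list(iter_values(io.BytesIO(encoded)))
print(values)
```

The loop never needs the file size, so no extra metadata request is made just to detect the end of the data. A stricter version would treat a partial record as corruption rather than as a clean end of file.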

  read_bytes = self._input_stream.read(n)
- if len(read_bytes) != n:
+ read_len = len(read_bytes)
+ if read_len <= 0:
Contributor Author

In Python, EOF is signaled by returning 0 bytes.
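As a small illustration of that convention, the sketch below reads exactly `n` bytes, tolerates short reads, and treats a zero-length result as EOF. `read_exactly` is a hypothetical helper written for this example, assuming a blocking binary stream; it is not the decoder method changed in this PR.

```python
import io
from typing import BinaryIO, List


def read_exactly(stream: BinaryIO, n: int) -> bytes:
    """Read exactly n bytes, raising EOFError if the stream ends first."""
    if n < 0:
        raise ValueError(f"Requested {n} bytes, expected a non-negative integer")
    chunks: List[bytes] = []
    remaining = n
    while remaining > 0:
        chunk = stream.read(remaining)
        if len(chunk) == 0:
            # A zero-length read is how a Python stream reports end of file.
            raise EOFError(f"EOF after {n - remaining} of {n} bytes")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


stream = io.BytesIO(b"abcdef")
first = read_exactly(stream, 4)

try:
    read_exactly(stream, 4)  # only 2 bytes remain
    hit_eof = False
except EOFError:
    hit_eof = True
```

A read that returns fewer bytes than requested is not an error by itself, so the loop keeps reading until it has `n` bytes or sees an empty result.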

  self.reader = visit(self.schema, ConstructReader())
  else:
- self.reader = resolve(self.schema, self.read_schema)
+ self.reader = cast(StructReader, resolve(self.schema, self.read_schema))
Contributor Author

This fixes a type warning.
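For context on how `typing.cast` silences this kind of warning, here is a minimal sketch. The `Reader` and `StructReader` classes and the `resolve` function below are simplified, hypothetical stand-ins, not the real pyiceberg definitions.

```python
from typing import cast


class Reader:
    """Hypothetical base type returned by resolve()."""


class StructReader(Reader):
    """Hypothetical subtype that the caller actually expects."""

    def field_count(self) -> int:
        return 0


def resolve() -> Reader:
    # Declared to return the base type, so a type checker cannot assume StructReader.
    return StructReader()


# Without cast, assigning a Reader to a StructReader-typed name is reported by a type checker.
# cast() only informs the checker; it performs no conversion or check at runtime.
reader: StructReader = cast(StructReader, resolve())
count = reader.field_count()
```

Because `cast` does nothing at runtime, it is only safe when the value really is of the narrower type, as it is for a top-level struct schema in this assumed setup.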

@Fokko (Contributor) left a comment

This is awesome, thanks for figuring this out! 🚀

@Fokko Fokko merged commit f86f3ee into apache:master Nov 28, 2022