Python: Fix Avro scan performance in PyArrow #6283

rdblue · 2022-11-27T19:53:30Z

This improves Avro scan performance with PyArrowFileIO by about 10x by buffering scans and updating EOF handling. While testing scan planning, I found that planning against S3 took about 130 seconds. Buffering the input stream and avoiding the len(input_file) call brings the planning time for the same query down to 13 seconds.

The main problem was that the stream provided by PyArrow was not buffered, so nearly every read operation was causing another request to S3.
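To illustrate the effect of buffering, here is a minimal sketch using only the standard library. It is not the change made in this PR: CountingRaw is a hypothetical stand-in for a remote stream where each underlying read would be a separate request, and io.BufferedReader stands in for whatever buffering the FileIO applies.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical stand-in for a remote stream; counts underlying reads."""

    def __init__(self, data: bytes) -> None:
        super().__init__()
        self._inner = io.BytesIO(data)
        self.calls = 0

    def readable(self) -> bool:
        return True

    def readinto(self, b) -> int:
        # Each call here models one round trip to the object store.
        self.calls += 1
        return self._inner.readinto(b)


def read_in_small_chunks(stream, total: int, chunk: int = 4) -> None:
    """Issue many tiny reads, as a decoder does when reading field by field."""
    for _ in range(total // chunk):
        stream.read(chunk)


data = bytes(4096)

# Unbuffered: every small read reaches the underlying stream.
unbuffered = CountingRaw(data)
read_in_small_chunks(unbuffered, len(data))

# Buffered: small reads are served from memory and the underlying
# stream is only touched when the buffer needs refilling.
raw = CountingRaw(data)
buffered = io.BufferedReader(raw, buffer_size=1024)
read_in_small_chunks(buffered, len(data))

print(unbuffered.calls, raw.calls)
```

With the same sequence of small reads, the unbuffered stream is hit once per read, while the buffered one is hit only a handful of times.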

A second issue was that the call to get the file length was expensive, so this also updates the decoder to raise an EOF error (EOFError in Python) rather than checking the file length. This improved performance by about 1 second in my test.
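The idea of ending a scan on an EOF error instead of comparing the position to the file length can be sketched as follows. This is an illustration only: the one-byte length-prefixed record format and the read_record and read_all helpers are hypothetical and are not the Avro container format or pyiceberg's API.

```python
import io
from typing import BinaryIO, List


def read_record(stream: BinaryIO) -> bytes:
    """Read one record from a toy format: a 1-byte length, then the payload."""
    header = stream.read(1)
    if len(header) == 0:
        # The stream is exhausted; no file length lookup was needed.
        raise EOFError("no more records")
    return stream.read(header[0])


def read_all(stream: BinaryIO) -> List[bytes]:
    """Read records until the reader signals EOF."""
    records = []
    while True:
        try:
            records.append(read_record(stream))
        except EOFError:
            break
    return records


stream = io.BytesIO(b"\x02hi\x03abc")
records = read_all(stream)
print(records)
```

The loop never asks how long the input is, so it avoids a separate metadata request when the input lives in an object store.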

rdblue · 2022-11-27T20:38:26Z

python/pyiceberg/avro/decoder.py

        read_bytes = self._input_stream.read(n)
-        if len(read_bytes) != n:
+        read_len = len(read_bytes)
+        if read_len <= 0:


In Python, a stream signals EOF by returning 0 bytes from read, so a read that returns fewer than n bytes is not necessarily EOF.
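A small sketch of why the check looks for 0 bytes rather than a length mismatch. ShortReads and read_exact are hypothetical names for illustration; the actual decoder code beyond the diff above is not shown here, so the loop below is an assumption about how short reads might be handled, not the PR's implementation.

```python
import io


class ShortReads(io.RawIOBase):
    """Hypothetical stream that returns at most 3 bytes per read call."""

    def __init__(self, data: bytes) -> None:
        super().__init__()
        self._inner = io.BytesIO(data)

    def readable(self) -> bool:
        return True

    def readinto(self, b) -> int:
        chunk = self._inner.read(min(len(b), 3))
        b[: len(chunk)] = chunk
        return len(chunk)


def read_exact(stream, n: int) -> bytes:
    """Read exactly n bytes, treating only an empty read as EOF."""
    parts = []
    remaining = n
    while remaining > 0:
        chunk = stream.read(remaining)
        if len(chunk) <= 0:
            raise EOFError(f"EOF with {remaining} bytes still expected")
        parts.append(chunk)
        remaining -= len(chunk)
    return b"".join(parts)


stream = ShortReads(b"abcdefgh")
short = stream.read(8)        # a short read: data remains, so this is not EOF
rest = read_exact(stream, 5)  # keeps reading until 5 bytes are collected
at_eof = stream.read(1)       # empty bytes: this is the EOF signal
print(short, rest, at_eof)
```

A check of the form len(read_bytes) != n would have treated the first short read as a failure even though more data was available.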

rdblue · 2022-11-27T20:38:41Z

python/pyiceberg/avro/file.py

            self.reader = visit(self.schema, ConstructReader())
        else:
-            self.reader = resolve(self.schema, self.read_schema)
+            self.reader = cast(StructReader, resolve(self.schema, self.read_schema))


This fixes a type warning.
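For readers unfamiliar with typing.cast, a minimal sketch of the pattern follows. The Reader and StructReader classes and the resolve_reader function here are hypothetical stand-ins defined for this example, not the pyiceberg classes.

```python
from typing import List, cast


class Reader:
    """Hypothetical base reader type."""


class StructReader(Reader):
    """Hypothetical subclass with a method the base type lacks."""

    def field_names(self) -> List[str]:
        return ["id", "name"]


def resolve_reader() -> Reader:
    # Declared to return the base type, although the value is a StructReader.
    return StructReader()


# cast() tells the type checker to treat the value as a StructReader.
# It performs no runtime check or conversion and returns its argument unchanged.
reader = cast(StructReader, resolve_reader())
fields = reader.field_names()
print(fields)
```

Because cast does nothing at runtime, it silences the warning without verifying the claim; it is only safe when the value is known to have the narrower type.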

Fokko

This is awesome, thanks for figuring this out! 🚀

Python: Fix Avro scan performance.

0b4e455

github-actions bot added the python label Nov 27, 2022

rdblue mentioned this pull request Nov 27, 2022

Update for review comments Fokko/iceberg#331

Merged

rdblue requested a review from Fokko November 27, 2022 20:01

Python: Fix EOF handling.

a680313

rdblue commented Nov 27, 2022

Fokko approved these changes Nov 28, 2022

Fokko merged commit f86f3ee into apache:master Nov 28, 2022

Fokko mentioned this pull request Dec 21, 2022

Python: Improve PyArrow performance #6475

Closed
