Python: Improve PyArrow performance #6475

@Fokko

Description

Apache Iceberg version

None

Query engine

None

Please describe the bug 🐞

I noticed that reading through s3fs is much faster than reading through PyArrow. @rdblue also noticed this and added a buffer in #6283.

First, we want a benchmark to establish a baseline, and then we can check what's going on. We could run it locally against MinIO with audit logging enabled to see how many requests are being made.

I also noticed that we open files using open_input_file instead of open_input_stream. The latter lets us buffer as well, and it makes much more sense for our use case, where we read through the Avro files sequentially.
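To make the difference concrete, here is a rough sketch of the two ways to open a file. It is not the code we have in the repo: the helper names (read_sequentially, read_random_access, demo_local) are made up for illustration, and demo_local uses pyarrow.fs.LocalFileSystem as a stand-in for S3FileSystem so it runs without a bucket. The 64 KiB buffer_size is an arbitrary value, not a tuned one.

```python
import os
import tempfile


def read_sequentially(fs, path, buffer_size=None, chunk_size=4096):
    """Read a file front to back through fs.open_input_stream.

    open_input_stream returns a forward-only stream. When buffer_size is
    set, PyArrow wraps the stream in a buffered reader.
    """
    total = 0
    with fs.open_input_stream(path, buffer_size=buffer_size) as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def read_random_access(fs, path, chunk_size=4096):
    """Same read loop, but through the seekable handle from open_input_file."""
    total = 0
    with fs.open_input_file(path) as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def demo_local(size=1_000_000):
    """Hypothetical demo: compare both paths on a local temporary file."""
    # Imported lazily so the helpers above work with any filesystem-like object.
    from pyarrow import fs as pafs

    local = pafs.LocalFileSystem()  # stand-in for pafs.S3FileSystem(...)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "manifest.avro")  # made-up file name
        with open(path, "wb") as out:
            out.write(os.urandom(size))
        streamed = read_sequentially(local, path, buffer_size=64 * 1024)
        seekable = read_random_access(local, path)
    return streamed, seekable
```

Both helpers should return the same byte count; what we care about is how many requests each one triggers against the object store, which this local sketch cannot show.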

Using the benchmark, we can measure what difference open_input_stream makes and see whether we need to wrap the stream in a buffered reader on our side.
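As a starting point for that benchmark, something along these lines could work. This is only a sketch using the standard library: CountingRaw is a hypothetical in-memory stand-in for a remote file, and its read_calls counter is a rough proxy for the number of requests we would otherwise read from the MinIO audit log. The chunk and buffer sizes are arbitrary.

```python
import io
import time


class CountingRaw(io.RawIOBase):
    """In-memory stand-in for a remote file that counts low-level reads."""

    def __init__(self, data):
        super().__init__()
        self._inner = io.BytesIO(data)
        self.read_calls = 0

    def readable(self):
        return True

    def readinto(self, buffer):
        # Each call here corresponds to one round trip in the real setup.
        self.read_calls += 1
        return self._inner.readinto(buffer)


def drain(stream, chunk_size):
    """Read the stream to the end in chunk_size pieces, returning total bytes."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
    return total


def benchmark(data, chunk_size=64, buffer_size=None):
    """Time a sequential read, optionally wrapping the raw file in a buffer."""
    raw = CountingRaw(data)
    if buffer_size is None:
        stream = raw
    else:
        stream = io.BufferedReader(raw, buffer_size=buffer_size)
    start = time.perf_counter()
    total = drain(stream, chunk_size)
    elapsed = time.perf_counter() - start
    return {"bytes": total, "read_calls": raw.read_calls, "seconds": elapsed}


if __name__ == "__main__":
    payload = bytes(64 * 1024)
    print("unbuffered:", benchmark(payload))
    print("buffered:  ", benchmark(payload, buffer_size=8192))
```

With small reads like the Avro decoder does, the buffered variant should issue far fewer low-level reads. For the real benchmark we would swap CountingRaw for the PyArrow and s3fs file objects and compare against the audit log.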

https://arrow.apache.org/docs/python/generated/pyarrow.fs.S3FileSystem.html#pyarrow.fs.S3FileSystem.open_input_stream
