Python: Improve PyArrow performance #6475

### Apache Iceberg version

None

### Query engine

None

### Please describe the bug 🐞

I noticed that reading through s3fs is much faster than reading through PyArrow. @rdblue also noticed this and added a buffer in https://github.com/apache/iceberg/pull/6283

First, we want a benchmark to establish a baseline, and then investigate what's going on. We could run it locally against MinIO with audit logging enabled to see how many requests are being made.
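A minimal sketch of what such a baseline harness could look like, using only the standard library. The `benchmark` helper and its `read_all` callable are hypothetical names for illustration, not part of PyIceberg, PyArrow, or s3fs; the idea is that the same callable shape can wrap either backend so the timings are comparable.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(read_all: Callable[[], int], repeats: int = 5) -> Dict[str, float]:
    """Run `read_all` several times and summarise the wall-clock durations.

    `read_all` is a hypothetical callable that reads one file end to end
    (for example through s3fs or through PyArrow) and returns the byte count.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")

    durations = []
    bytes_read = 0
    for _ in range(repeats):
        start = time.perf_counter()
        bytes_read = read_all()
        durations.append(time.perf_counter() - start)

    return {
        "bytes": float(bytes_read),
        "min_s": min(durations),
        "median_s": statistics.median(durations),
    }


# Stand-in workload so the sketch runs without any object store.
result = benchmark(lambda: len(b"x" * 1000), repeats=3)
```

Timings alone will not show how many requests hit the object store; that part would still come from the MinIO audit log.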

I also noticed that we open files using `open_input_file` instead of `open_input_stream`. The latter lets us buffer reads, and makes much more sense for our use case, where we read through the Avro files sequentially.

![image](https://user-images.githubusercontent.com/1134248/208970833-36e2fb04-bacb-4a80-bb8d-d71ef150363a.png)

Using the benchmark, we can check how `open_input_stream` behaves differently and whether we still need to wrap a buffered reader on our side.
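As a rough illustration of why wrapping a buffered reader can matter, the sketch below counts how often the underlying source is hit with and without a buffer. It uses only the standard library: `CountingRaw` is a hypothetical in-memory stand-in for a remote file, where each underlying read plays the role of one request, and it says nothing about how PyArrow or s3fs behave internally.

```python
import io


class CountingRaw(io.RawIOBase):
    """In-memory stand-in for a remote file that counts underlying reads."""

    def __init__(self, data: bytes) -> None:
        super().__init__()
        self._inner = io.BytesIO(data)
        self.read_calls = 0

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        # Each call here plays the role of one request to the object store.
        self.read_calls += 1
        return self._inner.readinto(buffer)


def drain(stream, chunk_size: int) -> int:
    """Read `stream` to the end in small chunks and return the byte count."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
    return total


data = b"\x00" * (64 * 1024)

# Small reads go straight to the source.
unbuffered = CountingRaw(data)
unbuffered_total = drain(unbuffered, 64)

# The same small reads, served from a 16 KiB buffer on our side.
raw = CountingRaw(data)
buffered_total = drain(io.BufferedReader(raw, buffer_size=16 * 1024), 64)
```

Comparing `unbuffered.read_calls` with `raw.read_calls` shows the reduction in underlying reads; against MinIO, the equivalent numbers would come from the audit log.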

https://arrow.apache.org/docs/python/generated/pyarrow.fs.S3FileSystem.html#pyarrow.fs.S3FileSystem.open_input_stream
