Closed
Description
Apache Iceberg version
None
Query engine
None
Please describe the bug 🐞
I noticed that reading with s3fs is much faster than reading with PyArrow. @rdblue also noticed this and added a buffer in #6283.
First, we want a benchmark to establish a baseline, and then we can check what's going on. We could run it locally against MinIO with audit logging enabled to see how many requests are being made.
I also noticed that we open files using `open_input_file` instead of `open_input_stream`. The latter lets us buffer as well, and it makes much more sense for our use case, where we read through the Avro files sequentially.
With the benchmark in place, we can measure what difference `open_input_stream` makes and decide whether we still need to wrap a buffered reader on our side.
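As a rough illustration of what such a check could count, the sketch below uses only the standard library. `CountingRaw` and `scan` are hypothetical helpers, not part of PyIceberg or PyArrow. The counter stands in for the request count we would otherwise read from the MinIO audit log, and `io.BufferedReader` stands in for a buffered reader wrapped on our side.

```python
import io


class CountingRaw(io.RawIOBase):
    """In-memory raw stream that counts how often the underlying source is hit."""

    def __init__(self, data: bytes) -> None:
        super().__init__()
        self._source = io.BytesIO(data)
        self.calls = 0

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        # Each call here models one request to the backing store.
        self.calls += 1
        return self._source.readinto(buffer)


def scan(reader, chunk_size: int = 512) -> int:
    """Read sequentially in small chunks, as a block-by-block decoder would."""
    total = 0
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            return total
        total += len(chunk)


data = bytes(1024 * 1024)  # 1 MiB of zeros; the content is irrelevant here

# Every small read goes straight to the source.
unbuffered = CountingRaw(data)
assert scan(unbuffered) == len(data)

# The same scan, with a 64 KiB buffer (arbitrary size) in front of the source.
raw = CountingRaw(data)
buffered = io.BufferedReader(raw, buffer_size=64 * 1024)
assert scan(buffered) == len(data)

print(f"unbuffered source reads: {unbuffered.calls}")
print(f"buffered source reads:   {raw.calls}")
```

The buffered scan should hit the source far less often than the unbuffered one. A real benchmark would replace the in-memory source with the actual FileIO implementations and add wall-clock timings.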
