Closed
Description
Describe the enhancement requested
UPDATE: this is looking more like a bug on closer look. What happens:
When calling to_table() on a FileSystemDataset in Python using pyarrow.fs.S3FileSystem,
- Using 02de3c1, one HEAD request and two GET requests are made for each file. Also the requests are made concurrently.
- With current main, there are two HEAD requests and three GET requests for each file. Also, the first HEAD request is made from the main thread, so the downloads are started sequentially. I would expect to see only one HEAD request; I'm not sure whether the three GETs are expected due to some change.
Here's an example using 02de3c1, reading a FileSystemDataset using fragment_readahead = 100 and io concurrency set to 100; Y-axis represents files and X-axis is time in seconds, and each point is the relative start time of a request (HEAD or GET):

With the current main (fc8c6b7), it seems that the first request for each file is made from the same thread (blue), and notably there are five requests per file.
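For reference, here is a minimal sketch of how per-file request counts like the ones above could be tallied from a request trace. The record layout `(file_key, method, thread_name)`, the file names, and the thread names are all hypothetical and made up for illustration; this is not the tooling used to produce the plots, and the sample records only mirror the pattern described for current main.

```python
from collections import Counter, defaultdict


def summarize_requests(log):
    """Tally HTTP methods per file and record which thread issued the first request.

    `log` is an iterable of (file_key, method, thread_name) tuples in the
    order the requests were started. This layout is an assumption.
    """
    per_file = defaultdict(Counter)
    first_thread = {}
    for file_key, method, thread_name in log:
        per_file[file_key][method] += 1
        # Only the first request seen for a file sets its "first thread".
        first_thread.setdefault(file_key, thread_name)
    return dict(per_file), first_thread


# Hypothetical trace shaped like the behaviour described for current main:
# two HEAD and three GET requests per file, first HEAD from the main thread.
sample_log = [
    ("a.parquet", "HEAD", "main"),
    ("a.parquet", "HEAD", "io-1"),
    ("a.parquet", "GET", "io-1"),
    ("a.parquet", "GET", "io-1"),
    ("a.parquet", "GET", "io-2"),
    ("b.parquet", "HEAD", "main"),
    ("b.parquet", "HEAD", "io-3"),
    ("b.parquet", "GET", "io-3"),
    ("b.parquet", "GET", "io-3"),
    ("b.parquet", "GET", "io-4"),
]

counts, first_thread = summarize_requests(sample_log)
for file_key in sorted(counts):
    total = sum(counts[file_key].values())
    print(file_key, dict(counts[file_key]), "total:", total,
          "first request from:", first_thread[file_key])
```

If every file reports the same thread for its first request, that would be consistent with the downloads being started sequentially rather than concurrently.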
See comment below for reproducible example.
I'm running on macOS 14.1.
Component(s)
Parquet