
[Parquet][Python] Potential regression in Parquet parallel reading #38591

@eeroel

Describe the enhancement requested

UPDATE: on closer inspection, this looks more like a bug. What happens:

When calling to_table() on a FileSystemDataset in Python using pyarrow.fs.S3FileSystem:

  • With 02de3c1, one HEAD request and two GET requests are made for each file, and the requests are made concurrently.
  • With current main, two HEAD requests and three GET requests are made for each file. The first HEAD request is also made from the main thread, so the downloads are started sequentially. I would expect to see only one HEAD request; I'm not sure whether the three GET requests are expected due to some change.

Here's an example using 02de3c1, reading a FileSystemDataset with fragment_readahead = 100 and I/O concurrency set to 100. The Y-axis represents files, the X-axis is time in seconds, and each point is the relative start time of a request (HEAD or GET):
Screenshot 2023-11-05 at 18 43 54

With current main (fc8c6b7), the first request for each file appears to be made from the same thread (blue), and notably there are five requests per file.

Screenshot 2023-11-05 at 18 43 44
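One way to check per-file request counts like these is to tally logged requests by file and method. The sketch below assumes each request is available as a (file, method, thread id, start time) tuple. The record layout, the helper name `summarize_requests` and the sample records are all made up for illustration; they are not measurements from this issue.

```python
from collections import Counter, defaultdict


def summarize_requests(records):
    """Count HEAD/GET requests per file and note which thread issued the first one."""
    counts = defaultdict(Counter)
    first = {}  # file -> (earliest start time, thread id)
    for path, method, thread_id, start in records:
        counts[path][method] += 1
        if path not in first or start < first[path][0]:
            first[path] = (start, thread_id)
    return {
        path: {"HEAD": c["HEAD"], "GET": c["GET"], "first_thread": first[path][1]}
        for path, c in counts.items()
    }


# Hypothetical records shaped like the pattern reported for current main:
# two HEAD and three GET requests per file, first request from one thread.
sample = [
    ("a.parquet", "HEAD", "main", 0.00),
    ("a.parquet", "HEAD", "io-1", 0.12),
    ("a.parquet", "GET", "io-1", 0.20),
    ("a.parquet", "GET", "io-1", 0.31),
    ("a.parquet", "GET", "io-2", 0.33),
    ("b.parquet", "HEAD", "main", 0.10),
    ("b.parquet", "HEAD", "io-3", 0.22),
    ("b.parquet", "GET", "io-3", 0.30),
    ("b.parquet", "GET", "io-3", 0.41),
    ("b.parquet", "GET", "io-4", 0.44),
]

summary = summarize_requests(sample)
# A single thread issuing every first request suggests sequential starts.
same_first_thread = len({s["first_thread"] for s in summary.values()}) == 1
```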

See the comment below for a reproducible example.

I'm running on macOS 14.1.

Component(s)

Parquet
