Description
Describe the enhancement requested
When reading Parquet files from table formats such as Delta Lake, the file sizes are already known from the table format's metadata. However, when building a dataset from fragments using https://arrow.apache.org/docs/python/generated/pyarrow.dataset.FileFormat.html#pyarrow.dataset.FileFormat.make_fragment, there is no way to pass those file sizes to PyArrow, which leads to unnecessary HEAD requests in the case of S3. Arrow already supports specifying the file size to avoid these requests to S3, but as far as I can see this is not exposed in PyArrow: #7547
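To illustrate the intent, here is a toy sketch (the classes and the `get_file_size` method are hypothetical, for illustration only, and are not PyArrow APIs): when a size is already known from table-format metadata, the size query can be answered locally and no HEAD-style request ever reaches the object store.

```python
class CountingFS:
    """Toy stand-in for an S3-like filesystem that counts metadata
    lookups (each call models one HEAD request). Hypothetical class,
    not part of PyArrow."""

    def __init__(self, sizes):
        self._sizes = sizes        # path -> size, as S3 would report it
        self.head_requests = 0

    def get_file_size(self, path):
        self.head_requests += 1    # one simulated HEAD request
        return self._sizes[path]


class SizeCachingFS:
    """Wrapper that answers size queries from table-format metadata
    (e.g. Delta Lake) and only falls back to the inner filesystem
    when the size is unknown. Hypothetical class, not part of PyArrow."""

    def __init__(self, inner, known_sizes):
        self._inner = inner
        self._known = dict(known_sizes)

    def get_file_size(self, path):
        if path in self._known:
            return self._known[path]   # served from metadata: no HEAD request
        return self._inner.get_file_size(path)


s3 = CountingFS({"part-0.parquet": 1024, "part-1.parquet": 2048})
fs = SizeCachingFS(s3, {"part-0.parquet": 1024, "part-1.parquet": 2048})

sizes = [fs.get_file_size(p) for p in ("part-0.parquet", "part-1.parquet")]
print(sizes, s3.head_requests)   # -> [1024, 2048] 0
```

The proposed `make_fragment` enhancement is the same idea applied inside Arrow's datasets layer: let the caller hand over sizes it already has so the S3 filesystem never needs to look them up.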
(As a side note, it seems that those HEAD requests in S3FileSystem are always executed on the same thread, which leads to poor concurrency when reading multiple files. Is this a known issue?)
I can try to put together a PR with some kind of an implementation.
Component(s)
Parquet, Python