[Python] Allow passing file sizes to FileSystemDataset from Python #37857

@eeroel

Description

Describe the enhancement requested

When reading Parquet files from table formats such as Delta Lake, the file sizes are already known from the table-format metadata. However, when building a dataset from fragments with https://arrow.apache.org/docs/python/generated/pyarrow.dataset.FileFormat.html#pyarrow.dataset.FileFormat.make_fragment, there is no way to pass those file sizes to PyArrow, which leads to unnecessary HEAD requests when reading from S3. Arrow already supports specifying the file size to avoid these requests, but as far as I can see this is not exposed in PyArrow: #7547
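
A minimal sketch of what this could look like from Python, assuming a hypothetical `file_size` keyword on `make_fragment` (not available in PyArrow today) and example bucket paths, sizes, and region for illustration:

```python
import pyarrow.dataset as ds
import pyarrow.fs as fs

s3 = fs.S3FileSystem(region="us-east-1")  # example region, for illustration only
fmt = ds.ParquetFileFormat()

# Paths and sizes as they would come from the table-format (e.g. Delta Lake) metadata
files = [
    ("my-bucket/table/part-0000.parquet", 123_456),
    ("my-bucket/table/part-0001.parquet", 234_567),
]

fragments = [
    # Today only the path can be passed, so the size is discovered via a HEAD request.
    # Proposed (hypothetical): fmt.make_fragment(path, filesystem=s3, file_size=size)
    fmt.make_fragment(path, filesystem=s3)
    for path, size in files
]

dataset = ds.FileSystemDataset(
    fragments,
    schema=fragments[0].physical_schema,
    format=fmt,
    filesystem=s3,
)
```
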

(As a side note, it seems that those HEAD requests in S3FileSystem are always executed on the same thread, which leads to poor concurrency when reading multiple files. Is this a known issue?)

I can try to put together a PR with some kind of implementation.

Component(s)

Parquet, Python
