Description
Describe the enhancement requested
When reading Parquet files from table formats such as Delta Lake, the file sizes are already known from the table format's metadata. However, when building a dataset from fragments using https://arrow.apache.org/docs/python/generated/pyarrow.dataset.FileFormat.html#pyarrow.dataset.FileFormat.make_fragment, there is no way to pass those file sizes to PyArrow, which leads to unnecessary HEAD requests in the case of S3. Arrow already supports specifying the file size to avoid these requests to S3, but as far as I can see this is not exposed in PyArrow: #7547
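To illustrate the intent, here is a toy sketch (the classes and the `get_file_size` method are hypothetical, for illustration only, and are not PyArrow APIs): when a size is already known from table-format metadata, the size query can be answered locally and no HEAD-style request ever reaches the object store.

```python
class CountingFS:
    """Toy stand-in for an S3-like filesystem that counts metadata
    lookups (each call models one HEAD request). Hypothetical class,
    not part of PyArrow."""

    def __init__(self, sizes):
        self._sizes = sizes        # path -> size, as S3 would report it
        self.head_requests = 0

    def get_file_size(self, path):
        self.head_requests += 1    # one simulated HEAD request
        return self._sizes[path]


class SizeCachingFS:
    """Wrapper that answers size queries from table-format metadata
    (e.g. Delta Lake) and only falls back to the inner filesystem
    when the size is unknown. Hypothetical class, not part of PyArrow."""

    def __init__(self, inner, known_sizes):
        self._inner = inner
        self._known = dict(known_sizes)

    def get_file_size(self, path):
        if path in self._known:
            return self._known[path]   # served from metadata: no HEAD request
        return self._inner.get_file_size(path)


s3 = CountingFS({"part-0.parquet": 1024, "part-1.parquet": 2048})
fs = SizeCachingFS(s3, {"part-0.parquet": 1024, "part-1.parquet": 2048})

sizes = [fs.get_file_size(p) for p in ("part-0.parquet", "part-1.parquet")]
print(sizes, s3.head_requests)   # -> [1024, 2048] 0
```

The proposed `make_fragment` enhancement is the same idea applied inside Arrow's datasets layer: let the caller hand over sizes it already has so the S3 filesystem never needs to look them up.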
(As a side note, it seems that those HEAD requests in S3FileSystem are always executed on the same thread, which leads to poor concurrency when reading multiple files. Is this a known issue?)
I can try to put together a PR with some kind of an implementation.
Component(s)
Parquet, Python