
[Python] Remove redundant S3 call #33972

@Fokko

Description


Describe the enhancement requested

Hey all,

First of all, thanks everyone for working on PyArrow! Really loving it so far. I'm currently working on PyIceberg, which loads an Iceberg table into PyArrow. For those unfamiliar with Apache Iceberg: it is a table format that focuses on huge tables (petabyte size). PyIceberg makes your life easier by taking care of statistics to boost performance, and of all the schema maintenance. For example, if you change the partitioning of an Iceberg table, you don't have to rewrite all the files right away; you can do this incrementally.

Now I'm running into some performance issues, and I noticed that PyArrow makes more requests to S3 than necessary. I went down the rabbit hole and was able to narrow it down to:

import pyarrow.dataset as ds
from pyarrow.fs import S3FileSystem

ONE_MEGABYTE = 1024 * 1024

# MinIO running locally
client_kwargs = {
    "endpoint_override": "http://localhost:9000",
    "access_key": "admin",
    "secret_key": "password",
}
parquet_format = ds.ParquetFileFormat(
    use_buffered_stream=True,
    pre_buffer=True,
    buffer_size=8 * ONE_MEGABYTE,
)
fs = S3FileSystem(**client_kwargs)
with fs.open_input_file("warehouse/wh/nyc/taxis/data/tpep_pickup_datetime_day=2022-04-30/00003-4-89e0ad58-fb77-4512-8679-6f26d8d6ef28-00033.parquet") as fout:
    # First get the fragment to inspect the physical schema of the file
    fragment = parquet_format.make_fragment(fout, None)
    print(f"Schema: {fragment.physical_schema}")
    # Then scan the fragment into an Arrow table
    arrow_table = ds.Scanner.from_fragment(
        fragment=fragment
    ).to_table()

I need the schema first, because a column may have been renamed while the file hasn't been rewritten against the latest schema. The same goes for filtering: if you rename a column and the file still contains the old name, you still want to leverage PyArrow's predicate pushdown so the data isn't loaded into memory at all.
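To illustrate the second point: with the physical schema in hand, the filter can be written against the column name that is actually in the file. The column names below are made up, this is just a sketch of how I'd wire it up:

arrow_table = ds.Scanner.from_fragment(
    fragment=fragment,
    # "fare_amount" is whatever name the file physically uses for the column,
    # even if the current Iceberg schema calls it something else
    columns=["tpep_pickup_datetime", "fare_amount"],
    filter=ds.field("fare_amount") > 10.0,
).to_table()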

Looking at the MinIO logs, I can see that it makes four requests:

  1. A HEAD request to check if the file exists
  2. A request for the last 64 kB of the Parquet file to get the schema
  3. Another request for the same last 64 kB of the Parquet file
  4. A nice beefy 1978578 kB request to fetch the data
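For comparison, getting just the schema by hand only needs the footer. A small sketch, reusing the fs object and file from above, and assuming pyarrow.parquet.read_metadata sticks to the file tail:

import pyarrow.parquet as pq

with fs.open_input_file("warehouse/wh/nyc/taxis/data/tpep_pickup_datetime_day=2022-04-30/00003-4-89e0ad58-fb77-4512-8679-6f26d8d6ef28-00033.parquet") as f:
    # read_metadata only parses the Parquet footer
    metadata = pq.read_metadata(f)
    print(metadata.schema.to_arrow_schema())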

Looking at the tests, we shouldn't fetch the footer twice:

# with default discovery, no metadata loaded
with assert_opens([fragment.path]):
    fragment.ensure_complete_metadata()
assert fragment.row_groups == [0, 1]

# second time -> use cached / no file IO
with assert_opens([]):
    fragment.ensure_complete_metadata()
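If I read that correctly, the parsed footer gets cached on the fragment, so in theory the schema lookup and the scan could share a single footer read. A sketch of what I would expect to happen (same fragment as in my snippet above):

# First call reads the footer once and caches the parsed metadata
fragment.ensure_complete_metadata()
print(fragment.metadata.num_row_groups)

# A second call and the scan should reuse the cached metadata,
# without fetching the footer from S3 again
fragment.ensure_complete_metadata()
arrow_table = ds.Scanner.from_fragment(fragment=fragment).to_table()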

Any thoughts or advice? I went through the code a bit already, but my C++ is a bit rusty.

Component(s)

Python
