Description
Describe the enhancement requested
Hey all,
First of all, thanks everyone for working on PyArrow! Really loving it so far. I'm currently working on PyIceberg, which loads an Iceberg table into PyArrow. For those unfamiliar with Apache Iceberg: it is a table format that focuses on huge tables (petabyte scale). PyIceberg makes your life easier by taking care of statistics to boost performance, as well as all the schema maintenance. For example, if you change the partitioning of an Iceberg table, you don't have to rewrite all the files right away; you can do this incrementally.
Now I'm running into some performance issues, and I noticed that PyArrow is making more requests to S3 than necessary. I went down the rabbit hole and was able to narrow it down to:
import pyarrow.dataset as ds
from pyarrow.fs import S3FileSystem

ONE_MEGABYTE = 1024 * 1024

client_kwargs = {
    "endpoint_override": "http://localhost:9000",
    "access_key": "admin",
    "secret_key": "password",
}

parquet_format = ds.ParquetFileFormat(
    use_buffered_stream=True,
    pre_buffer=True,
    buffer_size=8 * ONE_MEGABYTE
)

fs = S3FileSystem(**client_kwargs)

with fs.open_input_file("warehouse/wh/nyc/taxis/data/tpep_pickup_datetime_day=2022-04-30/00003-4-89e0ad58-fb77-4512-8679-6f26d8d6ef28-00033.parquet") as fout:
    # First get the fragment
    fragment = parquet_format.make_fragment(fout, None)
    print(f"Schema: {fragment.physical_schema}")
    arrow_table = ds.Scanner.from_fragment(
        fragment=fragment
    ).to_table()

I need the schema first, because it can be that a column got renamed but the file hasn't been rewritten against the latest schema. The same goes for filtering: if you change a column name and the file still has the old name in it, you would like to leverage PyArrow's predicate pushdown so the data isn't loaded into memory at all.
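To make that concrete, here is a minimal sketch of what a renamed column forces us to do per file. The pickup_time name and the predicate are made up for illustration; Scanner.from_fragment accepts columns (as a dict of output name to expression) and filter keywords, so both the projection and the pushdown have to be phrased against the physical name found in the file:

import datetime

import pyarrow.dataset as ds

# Hypothetical rename: the table column is now "pickup_time", but this file
# predates the rename and still stores it as "tpep_pickup_datetime".
old_name = "tpep_pickup_datetime"   # physical name in the file (assumed)
new_name = "pickup_time"            # current logical name in the table (assumed)

arrow_table = ds.Scanner.from_fragment(
    fragment=fragment,
    # Project the old physical column under its new logical name.
    columns={new_name: ds.field(old_name)},
    # Push the predicate down against the physical name so row groups whose
    # statistics don't match can be skipped without reading their data.
    filter=ds.field(old_name) >= datetime.datetime(2022, 5, 1),
).to_table()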
When looking into the MinIO logs, I can see that it makes four requests:
- A HEAD request to check if the file exists
- The last 64 KB of the Parquet file to get the schema
- Another request for the last 64 KB of the Parquet file to get the schema
- A nice beefy 1978578 KB request to fetch the data
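As a side note, the same requests can also be observed from PyArrow's side by raising the AWS SDK log level. A small sketch, assuming it runs before the first S3FileSystem is created in the process (the requests end up in the SDK's own log output):

import pyarrow.fs as fs

# Turn on verbose logging in the bundled AWS SDK so that every S3 request
# PyArrow issues (HEAD, ranged GETs, ...) is logged.
fs.initialize_s3(fs.S3LogLevel.Debug)

s3 = fs.S3FileSystem(
    endpoint_override="http://localhost:9000",
    access_key="admin",
    secret_key="password",
)
# ... run the reproduction above against this filesystem instance.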
Looking at the tests, we shouldn't fetch the footer twice:
# with default discovery, no metadata loaded
with assert_opens([fragment.path]):
    fragment.ensure_complete_metadata()
    assert fragment.row_groups == [0, 1]

# second time -> use cached / no file IO
with assert_opens([]):
    fragment.ensure_complete_metadata()

Any thoughts or advice? I went through the code a bit already, but my C++ is a bit rusty.
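In case it helps frame what I'm after: I would expect to be able to load the footer once and have both the schema inspection and the scan reuse it, roughly like the sketch below (reusing fout and parquet_format from the first snippet; I haven't verified that this actually avoids the second footer read):

# Same setup as above: fout is the opened S3 input file.
fragment = parquet_format.make_fragment(fout, None)

# Read the footer once and cache the metadata on the fragment.
fragment.ensure_complete_metadata()

# Ideally both of these would now be served from the cached metadata,
# without a second 64 KB range request against S3.
print(f"Schema: {fragment.physical_schema}")
arrow_table = ds.Scanner.from_fragment(fragment=fragment).to_table()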
Component(s)
Python