Context
perf_tests/open_era5.py currently uses the older read_xarray API, which returns an eager pa.RecordBatchReader. It also only prints the schema, making it impossible to benchmark real query performance.
Current state
```python
# open_era5.py
era5_ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/...",
    chunks={"time": 240, "level": 1},
)
era5_wind_df = xql.read_xarray(
    era5_ds[["u_component_of_wind", "v_component_of_wind"]]
)
print(era5_wind_df.schema)
```

Issues:
- Uses `read_xarray` instead of `read_xarray_table`, losing filter pushdown and lazy partition loading
- Only inspects the schema and never runs a query, so actual query performance cannot be measured
- Does not demonstrate the partition-pruning benefit for time-range queries
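To make the pruning point concrete, here is a minimal stdlib-only sketch (independent of xarray_sql; `make_partitions` and `prune` are hypothetical helpers) of why a time-range predicate lets most partitions be skipped, assuming contiguous 240-hour partitions along the time axis starting 1959-01-01, mirroring `chunks={"time": 240}`:

```python
from datetime import datetime, timedelta

def make_partitions(start, n_parts, steps_per_part=240):
    """Each partition covers [t0, t1) of hourly timestamps."""
    parts = []
    t = start
    for _ in range(n_parts):
        t1 = t + timedelta(hours=steps_per_part)
        parts.append((t, t1))
        t = t1
    return parts

def prune(partitions, lower_bound):
    """Keep only partitions that can contain a timestamp > lower_bound."""
    return [(t0, t1) for (t0, t1) in partitions if t1 > lower_bound]

# 2300 partitions of 240 hourly steps covers roughly 1959 through late 2021.
parts = make_partitions(datetime(1959, 1, 1), n_parts=2300)
kept = prune(parts, datetime(2020, 1, 1))
print(f"scanned {len(kept)} of {len(parts)} partitions")
```

With the predicate `time > '2020-01-01'`, only the tail of the 63-year archive overlaps the filter, so the vast majority of partitions never need to be read. This is the behavior the filter-pushdown path is expected to exercise.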
Proposed update
```python
# open_era5.py
import time

from datafusion import SessionContext
import xarray as xr
import xarray_sql as xql

era5_ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2",
    chunks={"time": 240, "level": 1},
)
wind_ds = era5_ds[["u_component_of_wind", "v_component_of_wind"]]

# read_xarray_table: lazy, supports filter pushdown
table = xql.read_xarray_table(wind_ds, chunks={"time": 240})

ctx = SessionContext()
ctx.register_table("era5_wind", table)

# Filter pushdown: should prune most partitions
t0 = time.perf_counter()
result = ctx.sql(
    "SELECT AVG(u_component_of_wind), AVG(v_component_of_wind) "
    "FROM era5_wind WHERE time > '2020-01-01'"
).to_arrow_table()
print(f"Filter query: {time.perf_counter() - t0:.2f}s {result}")
```

Goals
- Serve as a real-world benchmark for ERA5 (the primary large-scale use case)
- Show filter pushdown pruning thousands of partitions
- Compare registration time, filter query time, and full-scan time
- Can be run with `py-spy` to generate flame graphs once GCP auth is available
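For the registration/filter/full-scan comparison, a small stdlib timing helper could keep the three measurements uniform. This is a hypothetical sketch, not part of xarray_sql or the existing script; the names `timed` and `timings` are made up, and the `pass` placeholders stand in for the real calls:

```python
import time
from contextlib import contextmanager

# Collects named wall-clock durations so registration, filtered-query,
# and full-scan times can be reported side by side.
timings = {}

@contextmanager
def timed(name):
    t0 = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - t0

# In the benchmark script, the bodies would be the real operations:
with timed("register"):
    pass  # e.g. ctx.register_table("era5_wind", table)
with timed("filter_query"):
    pass  # e.g. ctx.sql("... WHERE time > '2020-01-01'").to_arrow_table()
with timed("full_scan"):
    pass  # e.g. ctx.sql("SELECT AVG(u_component_of_wind) FROM era5_wind").to_arrow_table()

for name, seconds in timings.items():
    print(f"{name}: {seconds:.2f}s")
```

Recording all three numbers in one dict also makes it trivial to dump them as a single line per run when comparing before/after the `read_xarray_table` switch.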