Skip to content

Perf: update open_era5.py to use read_xarray_table with filter pushdown and lazy loading #139

@alxmrs

Description

@alxmrs

Context

perf_tests/open_era5.py currently uses the older read_xarray API, which returns an eager pa.RecordBatchReader. It also only prints the schema, making it impossible to benchmark real query performance.

Current state

# open_era5.py
era5_ds = xr.open_zarr("gs://gcp-public-data-arco-era5/...", chunks={"time": 240, "level": 1})
era5_wind_df = xql.read_xarray(
    era5_ds[["u_component_of_wind", "v_component_of_wind"]]
)
print(era5_wind_df.schema)

Issues:

  1. Uses read_xarray instead of read_xarray_table — loses filter pushdown and lazy partition loading
  2. Only inspects schema, never runs a query — no way to measure actual query performance
  3. Does not demonstrate the partition pruning benefit for time-range queries

Proposed update

# open_era5.py
import time
from datafusion import SessionContext
import xarray as xr
import xarray_sql as xql

era5_ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2",
    chunks={"time": 240, "level": 1},
)
wind_ds = era5_ds[["u_component_of_wind", "v_component_of_wind"]]

# read_xarray_table: lazy, supports filter pushdown
table = xql.read_xarray_table(wind_ds, chunks={"time": 240})

ctx = SessionContext()
ctx.register_table("era5_wind", table)

# Filter pushdown: should prune most partitions
t0 = time.perf_counter()
result = ctx.sql(
    "SELECT AVG(u_component_of_wind), AVG(v_component_of_wind) "
    "FROM era5_wind WHERE time > '2020-01-01'"
).to_arrow_table()
print(f"Filter query: {time.perf_counter() - t0:.2f}s  {result}")

Goals

  • Serve as a real-world benchmark for ERA5 (the primary large-scale use case)
  • Show filter pushdown pruning thousands of partitions
  • Compare registration time, filter query time, and full-scan time
  • Can be run with py-spy to generate flame graphs once GCP auth is available

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions