Labels: enhancement (New feature or request)
Description
Problem
partition_metadata() in df.py recomputes min/max coordinate bounds for all partitions every time read_xarray_table() is called. For ARCO-ERA5 (732,072 partitions), this adds startup latency on every new session even though the coordinate layout of the dataset never changes.
For remote datasets (GCS/S3), each coordinate access incurs a network round trip, making this especially costly.
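To make the cost concrete, here is a hypothetical sketch of the per-partition work that partition_metadata() repeats on every call: one (min, max) pair per chunk of each chunked coordinate. The function name and chunking logic are illustrative only, not the project's actual implementation.

```python
def compute_bounds(coord, chunk_size):
    """Return (min, max) bounds for each chunk of a 1-D coordinate."""
    return [
        (min(coord[i:i + chunk_size]), max(coord[i:i + chunk_size]))
        for i in range(0, len(coord), chunk_size)
    ]

# For ARCO-ERA5 with chunks={'time': 1}, work like this runs once per
# partition, and each access touches remote storage when coordinates
# live on GCS/S3.
hours = list(range(96))           # stand-in for a time coordinate
print(compute_bounds(hours, 24))  # 4 chunks -> 4 (min, max) pairs
```

Multiplied by 732,072 partitions, even a cheap per-partition computation dominates session startup.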
Proposed API
table = read_xarray_table(
    ds,
    chunks={'time': 1},
    metadata_cache='./era5_meta.parquet'
)
# First call: computes and saves bounds to cache file
# Subsequent calls: loads bounds from cache, skipping 732,072 coordinate reads

The partition bounds are a pure function of the dataset path and the chunk specification, so caching is safe as long as the dataset structure doesn't change.
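Since the bounds depend only on those two inputs, a cache key can be derived from them directly. A minimal sketch, assuming a hash-based key; the function name and key format are illustrative, not a proposed API:

```python
import hashlib
import json

def bounds_cache_key(dataset_path: str, chunks: dict) -> str:
    """Derive a stable cache key from the two inputs that determine
    partition bounds: the dataset path and the chunk specification."""
    spec = json.dumps({"path": dataset_path, "chunks": chunks}, sort_keys=True)
    return hashlib.sha256(spec.encode()).hexdigest()

# Same path + chunks -> same key across sessions; any change to either
# invalidates the cache automatically.
key = bounds_cache_key("gs://bucket/era5.zarr", {"time": 1})
```

Storing the key alongside the cached bounds lets the loader detect a stale cache (changed chunking or path) and fall back to recomputation instead of returning wrong bounds.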
Storage formats to consider
- Parquet sidecar file (efficient, columnar)
- JSON sidecar file (human-readable, debuggable)
- Zarr consolidated metadata attributes (colocated with dataset)
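For the sidecar-file options, the load-or-compute flow could look like the sketch below, shown here with the JSON variant for readability; the helper names and the bounds layout are assumptions for illustration, not the project's API:

```python
import json
import os

def load_or_compute_bounds(cache_path, compute):
    """Return cached partition bounds if the sidecar file exists;
    otherwise compute them once and persist them."""
    if os.path.exists(cache_path):        # cache hit: no coordinate reads
        with open(cache_path) as f:
            return json.load(f)
    bounds = compute()                    # cache miss: compute once
    with open(cache_path, "w") as f:
        json.dump(bounds, f)
    return bounds

# Usage: the expensive computation runs only on the first call.
import tempfile
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "era5_meta.json")
    calls = []
    compute = lambda: calls.append(1) or [{"time": [0, 23]}]
    first = load_or_compute_bounds(path, compute)   # computes and writes
    second = load_or_compute_bounds(path, compute)  # reads from cache
    print(first == second, len(calls))              # True 1
```

The Parquet variant would be the same shape with a columnar reader/writer; the Zarr consolidated-metadata option avoids a separate file entirely but requires write access to the dataset.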
Parent: #126