
Perf: cache and serialize partition metadata across sessions #133

@alxmrs

Description


Problem

partition_metadata() in df.py recomputes min/max coordinate bounds for all partitions every time read_xarray_table() is called. For ARCO-ERA5 (732,072 partitions), this adds startup latency on every new session even though the coordinate layout of the dataset never changes.

For remote datasets (GCS/S3), each coordinate access incurs network round-trip latency, which makes the recomputation especially costly.

Proposed API

table = read_xarray_table(
    ds,
    chunks={'time': 1},
    metadata_cache='./era5_meta.parquet'
)
# First call: computes and saves bounds to cache file
# Subsequent calls: loads bounds from cache, skipping 732,072 coordinate reads

The partition bounds are a pure function of the dataset path and the chunk specification, so caching is safe as long as the dataset structure doesn't change.
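Because the bounds depend only on those two inputs, a stable cache key can be derived from them to detect when a cached file applies. A minimal sketch (the helper name and key scheme are hypothetical, not part of the existing df.py API):

```python
import hashlib
import json

def metadata_cache_key(dataset_path: str, chunks: dict) -> str:
    """Derive a stable cache key from the dataset path and chunk spec.

    Serializing with sorted keys makes the key insensitive to dict
    ordering, so {'time': 1, 'lat': 10} and {'lat': 10, 'time': 1}
    produce the same hash.
    """
    payload = json.dumps({"path": dataset_path, "chunks": chunks}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

key1 = metadata_cache_key("gs://bucket/era5.zarr", {"time": 1, "lat": 10})
key2 = metadata_cache_key("gs://bucket/era5.zarr", {"lat": 10, "time": 1})
assert key1 == key2  # chunk-spec ordering doesn't change the key
```

Storing the key alongside the cached bounds would let the loader fall back to recomputation whenever the dataset path or chunk spec changes.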

Storage formats to consider

  1. Parquet sidecar file (efficient, columnar)
  2. JSON sidecar file (human-readable, debuggable)
  3. Zarr consolidated metadata attributes (colocated with dataset)
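As a stdlib-only illustration of option 2, a JSON sidecar could hold one bounds record per partition and be reloaded in later sessions. The record layout below is an assumption for illustration, not the actual df.py schema:

```python
import json
from pathlib import Path

def save_bounds(path: str, bounds: list) -> None:
    """Write per-partition min/max bounds to a JSON sidecar file."""
    Path(path).write_text(json.dumps(bounds))

def load_bounds(path: str):
    """Return cached bounds, or None if the sidecar doesn't exist yet."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else None

# First session: compute bounds (the expensive step), then cache them.
# Hypothetical record shape -- one dict of coordinate bounds per partition.
bounds = [{"partition": 0, "time_min": "1979-01-01", "time_max": "1979-01-01"}]
save_bounds("era5_meta.json", bounds)

# Later session: load from the sidecar and skip the coordinate reads.
assert load_bounds("era5_meta.json") == bounds
```

The same save/load shape carries over to the Parquet option; Parquet mainly wins on file size and columnar reads at ARCO-ERA5 scale (732,072 records), while JSON stays trivially inspectable for debugging.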

Parent: #126


Labels

enhancement (New feature or request)
