
Perf: document and support multi-dimensional chunking for pruning on non-time dimensions #132

@alxmrs

Problem

Filter pushdown currently only prunes on the chunking dimension. For ARCO-ERA5 chunked by time, a query like:

SELECT AVG(temperature) FROM ds WHERE lat BETWEEN 30 AND 60

cannot prune any partitions, because lat is not the partition key. All 732,072 time partitions are read even though only ~17% of rows in each partition satisfy the filter (30° of the 180° latitude range, assuming a uniform global grid).

Existing support

block_slices() in df.py already supports multi-dimensional chunking:

chunks = {'time': 1, 'lat': 90, 'lon': 180}  # chunk on all dims

PrunableStreamingTable.prune_partitions() already stores bounds for all chunked dimensions per partition, so it should handle multi-dim metadata correctly today. This still needs to be verified.
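
To make the expected behaviour concrete, here is a minimal, library-independent sketch of bounds-based pruning over multi-dimensional chunks. It is not the code in df.py or PrunableStreamingTable; the helpers chunk_bounds, partition_bounds, and prune are hypothetical names, and the toy 1-degree grid is an assumption chosen to keep the numbers small.

```python
from itertools import product

def chunk_bounds(coords, chunk_size):
    """Yield (lo, hi) coordinate bounds for consecutive chunks of a 1-D coordinate."""
    for start in range(0, len(coords), chunk_size):
        block = coords[start:start + chunk_size]
        yield (min(block), max(block))

def partition_bounds(coords_by_dim, chunks):
    """Return one {dim: (lo, hi)} dict per partition (product of per-dim chunks)."""
    dims = list(chunks)
    per_dim = [list(chunk_bounds(coords_by_dim[d], chunks[d])) for d in dims]
    return [dict(zip(dims, combo)) for combo in product(*per_dim)]

def prune(partitions, filters):
    """Keep partitions whose bounds overlap every (lo, hi) filter interval."""
    def overlaps(bounds, interval):
        return bounds[0] <= interval[1] and interval[0] <= bounds[1]
    return [p for p in partitions
            if all(overlaps(p[dim], rng) for dim, rng in filters.items())]

# Toy grid (assumed sizes, not the real ERA5 grid).
coords = {
    'time': list(range(4)),
    'lat': list(range(-90, 90)),   # 180 points
    'lon': list(range(0, 360)),    # 360 points
}
lat_filter = {'lat': (30, 60)}     # WHERE lat BETWEEN 30 AND 60

# Chunked on time only: every partition spans all latitudes, so none is pruned.
time_only = partition_bounds(coords, {'time': 1, 'lat': 180, 'lon': 360})
time_only_kept = prune(time_only, lat_filter)

# Chunked on all dims: partitions in the southern lat chunk are pruned.
multi_dim = partition_bounds(coords, {'time': 1, 'lat': 90, 'lon': 180})
multi_dim_kept = prune(multi_dim, lat_filter)

print(len(time_only), len(time_only_kept))
print(len(multi_dim), len(multi_dim_kept))
```

A test for multi-dimensional pruning could follow the same shape: build a small dataset, chunk it on several dimensions, and assert that a filter on a non-time dimension reduces the number of partitions scanned.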

Work required

  1. Add documentation and examples showing multi-dim chunking for better pruning
  2. Verify that pruning works correctly with multi-dim metadata (likely already works)
  3. Add tests covering multi-dimensional chunk pruning
  4. Recommend chunk sizes that balance partition count vs. partition size for ERA5-scale datasets
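
As a starting point for item 4, the trade-off can be framed with simple arithmetic. The sketch below is not a recommendation: partition_stats is a hypothetical helper, the time length comes from the partition count quoted above, and the 721 x 1440 lat/lon grid is an assumption about a 0.25° global dataset.

```python
import math

def partition_stats(dim_sizes, chunks):
    """Return (partition count, rows in a full partition) for a chunking scheme.

    Dimensions missing from `chunks` are treated as a single chunk.
    """
    sizes = {d: min(chunks.get(d, n), n) for d, n in dim_sizes.items()}
    n_partitions = math.prod(math.ceil(dim_sizes[d] / sizes[d]) for d in dim_sizes)
    rows_per_partition = math.prod(sizes.values())
    return n_partitions, rows_per_partition

# Assumed ERA5-scale shape: hourly time steps on a 0.25-degree global grid.
dim_sizes = {'time': 732_072, 'lat': 721, 'lon': 1440}

schemes = {
    'time only': {'time': 1},
    'all dims': {'time': 1, 'lat': 90, 'lon': 180},
    'all dims, daily': {'time': 24, 'lat': 90, 'lon': 180},
}
stats = {name: partition_stats(dim_sizes, c) for name, c in schemes.items()}
for name, (n, rows) in stats.items():
    print(f'{name}: {n:,} partitions, {rows:,} rows per full partition')
```

Under these assumptions, chunking every dimension at a single time step multiplies the partition count by 72, so any recommendation will probably need to coarsen the time chunk to keep per-partition metadata manageable.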

Parent: #126

Labels: enhancement
