Open
Labels: enhancement (New feature or request)
Description
Problem
Filter pushdown currently only prunes on the chunking dimension. For ARCO-ERA5 chunked by time, a query like:
    SELECT AVG(temperature) FROM ds WHERE lat BETWEEN 30 AND 60

cannot prune any partitions because lat is not the partition key. All 732,072 time partitions are read even though only ~33% of rows in each partition satisfy the filter.
Existing support
block_slices() in df.py already supports multi-dimensional chunking:
    chunks = {'time': 1, 'lat': 90, 'lon': 180}  # chunk on all dims

PrunableStreamingTable.prune_partitions() already stores bounds for all chunked dimensions per partition, so it should handle multi-dim metadata correctly today.
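To make the expected behavior concrete, here is a minimal, self-contained sketch of interval-overlap pruning over per-partition bounds. It is not the project's implementation: `block_bounds` and `overlaps` are hypothetical helpers written for illustration, the coordinate grid is a toy one, and the real `block_slices()` and `prune_partitions()` may represent bounds differently.

```python
from itertools import product


def block_bounds(coords, chunks):
    """Yield one {dim: (min, max)} dict per partition (hypothetical helper).

    coords: dict mapping dim name -> list of coordinate values
    chunks: dict mapping dim name -> chunk length along that dim
    Only dims listed in `chunks` get bounds, mirroring "bounds for all
    chunked dimensions".
    """
    per_dim = []
    for dim, size in chunks.items():
        values = coords[dim]
        spans = []
        for start in range(0, len(values), size):
            block = values[start:start + size]
            # min/max so descending coordinates (e.g. lat 90 -> -90) also work
            spans.append((dim, min(block), max(block)))
        per_dim.append(spans)
    for combo in product(*per_dim):
        yield {dim: (lo, hi) for dim, lo, hi in combo}


def overlaps(bounds, predicate):
    """Return True if a partition may contain rows matching every range."""
    for dim, (want_lo, want_hi) in predicate.items():
        if dim not in bounds:
            continue  # no bounds stored for this dim, so it cannot prune
        lo, hi = bounds[dim]
        if hi < want_lo or lo > want_hi:
            return False
    return True


# Toy grid (an assumption for illustration, far smaller than ERA5).
coords = {
    "time": [0, 1, 2, 3],
    "lat": list(range(90, -91, -10)),  # 19 values, descending
    "lon": list(range(0, 360, 30)),    # 12 values
}
predicate = {"lat": (30, 60)}  # WHERE lat BETWEEN 30 AND 60

multi_dim = list(block_bounds(coords, {"time": 1, "lat": 5, "lon": 6}))
kept_multi = [b for b in multi_dim if overlaps(b, predicate)]

time_only = list(block_bounds(coords, {"time": 1}))
kept_time_only = [b for b in time_only if overlaps(b, predicate)]
```

With time-only chunking every partition survives the lat filter, while the multi-dim layout drops the partitions whose lat range lies entirely outside [30, 60]. A test for the real code path could follow the same shape: build a small dataset, chunk it on several dimensions, and assert on the number of partitions that remain after pruning.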
Work required
- Add documentation and examples showing multi-dim chunking for better pruning
- Verify that pruning works correctly with multi-dim metadata (likely already works)
- Add tests covering multi-dimensional chunk pruning
- Recommend chunk sizes that balance partition count vs. partition size for ERA5-scale datasets
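As a starting point for the chunk-size recommendation, the sketch below computes partition count and maximum rows per partition for a candidate chunking. `partition_stats` is a hypothetical helper, not part of the library. The time length comes from the partition count quoted above; the 721 x 1440 lat/lon grid is an assumption based on ERA5's 0.25 degree resolution and should be checked against the actual dataset.

```python
import math


def partition_stats(dim_sizes, chunks):
    """Return (partition_count, max_rows_per_partition) for a chunking.

    dim_sizes: dict mapping dim name -> number of coordinate values
    chunks: dict mapping dim name -> chunk length; dims that are absent
            are treated as unchunked (one block spanning the whole dim)
    """
    counts = []
    lengths = []
    for dim, n in dim_sizes.items():
        size = min(chunks.get(dim, n), n)
        counts.append(math.ceil(n / size))
        lengths.append(size)
    return math.prod(counts), math.prod(lengths)


# Assumed ERA5-scale dimension sizes (see the note above).
era5 = {"time": 732_072, "lat": 721, "lon": 1440}

time_only = partition_stats(era5, {"time": 1})
issue_example = partition_stats(era5, {"time": 1, "lat": 90, "lon": 180})
daily_blocks = partition_stats(era5, {"time": 24, "lat": 90, "lon": 180})
```

Under these assumptions, the chunking from the example above multiplies the partition count by 72 (9 lat blocks times 8 lon blocks), which is why a coarser time chunk such as 24 may be worth evaluating alongside it.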
Parent: #126