Fix dask aggregate boundary contamination and clean up cumulative bookkeeping #1477
Merged
Conversation
…1469) Switch the aggregate dask path's overlap to boundary=np.nan so any pad cells the kernels might read are skipped naturally (the kernels already ignore NaN). Compute the depth-driven minimum chunk size up front, combine it with the scale-driven minimum, and call _ensure_min_chunksize once -- removing the wasted first cumulative-array compute and the roundabout chunk-equality recompute branch. Mirror the same change in _run_dask_cupy. Leave the interp dask paths on boundary='nearest' so they keep matching scipy's mode='nearest' semantics that the eager numpy interp path uses. Add tests that pin dask aggregate min/max/median to bit-equal eager numpy for chunk-spanning windows and for arrays with extreme values on the global edges.
Summary
Fixes #1469.
- Switch the aggregate dask path's overlap to `boundary=np.nan` (`cupy.nan` in the cupy mirror). The aggregate kernels already skip NaN inputs and return NaN for empty windows, so the pad is ignored. The existing `_agg_block_np` indexing happens to keep output windows out of the global-edge pad, but the NaN boundary makes that contract explicit instead of relying on it.
- Combine the depth-driven and scale-driven minimum chunk sizes into one `min_size`, call `_ensure_min_chunksize` once, then build the cumulative arrays once. This removes the wasted first compute and the recompute branch that used `data.chunks[0] != tuple(cum_in_y[1:] - cum_in_y[:-1])` as a roundabout chunk-equality check.
- Leave the interp dask paths on `boundary='nearest'` so they keep matching the scipy `mode='nearest'` semantics that the eager numpy interp path uses.

Test plan
- `pytest xrspatial/tests/test_resample.py` passes (68 tests: 62 existing + 6 new).
- New `TestAggregateDaskBoundary` pins dask aggregate min/max/median to bit-equal eager numpy:
  - `test_chunk_spanning_window_bit_identical`: chunks (7, 7) on a 24x24 input force output windows to span chunk boundaries.
  - `test_global_edge_extremes_match_eager`: 999 and -999 on every outer row/column.
  - Covers `average`, `min`, `max`, `median`, `mode`.

Note on the boundary change
The `_agg_block_np` indexing reads from `block[depth_y : depth_y + chunk_size, ...]`, which never touches the global-edge pad. So `boundary='nearest'` was not actually corrupting results in the current code, and the new tests would not have failed pre-fix. The change is still worth making:

- If a future refactor causes `_agg_block_np` to walk further into the block, NaN padding fails safely while `'nearest'` silently biases min/max/median.
- It makes explicit how each `map_overlap` call site in xrspatial sets its boundary.

The bookkeeping cleanup is the substantive fix; the boundary change is defence-in-depth.
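The NaN-boundary idea above can be sketched in isolation. This is a minimal illustration, not xrspatial's actual kernel: `nanmin_3x3` is a hypothetical stand-in for a NaN-skipping aggregate, but the `map_overlap(..., boundary=np.nan)` call is the real dask API the PR switches to. Because the kernel ignores NaN, the pad cells injected at the global edges cannot bias the result, whereas `boundary='nearest'` would duplicate real edge values into the window.

```python
import numpy as np
import dask.array as da

def nanmin_3x3(block):
    # Hypothetical NaN-skipping kernel: 3x3 neighbourhood minimum.
    # The internal NaN pad only affects the outermost ring of the
    # expanded block, which map_overlap trims away.
    padded = np.pad(block, 1, constant_values=np.nan)
    shifts = [padded[i:i + block.shape[0], j:j + block.shape[1]]
              for i in range(3) for j in range(3)]
    return np.nanmin(np.stack(shifts), axis=0)

x = da.from_array(np.arange(36, dtype=float).reshape(6, 6), chunks=(3, 3))
# boundary=np.nan: the overlap pad at the global edges is NaN, so the
# kernel simply skips it instead of reading duplicated 'nearest' values.
result = x.map_overlap(nanmin_3x3, depth=1, boundary=np.nan).compute()

eager = nanmin_3x3(np.arange(36, dtype=float).reshape(6, 6))
assert np.array_equal(result, eager)
```

Internal chunk edges still receive real neighbour data from the overlap; only the global-edge pad is NaN, which is exactly the contract the PR makes explicit.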
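The bookkeeping cleanup can also be sketched. The helper below is hypothetical (the real `_ensure_min_chunksize` lives in xrspatial and may differ in signature and merge strategy); the point it illustrates is the PR's order of operations: combine the depth-driven and scale-driven minimums up front, rechunk once, then build the cumulative arrays once, with no throwaway first compute and no chunk-equality recompute branch.

```python
import numpy as np
import dask.array as da

def ensure_min_chunksize(arr, axis, min_size):
    # Hypothetical stand-in: merge any chunk smaller than min_size into
    # its neighbour in a single pass, then rechunk once.
    merged = []
    for c in arr.chunks[axis]:
        if merged and merged[-1] < min_size:
            merged[-1] += c
        else:
            merged.append(c)
    if len(merged) > 1 and merged[-1] < min_size:
        merged[-2] += merged.pop()
    new_chunks = list(arr.chunks)
    new_chunks[axis] = tuple(merged)
    return arr.rechunk(tuple(new_chunks))

depth_y, scale_y = 2, 3
# Combine both constraints before touching the chunks at all.
min_size = max(2 * depth_y + 1, scale_y)

data = da.ones((20, 20), chunks=((1, 2, 8, 9), 10))
data = ensure_min_chunksize(data, 0, min_size)

# Cumulative chunk boundaries, built exactly once from the final chunks.
cum_in_y = np.concatenate([[0], np.cumsum(data.chunks[0])])
assert all(c >= min_size for c in data.chunks[0])
```

Computing `min_size` first means the chunk layout is final before `cum_in_y` exists, so there is nothing to recompute or re-verify afterwards.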
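Finally, the bit-equality testing strategy can be sketched. This is not the PR's actual test code: `nanmin_3x3` is again a hypothetical NaN-skipping kernel, but the setup mirrors the test plan — chunks of 7 on a 24x24 input so windows span chunk boundaries, 999/-999 on every outer row and column, and exact (not approximate) equality against the eager result.

```python
import numpy as np
import dask.array as da

def nanmin_3x3(block):
    # Hypothetical NaN-skipping 3x3 minimum (see sketch above).
    padded = np.pad(block, 1, constant_values=np.nan)
    shifts = [padded[i:i + block.shape[0], j:j + block.shape[1]]
              for i in range(3) for j in range(3)]
    return np.nanmin(np.stack(shifts), axis=0)

# Extremes on the global edges expose any 'nearest'-style duplication;
# chunks (7, 7) on 24x24 force chunk-spanning windows.
arr = np.arange(24 * 24, dtype=float).reshape(24, 24)
arr[0, :] = arr[-1, :] = 999.0
arr[:, 0] = arr[:, -1] = -999.0

eager = nanmin_3x3(arr)
lazy = da.from_array(arr, chunks=(7, 7)).map_overlap(
    nanmin_3x3, depth=1, boundary=np.nan).compute()

# Bit-equal, not merely allclose: any boundary contamination shows up.
np.testing.assert_array_equal(eager, lazy)
```

Exact equality is the right bar here because min/max/median of uncontaminated inputs involves no floating-point summation, so dask and eager numpy should agree to the bit.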