
Fix dask aggregate boundary contamination and clean up cumulative bookkeeping#1477

Merged
brendancol merged 1 commit into main from issue-1469
May 4, 2026

@brendancol
Contributor

Summary

Fixes #1469.

  • Aggregate dask overlap now uses boundary=np.nan (cupy.nan in the cupy mirror). The aggregate kernels already skip NaN inputs and return NaN for empty windows, so any pad cell they read is ignored. The existing _agg_block_np indexing happens to keep output windows out of the global-edge pad, but the NaN boundary makes that contract explicit instead of leaving it dependent on the indexing.
  • Compute the scale-driven and depth-driven min_size together, call _ensure_min_chunksize once, then build the cumulative arrays once. Removes the wasted first compute and the recompute branch that used data.chunks[0] != tuple(cum_in_y[1:] - cum_in_y[:-1]) as a roundabout chunk-equality check.
  • Interp dask paths stay on boundary='nearest' so they keep matching the scipy mode='nearest' semantics that the eager numpy interp path uses.
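The single-pass bookkeeping in the second bullet can be sketched in plain NumPy. This is an illustration only, not the xrspatial code: merge_small_chunks is a hypothetical stand-in for _ensure_min_chunksize, and combining the two minimums with max() is an assumption about how the constraints are merged. The point it shows is that cumulative edges built after rechunking always difference back to the chunks they came from, so no equality check or recompute is needed.

```python
import numpy as np


def merge_small_chunks(chunks, min_size):
    """Hypothetical stand-in for _ensure_min_chunksize: greedily merge
    neighbouring chunks until each merged chunk is at least min_size."""
    merged, acc = [], 0
    for c in chunks:
        acc += c
        if acc >= min_size:
            merged.append(acc)
            acc = 0
    if acc:
        # Leftover smaller than min_size: fold it into the last chunk.
        if merged:
            merged[-1] += acc
        else:
            merged.append(acc)
    return tuple(merged)


def cumulative_edges(chunks):
    """Return [0, c0, c0 + c1, ...] for the chunk sizes along one axis."""
    return np.concatenate(([0], np.cumsum(chunks)))


chunks_y = (7, 7, 7, 3)   # illustrative input chunking along y
scale_min = 4             # assumed minimum implied by the scale factor
depth_min = 8             # assumed minimum implied by the overlap depth

# Assumption: both constraints must hold, so the larger one wins.
min_size = max(scale_min, depth_min)

# Rechunk once, then build the cumulative edges once.
new_chunks = merge_small_chunks(chunks_y, min_size)
cum_in_y = cumulative_edges(new_chunks)

# Differencing the edges recovers exactly the chunks they were built from.
recovered = tuple(int(d) for d in (cum_in_y[1:] - cum_in_y[:-1]))
```

With these illustrative numbers the four input chunks collapse to (14, 10), and recovered equals new_chunks by construction, which is why the old comparison was only ever detecting a rechunk that had happened after the first cumulative pass.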

Test plan

  • pytest xrspatial/tests/test_resample.py passes (68 tests: 62 existing + 6 new).
  • New TestAggregateDaskBoundary pins dask aggregate min/max/median to bit-equal eager numpy:
    • test_chunk_spanning_window_bit_identical: chunks (7, 7) on a 24x24 input force output windows to span chunk boundaries.
    • test_global_edge_extremes_match_eager: 999 and -999 on every outer row/column.
  • Existing aggregate-parity tests still pass for average, min, max, median, mode.
  • Interp dask parity (nearest, bilinear, cubic) unchanged.
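Why chunks of 7 on a 24-cell axis force windows across chunk boundaries can be checked with a few lines of NumPy. The sketch below is not taken from the test file: the output size of 8 (a 3-cell window) is an assumed value, since the scale factor used by the tests is not stated here, and spanning_windows is a hypothetical helper.

```python
import numpy as np


def spanning_windows(n_in, n_out, chunk):
    """Return indices of output windows that straddle an interior chunk
    boundary along one axis (hypothetical helper for illustration)."""
    edges = np.linspace(0, n_in, n_out + 1)      # window edges in input cells
    boundaries = np.arange(chunk, n_in, chunk)   # interior chunk boundaries
    hits = []
    for i, (start, stop) in enumerate(zip(edges[:-1], edges[1:])):
        # A window spans a boundary if the boundary lies strictly inside it.
        if np.any((boundaries > start) & (boundaries < stop)):
            hits.append(i)
    return hits


# 24 input cells, chunks of 7, assumed 8 output cells (window width 3).
indices = spanning_windows(n_in=24, n_out=8, chunk=7)
```

Under these assumptions the chunk boundaries fall at cells 7, 14 and 21, and the windows covering cells 6 to 9 and 12 to 15 each need data from two chunks, so the dask path cannot produce them without overlap.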

Note on the boundary change

The _agg_block_np indexing reads from block[depth_y : depth_y + chunk_size, ...], which never touches the global-edge pad. So boundary='nearest' was not actually corrupting results in the current code, and the new tests would not have failed pre-fix. The change is still worth making:

  1. It removes a fragile coupling between kernel indexing and the overlap padding choice. If someone later extends _agg_block_np to walk further into the block, NaN padding fails safely while 'nearest' silently biases min/max/median.
  2. It matches how every other map_overlap call site in xrspatial sets its boundary.

The bookkeeping cleanup is the substantive fix; the boundary change is defence-in-depth.
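The failure mode in point 1 can be illustrated with a one-dimensional toy case. This is a sketch under stated assumptions, not the _agg_block_np kernel: np.pad with mode="edge" stands in for the dask boundary='nearest' pad, a constant NaN pad stands in for boundary=np.nan, and the "over-reading" window is a hypothetical future bug in which a kernel starts one cell too early. Median is used because a duplicated edge value visibly shifts it.

```python
import numpy as np

row = np.array([999.0, 1.0, 2.0, 3.0])   # extreme value on the global edge
depth = 1                                # overlap depth on each side

# Two ways to pad the global edge.
pad_nearest = np.pad(row, depth, mode="edge")
pad_nan = np.pad(row, depth, mode="constant", constant_values=np.nan)


def window_median(block, start, stop):
    """NaN-skipping median over block[start:stop], like a NaN-aware kernel."""
    return float(np.nanmedian(block[start:stop]))


# Correct indexing: the window covers the first two real cells only.
expected = window_median(pad_nan, depth, depth + 2)

# Hypothetical over-read: the window starts one cell early, inside the pad.
over_nearest = window_median(pad_nearest, 0, depth + 2)
over_nan = window_median(pad_nan, 0, depth + 2)
```

Here expected is the median of 999 and 1. The 'nearest'-style pad duplicates 999 into the window and pulls the over-read median up to 999 without any error, while the NaN pad is skipped and the over-read result still equals expected.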

…1469)

Switch the aggregate dask path's overlap to boundary=np.nan so any pad
cells the kernels might read are skipped naturally (the kernels already
ignore NaN). Compute the depth-driven minimum chunk size up front,
combine it with the scale-driven minimum, and call _ensure_min_chunksize
once -- removing the wasted first cumulative-array compute and the
roundabout chunk-equality recompute branch.

Mirror the same change in _run_dask_cupy. Leave the interp dask paths
on boundary='nearest' so they keep matching scipy's mode='nearest'
semantics that the eager numpy interp path uses.

Add tests that pin dask aggregate min/max/median to bit-equal eager
numpy for chunk-spanning windows and for arrays with extreme values on
the global edges.
@github-actions (bot) added the performance label (PR touches performance-sensitive code) on May 4, 2026
@brendancol merged commit 8b860d1 into main on May 4, 2026
11 checks passed