Fix dask aggregate boundary contamination and clean up cumulative bookkeeping #1477
Merged
Conversation
…1469) Switch the aggregate dask path's overlap to boundary=np.nan so any pad cells the kernels might read are skipped naturally (the kernels already ignore NaN). Compute the depth-driven minimum chunk size up front, combine it with the scale-driven minimum, and call _ensure_min_chunksize once -- removing the wasted first cumulative-array compute and the roundabout chunk-equality recompute branch. Mirror the same change in _run_dask_cupy. Leave the interp dask paths on boundary='nearest' so they keep matching scipy's mode='nearest' semantics that the eager numpy interp path uses. Add tests that pin dask aggregate min/max/median to bit-equal eager numpy for chunk-spanning windows and for arrays with extreme values on the global edges.
Summary
Fixes #1469.
- Switch the aggregate dask path's overlap to `boundary=np.nan` (`cupy.nan` in the cupy mirror). The aggregate kernels already skip NaN inputs and return NaN for empty windows, so the pad is ignored. The existing `_agg_block_np` indexing happens to keep output windows out of the global-edge pad, but the NaN boundary makes that contract explicit instead of relying on it.
- Combine the depth-driven and scale-driven minimum chunk sizes into one `min_size`, call `_ensure_min_chunksize` once, then build the cumulative arrays once. This removes the wasted first compute and the recompute branch that used `data.chunks[0] != tuple(cum_in_y[1:] - cum_in_y[:-1])` as a roundabout chunk-equality check.
- Leave the interp dask paths on `boundary='nearest'` so they keep matching the scipy `mode='nearest'` semantics that the eager numpy interp path uses.

Test plan
- `pytest xrspatial/tests/test_resample.py` passes (68 tests: 62 existing + 6 new).
- New `TestAggregateDaskBoundary` pins dask aggregate min/max/median to bit-equal eager numpy:
  - `test_chunk_spanning_window_bit_identical`: chunks (7, 7) on a 24x24 input force output windows to span chunk boundaries.
  - `test_global_edge_extremes_match_eager`: 999 and -999 on every outer row/column.
  - Covers `average`, `min`, `max`, `median`, `mode`.

Note on the boundary change
The `_agg_block_np` indexing reads from `block[depth_y : depth_y + chunk_size, ...]`, which never touches the global-edge pad. So `boundary='nearest'` was not actually corrupting results in the current code, and the new tests would not have failed pre-fix. The change is still worth making:

- If a future refactor causes `_agg_block_np` to walk further into the block, NaN padding fails safely while `'nearest'` silently biases min/max/median.
- It makes explicit how each `map_overlap` call site in xrspatial sets its boundary.

The bookkeeping cleanup is the substantive fix; the boundary change is defence-in-depth.
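The NaN-boundary idea above can be sketched in isolation. This is a minimal illustration, not xrspatial's actual kernel: `nanmin_3x3` is a hypothetical stand-in for a NaN-skipping aggregate, but the `map_overlap(..., boundary=np.nan)` call is the real dask API the PR switches to. Because the kernel ignores NaN, the pad cells injected at the global edges cannot bias the result, whereas `boundary='nearest'` would duplicate real edge values into the window.

```python
import numpy as np
import dask.array as da

def nanmin_3x3(block):
    # Hypothetical NaN-skipping kernel: 3x3 neighbourhood minimum.
    # The internal NaN pad only affects the outermost ring of the
    # expanded block, which map_overlap trims away.
    padded = np.pad(block, 1, constant_values=np.nan)
    shifts = [padded[i:i + block.shape[0], j:j + block.shape[1]]
              for i in range(3) for j in range(3)]
    return np.nanmin(np.stack(shifts), axis=0)

x = da.from_array(np.arange(36, dtype=float).reshape(6, 6), chunks=(3, 3))
# boundary=np.nan: the overlap pad at the global edges is NaN, so the
# kernel simply skips it instead of reading duplicated 'nearest' values.
result = x.map_overlap(nanmin_3x3, depth=1, boundary=np.nan).compute()

eager = nanmin_3x3(np.arange(36, dtype=float).reshape(6, 6))
assert np.array_equal(result, eager)
```

Internal chunk edges still receive real neighbour data from the overlap; only the global-edge pad is NaN, which is exactly the contract the PR makes explicit.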
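The bookkeeping cleanup can also be sketched. The helper below is hypothetical (the real `_ensure_min_chunksize` lives in xrspatial and may differ in signature and merge strategy); the point it illustrates is the PR's order of operations: combine the depth-driven and scale-driven minimums up front, rechunk once, then build the cumulative arrays once, with no throwaway first compute and no chunk-equality recompute branch.

```python
import numpy as np
import dask.array as da

def ensure_min_chunksize(arr, axis, min_size):
    # Hypothetical stand-in: merge any chunk smaller than min_size into
    # its neighbour in a single pass, then rechunk once.
    merged = []
    for c in arr.chunks[axis]:
        if merged and merged[-1] < min_size:
            merged[-1] += c
        else:
            merged.append(c)
    if len(merged) > 1 and merged[-1] < min_size:
        merged[-2] += merged.pop()
    new_chunks = list(arr.chunks)
    new_chunks[axis] = tuple(merged)
    return arr.rechunk(tuple(new_chunks))

depth_y, scale_y = 2, 3
# Combine both constraints before touching the chunks at all.
min_size = max(2 * depth_y + 1, scale_y)

data = da.ones((20, 20), chunks=((1, 2, 8, 9), 10))
data = ensure_min_chunksize(data, 0, min_size)

# Cumulative chunk boundaries, built exactly once from the final chunks.
cum_in_y = np.concatenate([[0], np.cumsum(data.chunks[0])])
assert all(c >= min_size for c in data.chunks[0])
```

Computing `min_size` first means the chunk layout is final before `cum_in_y` exists, so there is nothing to recompute or re-verify afterwards.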
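Finally, the bit-equality testing strategy can be sketched. This is not the PR's actual test code: `nanmin_3x3` is again a hypothetical NaN-skipping kernel, but the setup mirrors the test plan — chunks of 7 on a 24x24 input so windows span chunk boundaries, 999/-999 on every outer row and column, and exact (not approximate) equality against the eager result.

```python
import numpy as np
import dask.array as da

def nanmin_3x3(block):
    # Hypothetical NaN-skipping 3x3 minimum (see sketch above).
    padded = np.pad(block, 1, constant_values=np.nan)
    shifts = [padded[i:i + block.shape[0], j:j + block.shape[1]]
              for i in range(3) for j in range(3)]
    return np.nanmin(np.stack(shifts), axis=0)

# Extremes on the global edges expose any 'nearest'-style duplication;
# chunks (7, 7) on 24x24 force chunk-spanning windows.
arr = np.arange(24 * 24, dtype=float).reshape(24, 24)
arr[0, :] = arr[-1, :] = 999.0
arr[:, 0] = arr[:, -1] = -999.0

eager = nanmin_3x3(arr)
lazy = da.from_array(arr, chunks=(7, 7)).map_overlap(
    nanmin_3x3, depth=1, boundary=np.nan).compute()

# Bit-equal, not merely allclose: any boundary contamination shows up.
np.testing.assert_array_equal(eager, lazy)
```

Exact equality is the right bar here because min/max/median of uncontaminated inputs involves no floating-point summation, so dask and eager numpy should agree to the bit.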