Describe the bug
sink_d8's dask backend gives different labels to a single connected sink when it spans more than one tile. The numpy backend gets it right and assigns one label per connected component.
The per-tile connected-component labeling (CCL) inside _run_dask_numpy produces globally unique IDs but never merges equivalent labels across tile boundaries. As a result, a single physical sink that straddles a chunk boundary shows up as two (or more) separate sinks in the output. The dask+cupy path delegates to the dask+numpy path, so it has the same bug.
Reproducer
import numpy as np
import xarray as xr
import dask.array as da
from xrspatial.hydro import sink

flow_dir = np.array([[1.0, 0.0, 0.0, 16.0]], dtype=np.float64)

# numpy: single connected sink at cells (0,1) and (0,2)
agg_np = xr.DataArray(
    flow_dir, dims=['y', 'x'],
    coords={'y': [0.0], 'x': [0., 1., 2., 3.]},
    attrs={'res': (1.0, 1.0)},
)

# dask with chunks (1, 2): split between cols 1 and 2
dk = da.from_array(flow_dir, chunks=(1, 2))
agg_dk = xr.DataArray(
    dk, dims=['y', 'x'],
    coords={'y': [0.0], 'x': [0., 1., 2., 3.]},
    attrs={'res': (1.0, 1.0)},
)

print(sink(agg_np).data)            # [[nan 2. 2. nan]] - same label
print(sink(agg_dk).compute().data)  # [[nan 2. 3. nan]] - different labels
Expected behavior
Dask labels should match numpy labels for any chunking. Two cells in the same connected sink should always share a label, regardless of how the raster is partitioned.
Test gap
The dask tests in xrspatial/hydro/tests/test_sink_d8.py only cover sinks that fit inside a single tile (test_dask_isolated_sinks) or check that NaN positions match across backends (test_dask_nan_positions). Nothing exercises a sink whose connected component crosses a tile boundary, which is why CI missed this.
Fix sketch
Run a union-find pass after per-tile CCL: walk each shared tile edge, record an equivalence whenever two adjacent cells are both sinks, then remap labels to their roots. xrspatial/sieve.py's _label_connected already uses union-find for in-tile CCL, so the pattern is familiar in this codebase. The dask streaming behavior stays intact.
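The merge step described above could look roughly like the sketch below. It is numpy-only and illustrative, not the xrspatial implementation: the names merge_labels_across_edges, row_bounds and col_bounds are hypothetical, 8-connectivity is an assumption, and for brevity it reads the fully assembled label array, whereas a streaming version would only need the boundary rows and columns of each tile.

```python
import numpy as np


def _find(parent, x):
    """Return the root label of x, compressing the path along the way."""
    parent.setdefault(x, x)
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def merge_labels_across_edges(labels, row_bounds=(), col_bounds=()):
    """Merge per-tile labels that touch across tile edges (hypothetical helper).

    labels     : 2D float array, NaN for non-sink cells, labels unique per tile.
    row_bounds : index of the first row of every tile after the first.
    col_bounds : index of the first column of every tile after the first.
    Assumes 8-connectivity, so diagonal neighbours across an edge also merge.
    """
    nrows, ncols = labels.shape
    parent = {}

    def link(r0, c0, r1, c1):
        # Record an equivalence if both cells exist and both are sinks.
        if not (0 <= r1 < nrows and 0 <= c1 < ncols):
            return
        a, b = labels[r0, c0], labels[r1, c1]
        if np.isnan(a) or np.isnan(b):
            return
        ra, rb = _find(parent, float(a)), _find(parent, float(b))
        if ra != rb:
            # Keep the smaller label as the root.
            parent[max(ra, rb)] = min(ra, rb)

    # Vertical tile edges: column c - 1 on the left, column c on the right.
    for c in col_bounds:
        for r in range(nrows):
            for dr in (-1, 0, 1):
                link(r, c - 1, r + dr, c)

    # Horizontal tile edges: row r - 1 above, row r below.
    for r in row_bounds:
        for c in range(ncols):
            for dc in (-1, 0, 1):
                link(r - 1, c, r, c + dc)

    # Remap every label that took part in a merge to its root.
    out = labels.copy()
    for lab in list(parent):
        out[labels == lab] = _find(parent, lab)
    return out


# The dask output from the reproducer, with the tile edge at column 2.
tiled = np.array([[np.nan, 2.0, 3.0, np.nan]])
merged = merge_labels_across_edges(tiled, col_bounds=[2])
```

Keeping the smaller label as the root happens to reproduce the numpy result for the reproducer, but the sketch does not guarantee that the numbering matches the numpy backend in general; that would depend on how the numpy path assigns IDs.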
Categories: 5 (Backend Inconsistency).