sink_d8 dask backend splits connected sinks across tile boundaries #1394

@brendancol

Describe the bug

sink_d8's dask backend gives different labels to a single connected sink when it spans more than one tile. The numpy backend gets it right and assigns one label per connected component.

The per-tile CCL inside _run_dask_numpy produces globally-unique IDs but never merges equivalent labels across tile boundaries. So a single physical sink that straddles a chunk boundary shows up as two (or more) separate sinks in the output. The dask+cupy path delegates to the numpy dask path, so it has the same bug.

Reproducer

import numpy as np
import xarray as xr
import dask.array as da
from xrspatial.hydro import sink

flow_dir = np.array([[1.0, 0.0, 0.0, 16.0]], dtype=np.float64)

# numpy: single connected sink at cells (0,1) and (0,2)
agg_np = xr.DataArray(
    flow_dir, dims=['y', 'x'],
    coords={'y': [0.0], 'x': [0., 1., 2., 3.]},
    attrs={'res': (1.0, 1.0)},
)

# dask with chunks (1, 2): split between cols 1 and 2
dk = da.from_array(flow_dir, chunks=(1, 2))
agg_dk = xr.DataArray(
    dk, dims=['y', 'x'],
    coords={'y': [0.0], 'x': [0., 1., 2., 3.]},
    attrs={'res': (1.0, 1.0)},
)

print(sink(agg_np).data)              # [[nan 2. 2. nan]] - same label
print(sink(agg_dk).compute().data)    # [[nan 2. 3. nan]] - different labels

Expected behavior

Dask labels should match numpy labels for any chunking. Two cells in the same connected sink should always share a label, regardless of how the raster is partitioned.
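One way to state this expectation as a check is sketched below. It is not part of xrspatial: `same_grouping` is a hypothetical helper name, and it assumes sink rasters are float arrays with NaN marking non-sink cells, as in the reproducer output. It tests the weaker property (the same cells are grouped together, whatever the label numbers); strict equality with the numpy backend would additionally need the values themselves to match.

```python
import numpy as np


def same_grouping(a, b):
    """Return True if two label rasters group the same cells together.

    Hypothetical helper: NaN is assumed to mean "not a sink". The label
    numbers may differ between a and b, but the mapping between them
    must be one-to-one.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        return False
    nan_a, nan_b = np.isnan(a), np.isnan(b)
    # Non-sink cells must sit in the same positions in both rasters.
    if not np.array_equal(nan_a, nan_b):
        return False
    # Collect every (label in a, label in b) pair seen at the same cell.
    pairs = set(zip(a[~nan_a].tolist(), b[~nan_b].tolist()))
    # One-to-one: no label in a maps to two labels in b, and vice versa.
    labels_a = {p[0] for p in pairs}
    labels_b = {p[1] for p in pairs}
    return len(pairs) == len(labels_a) == len(labels_b)


nan = np.nan
numpy_out = [[nan, 2.0, 2.0, nan]]   # one sink, one label
dask_out = [[nan, 2.0, 3.0, nan]]    # same sink split into two labels
split_detected = not same_grouping(numpy_out, dask_out)
```

With the two outputs from the reproducer, `split_detected` is True, because label 2 in the numpy result corresponds to both 2 and 3 in the dask result.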

Test gap

The dask tests in xrspatial/hydro/tests/test_sink_d8.py only cover sinks that fit inside a single tile (test_dask_isolated_sinks) or check that NaN positions match across backends (test_dask_nan_positions). Nothing exercises a sink whose connected component crosses a tile boundary, which is why CI missed this.

Fix sketch

Run a union-find pass after per-tile CCL: walk each shared tile edge, record an equivalence whenever two adjacent cells are both sinks, then remap labels to their roots. xrspatial/sieve.py's _label_connected already uses union-find for in-tile CCL, so the pattern is familiar in this codebase. The dask streaming behavior stays intact.
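A minimal sketch of that merge step, in plain numpy on an already-labelled raster. It is not the existing `_run_dask_numpy` code: the function names, the `row_edges` / `col_edges` parameters, and the choice to compare only directly facing cells across an edge (4-connectivity) are all assumptions made for illustration. If the per-tile CCL is 8-connected, the diagonal neighbours across each edge would need to be compared as well.

```python
import numpy as np


def find(parent, x):
    """Return the root label of x, compressing the path on the way."""
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def union(parent, a, b):
    """Record that labels a and b belong to the same sink."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        # Keep the smaller label as the root so the result is deterministic.
        parent[max(ra, rb)] = min(ra, rb)


def merge_across_boundaries(labels, row_edges, col_edges):
    """Merge per-tile labels that touch across tile edges.

    labels:    2D float array, NaN = not a sink, IDs unique across tiles.
    row_edges: index of the first row of every tile after the first.
    col_edges: index of the first column of every tile after the first.
    (Hypothetical signature, for illustration only.)
    """
    valid = ~np.isnan(labels)
    parent = {float(v): float(v) for v in np.unique(labels[valid])}

    def link(side_a, side_b):
        # Two facing cells that are both sinks must share a label.
        both = ~np.isnan(side_a) & ~np.isnan(side_b)
        for u, v in zip(side_a[both], side_b[both]):
            union(parent, float(u), float(v))

    for c in col_edges:
        link(labels[:, c - 1], labels[:, c])
    for r in row_edges:
        link(labels[r - 1, :], labels[r, :])

    # Remap every label to its root.
    out = labels.copy()
    for v in parent:
        root = find(parent, v)
        if root != v:
            out[labels == v] = root
    return out


# The dask output from the reproducer: chunks (1, 2) put an edge at column 2.
per_tile = np.array([[np.nan, 2.0, 3.0, np.nan]])
merged = merge_across_boundaries(per_tile, row_edges=[], col_edges=[2])
# Cells (0,1) and (0,2) now both carry label 2.0.
```

Only the one-cell-wide strips on either side of each edge are needed to build the equivalence table, so in a dask setting the full raster would not have to be materialised for this step; how that is wired into the graph is left open here.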

Categories: 5 (Backend Inconsistency).
