Merge sink_d8 labels across dask tile boundaries (#1394) #1395

Merged: brendancol merged 1 commit into main from issue-1394 on Apr 30, 2026

@brendancol
Contributor

Fixes #1394.

Summary

  • _run_dask_numpy in xrspatial/hydro/sink_d8.py ran connected-component labeling (CCL) per tile with globally unique IDs but never merged equivalent labels across tile boundaries, so a single connected sink that spanned a chunk boundary showed up as several separate sinks.
  • Added a union-find pass that walks every interior tile edge (4-connected) plus both diagonals (so 8-connectivity is preserved across corner-shared tiles), records label equivalences, and remaps labels to their roots via a second map_blocks pass.
  • The dask+cupy path delegates to the numpy dask path, so it inherits the fix.
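
The boundary merge described above can be sketched in plain NumPy. This is a minimal illustration, not the code in sink_d8.py: the names `find`, `union`, and `merge_tile_labels` are hypothetical, the background value is assumed to be 0, and the sketch assumes that any two non-background cells touching across a seam belong to the same sink. For each seam it compares every cell on one side with its three nearest neighbours on the other side, which covers the straight edges and both diagonals, including the corners where four tiles meet.

```python
import numpy as np


def find(parent, x):
    """Return the root label of x, compressing the path as we go."""
    root = x
    while parent.setdefault(root, root) != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def union(parent, a, b):
    """Merge the sets containing a and b; the smaller root becomes the root."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)


def merge_tile_labels(labels, chunk_rows, chunk_cols, background=0):
    """Merge per-tile labels that touch across tile seams (8-connectivity).

    `labels` is a materialized 2D integer array whose per-tile labels are
    already globally unique. `background` (assumed 0 here) marks non-sink cells.
    """
    parent = {}
    n_rows, n_cols = labels.shape

    def link(a, b):
        if a != background and b != background and a != b:
            union(parent, int(a), int(b))

    # Horizontal seams: row r - 1 is in the upper tile, row r in the lower one.
    for r in range(chunk_rows, n_rows, chunk_rows):
        for c in range(n_cols):
            for dc in (-1, 0, 1):  # straight neighbour plus both diagonals
                if 0 <= c + dc < n_cols:
                    link(labels[r - 1, c], labels[r, c + dc])

    # Vertical seams: column c - 1 is in the left tile, column c in the right one.
    for c in range(chunk_cols, n_cols, chunk_cols):
        for r in range(n_rows):
            for dr in (-1, 0, 1):
                if 0 <= r + dr < n_rows:
                    link(labels[r, c - 1], labels[r + dr, c])

    # Remap every label that took part in a merge to its root.
    out = labels.copy()
    for lab in list(parent):
        out[labels == lab] = find(parent, lab)
    return out
```

The sketch remaps with boolean masks on the full array for brevity; the PR instead applies the root lookup through a second map_blocks pass.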

Test plan

  • pytest xrspatial/hydro/tests/test_sink_d8.py — 40 passing (21 original + 19 new)
  • pytest xrspatial/hydro/tests/ — full hydro suite still passes (772 tests)
  • Original reproducer from #1394 ("sink_d8 dask backend splits connected sinks across tile boundaries") now matches the numpy result
  • Added regression tests for horizontal, vertical, diagonal, four-tile-block, and separation cases at multiple chunk shapes
  • Added a _label_count_matches_numpy test so the number of unique sink labels stays equal across backends

Notes

  • Cross-tile merging needs a global view of the labeled raster, so the implementation calls .compute() once on the per-tile result. The streaming benefit of dask is preserved during the per-tile CCL phase; only the boundary scan and label remap require the materialized array. CCL is fundamentally a global operation, so this matches what xrspatial/sieve.py already does for its dask path.
  • Labels in the dask output are not byte-identical to the numpy output (the dask path uses position-based IDs from each tile, then merges, while the numpy path uses one position-based ID across the whole raster). The new tests check label-partition equivalence (cells in the same numpy component are in the same dask component) rather than literal ID equality.
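
A label-partition comparison of the kind described in the last note might look like the sketch below. The helper name `same_partition` is hypothetical and is not taken from test_sink_d8.py. It assumes an integer background sentinel (0 here); a NaN background would need an `np.isnan` mask instead. It checks both directions at once: every label in one array must pair with exactly one label in the other, which also implies equal label counts.

```python
import numpy as np


def same_partition(a, b, background=0):
    """Return True if a and b group the same cells together, ignoring label IDs."""
    a = np.asarray(a)
    b = np.asarray(b)
    if a.shape != b.shape:
        return False

    mask = a != background
    # Both arrays must agree on which cells are labeled at all.
    if not np.array_equal(mask, b != background):
        return False
    if not mask.any():
        return True

    # Distinct (label_in_a, label_in_b) pairs over all labeled cells.
    pairs = np.unique(np.stack([a[mask], b[mask]], axis=1), axis=0)

    # A one-to-one mapping means no label appears in more than one pair.
    n_a = len(np.unique(pairs[:, 0]))
    n_b = len(np.unique(pairs[:, 1]))
    return len(pairs) == n_a == n_b
```

Two arrays that differ only in the IDs they assign pass this check, while an array that splits or fuses a component fails it.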

The per-tile CCL in _run_dask_numpy assigned globally unique IDs but
never merged equivalent labels across tile boundaries, so a single
connected sink that spanned a chunk ended up as several separate
sinks. Add a union-find pass over boundary equivalences (4-connected
edges plus the two diagonals for 8-connectivity) and remap labels
to their roots. Cover horizontal, vertical, diagonal, four-tile
block, and separation cases in test_sink_d8.py.

@github-actions (bot) added the "performance" label (PR touches performance-sensitive code) on Apr 30, 2026
@brendancol merged commit 7653275 into main on Apr 30, 2026
11 checks passed
@brendancol deleted the issue-1394 branch on May 4, 2026 13:04